Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal; to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism, called "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performance compared to existing methods on few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
https://arxiv.org/abs/2404.16804
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
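The "squeeze-and-excitation" idea behind the FA block can be sketched in a few lines of NumPy: squeeze the sequence into a per-feature summary, pass it through a small bottleneck, and use sigmoid gates to re-weight each embedding feature. This is a generic SE-style gate, not the paper's exact block; the mean-pooling squeeze and the projections `w1`/`w2` are illustrative assumptions.

```python
import numpy as np

def feature_attention(x, w1, w2):
    """Squeeze-and-excitation style gating over embedding features.

    x:  (seq_len, d) token embeddings
    w1: (d, d_bottleneck) squeeze projection
    w2: (d_bottleneck, d) excitation projection
    """
    squeeze = x.mean(axis=0)                        # global summary per feature
    hidden = np.maximum(0.0, squeeze @ w1)          # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # per-feature gates in (0, 1)
    return x * gates                                # re-emphasise informative features
```

Because the gates lie in (0, 1), the block can only attenuate features, letting the network concentrate on the ones that matter for the final classification.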
https://arxiv.org/abs/2404.16776
The NIR-to-RGB spectral domain translation is a formidable task due to the inherent spectral mapping ambiguities within NIR inputs and RGB outputs. Thus, existing methods fail to reconcile the tension between maintaining texture detail fidelity and achieving diverse color variations. In this paper, we propose a Multi-scale HSV Color Feature Embedding Network (MCFNet) that decomposes the mapping process into three sub-tasks: NIR texture maintenance, coarse geometry reconstruction, and RGB color prediction. Accordingly, we propose three key modules, one for each sub-task: the Texture Preserving Block (TPB), the HSV Color Feature Embedding Module (HSV-CFEM), and the Geometry Reconstruction Module (GRM). These modules allow MCFNet to methodically tackle spectral translation through a series of escalating resolutions, progressively enriching images with color and texture fidelity in a scale-coherent fashion. The proposed MCFNet demonstrates substantial performance gains on the NIR image colorization task. Code is released at: this https URL.
https://arxiv.org/abs/2404.16685
Pre-trained large text-to-image (T2I) models steered with an appropriate text prompt have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on previously learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works into a Museum. When facing a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generation of images. It can minimize the learning biases caused by the content of images and address the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among past learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of model updates, regularizing the diffusion model from both the weight and feature aspects. Meanwhile, a unique token embedding corresponding to the new style is learned by a task-wise token learning module, which preserves historical knowledge from past styles within the limitation of the LoRA parameter quantity. As new user-provided styles come in, our MuseumMaker can capture the nuances of the new styles while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.
https://arxiv.org/abs/2404.16612
Large language models (LLMs) show early signs of artificial general intelligence but struggle with hallucinations. One promising solution to mitigate these hallucinations is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation. However, such a solution risks compromising privacy, as recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models. The significant advantage of LLMs over traditional pre-trained models may exacerbate these concerns. To this end, we investigate the effectiveness of reconstructing original knowledge and predicting entity attributes from these embeddings when LLMs are employed. Empirical findings indicate that LLMs significantly improve the accuracy of two evaluated tasks over those from pre-trained models, regardless of whether the texts are in-distribution or out-of-distribution. This underscores a heightened potential for LLMs to jeopardize user privacy, highlighting the negative consequences of their widespread use. We further discuss preliminary strategies to mitigate this risk.
https://arxiv.org/abs/2404.16587
It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
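The probing recipe is straightforward to reproduce in outline: collect the token embeddings, center them, and project onto the axes of greatest variation. A minimal PCA-via-SVD sketch follows; `principal_components` is an illustrative helper of ours, not code from the paper, and the synthetic embeddings in the usage are stand-ins for ALBERT's number tokens.

```python
import numpy as np

def principal_components(embeddings, k=2):
    """Project embeddings onto their top-k principal axes (axes of greatest variation)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```

If the model really encodes approximate ordering along the leading axis, the first-component projections of the number-token embeddings should increase (or decrease, up to a sign flip) with the numbers they denote.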
https://arxiv.org/abs/2404.16574
A central question for cognitive science is to understand how humans process visual objects, i.e., to uncover the human low-dimensional concept representation space from high-dimensional visual stimuli. Generating visual stimuli with controllable concepts is the key. However, there are currently no generative models in AI that solve this problem. Here, we present the Concept based Controllable Generation (CoCoG) framework. CoCoG consists of two components: a simple yet efficient AI agent for extracting interpretable concepts and predicting human decision-making in visual similarity judgment tasks, and a conditional generation model for generating visual stimuli given the concepts. We quantify the performance of CoCoG in two aspects: human behavior prediction accuracy and controllable generation ability. The experiments with CoCoG indicate that 1) the reliable concept embeddings in CoCoG allow human behavior to be predicted with 64.07\% accuracy on the THINGS-similarity dataset; 2) CoCoG can generate diverse objects through the control of concepts; and 3) CoCoG can manipulate human similarity judgment behavior by intervening on key concepts. CoCoG offers visual objects with controllable concepts to advance our understanding of causality in human cognition. The code of CoCoG is available at \url{this https URL}.
https://arxiv.org/abs/2404.16482
Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. In order to address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud tokens so that token proximity can be efficiently computed and utilized, based on token indices, during target and context selection. The sequencer also allows the token-proximity computations to be shared between context and target selection, further improving efficiency. Experimentally, our method achieves results competitive with state-of-the-art methods while avoiding reconstruction in the input space and additional modalities.
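One standard way to get the proximity-by-index property a sequencer needs is a space-filling curve: sort token centres by Morton (Z-order) code, so tokens with nearby indices tend to be nearby in space. The sketch below is our illustration of that general idea; the paper's own ordering scheme may differ.

```python
import numpy as np

def morton_order(points, bits=10):
    """Order 3D token centres along a Morton (Z-order) curve.

    Returns an index permutation such that tokens adjacent in the ordering
    are usually close in space, so proximity can be read off the indices.
    """
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = ((1 << bits) - 1) / (maxs - mins + 1e-9)
    q = ((points - mins) * scale).astype(np.uint64)   # quantise to a bits-wide grid
    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (q[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)   # interleave the axis bits
    return np.argsort(codes)
```

For points along a single axis the Morton order degenerates to a plain sort along that axis, which is a handy sanity check.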
https://arxiv.org/abs/2404.16432
Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE), which learns to activate only a subset of parameters during training and inference, has been proposed as an energy-efficient path to even larger and more capable language models. This shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving the multilingual ability of the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
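The FFN-for-MoE substitution can be illustrated with a toy layer that routes each frame to its top-k experts and mixes their outputs by renormalised gate weights. This NumPy sketch shows only the routing arithmetic; the names (`MoELayer`, `w_gate`) are ours, and real systems add batched expert dispatch and, often, load-balancing terms the paper argues are unnecessary here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """A dense FFN swapped for a mixture of expert FFNs with top-k routing."""
    def __init__(self, d_model, d_ff, n_experts=4, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.w_gate = rng.normal(scale=0.02, size=(d_model, n_experts))
        self.w1 = rng.normal(scale=0.02, size=(n_experts, d_model, d_ff))
        self.w2 = rng.normal(scale=0.02, size=(n_experts, d_ff, d_model))
        self.k = k

    def __call__(self, x):
        gates = softmax(x @ self.w_gate)                   # (n_frames, n_experts)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):                        # route each frame independently
            top = np.argsort(gates[t])[-self.k:]           # indices of the top-k experts
            weights = gates[t, top] / gates[t, top].sum()  # renormalise over chosen experts
            for w, e in zip(weights, top):
                h = np.maximum(0.0, x[t] @ self.w1[e])     # expert FFN: ReLU(x W1) W2
                out[t] += w * (h @ self.w2[e])
        return out
```

Because only k of the experts run per frame, parameter count grows with the number of experts while per-frame compute stays near the dense FFN's, which is the mechanism behind the Dense-225M-level RTF at MoE-1B scale.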
https://arxiv.org/abs/2404.16407
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at this https URL.
https://arxiv.org/abs/2404.16365
The unique artistic style is crucial to artists' occupational competitiveness, yet prevailing Art Commission Platforms rarely support style-based retrieval. Meanwhile, the fast-growing generative AI techniques aggravate artists' concerns about releasing personal artworks to public platforms. To achieve artistic style-based retrieval without exposing personal artworks, we propose FedStyle, a style-based federated learning crowdsourcing framework. It allows artists to train local style models and share model parameters rather than artworks for collaboration. However, most artists possess a unique artistic style, resulting in severe model drift among them. FedStyle addresses such extreme data heterogeneity by having artists learn their abstract style representations and align with the server, rather than merely aggregating model parameters lacking semantics. Besides, we introduce contrastive learning to meticulously construct the style representation space, pulling artworks with similar styles closer and keeping different ones apart in the embedding space. Extensive experiments on the proposed datasets demonstrate the superiority of FedStyle.
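The contrastive construction of the style space can be sketched as a batch-level InfoNCE-style loss in which artworks sharing a style label are positives and everything else in the batch is a negative. This is a generic formulation under our own naming, not FedStyle's exact objective.

```python
import numpy as np

def style_contrastive_loss(z, labels, temperature=0.1):
    """InfoNCE-style loss: pull same-style artwork embeddings together,
    push different-style ones apart (cosine similarity space)."""
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    losses = []
    for i in range(len(labels)):
        logits = np.delete(sim[i], i)                     # drop self-similarity
        other_labels = np.delete(labels, i)
        m = logits.max()
        log_denom = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
        for logit in logits[other_labels == labels[i]]:
            losses.append(log_denom - logit)              # -log softmax(positive)
    return float(np.mean(losses))
```

Minimising this loss pulls same-style embeddings toward each other and keeps different styles apart, which is the geometry the framework relies on for style-based retrieval without sharing the artworks themselves.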
https://arxiv.org/abs/2404.16336
In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search. We jointly learn a unified query embedding coupled with pin and product embeddings, leading to an improvement of $>8\%$ relevance, $>7\%$ engagement, and $>5\%$ ads CTR in Pinterest's production search system. The main contributors to these gains are improved content understanding, better multi-task learning, and real-time serving. We enrich our entity representations using diverse text derived from image captions from a generative LLM, historical engagement, and user-curated boards. Our multitask learning setup produces a single search query embedding in the same space as pin and product embeddings and compatible with pre-existing pin and product embeddings. We show the value of each feature through ablation studies, and show the effectiveness of a unified model compared to standalone counterparts. Finally, we share how these embeddings have been deployed across the Pinterest search stack, from retrieval to ranking, scaling to serve $300k$ requests per second at low latency. Our implementation of this work is available at this https URL.
https://arxiv.org/abs/2404.16260
Modern face recognition systems utilize deep neural networks to extract salient features from a face. These features denote embeddings in latent space and are often stored as templates in a face recognition system. These embeddings are susceptible to data leakage and, in some cases, can even be used to reconstruct the original face image. To prevent compromising identities, template protection schemes are commonly employed. However, these schemes may still not prevent the leakage of soft biometric information such as age, gender and race. To alleviate this issue, we propose a novel technique that combines Fully Homomorphic Encryption (FHE) with an existing template protection scheme known as PolyProtect. We show that the embeddings can be compressed and encrypted using FHE and transformed into a secure PolyProtect template using polynomial transformation, for additional protection. We demonstrate the efficacy of the proposed approach through extensive experiments on multiple datasets. Our proposed approach ensures irreversibility and unlinkability, effectively preventing the leakage of soft biometric attributes from face embeddings without compromising recognition accuracy.
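The polynomial-transformation step can be sketched as follows: windows of the embedding are collapsed into single values via a polynomial with user-specific secret coefficients and exponents. This is a simplified illustration, not the scheme's specification; real PolyProtect supports overlapping windows, and the FHE compression/encryption stage (which would wrap these operations in a homomorphic scheme) is omitted here.

```python
import numpy as np

def polyprotect(embedding, coeffs, exponents):
    """Map non-overlapping windows of the embedding to single values via a
    polynomial with user-specific secret coefficients and exponents."""
    m = len(coeffs)
    out = [float(np.sum(coeffs * embedding[s:s + m] ** exponents))
           for s in range(0, len(embedding) - m + 1, m)]
    return np.array(out)
```

Each protected value mixes several embedding elements nonlinearly, so the template is shorter than the embedding and, with secret per-user parameters, hard to invert or link across systems.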
https://arxiv.org/abs/2404.16255
Knowledge Graphs (KGs) are widely employed in artificial intelligence applications, such as question-answering and recommendation systems. However, KGs are frequently found to be incomplete. While much of the existing literature focuses on predicting missing nodes for given incomplete KG triples, there remains an opportunity to complete KGs by exploring relations between existing nodes, a task known as relation prediction. In this study, we propose a relation prediction model that harnesses both textual and structural information within KGs. Our approach integrates walk-based embeddings with language model embeddings to effectively represent nodes. We demonstrate that our model achieves competitive results in the relation prediction task when evaluated on a widely used dataset.
https://arxiv.org/abs/2404.16206
The advent of personalized content generation by LLMs presents a novel challenge: how to efficiently adapt text to meet individual preferences without the unsustainable demand of creating a unique model for each user. This study introduces an innovative online method that employs neural bandit algorithms to dynamically optimize soft instruction embeddings based on user feedback, enhancing the personalization of open-ended text generation by white-box LLMs. Through rigorous experimentation on various tasks, we demonstrate significant performance improvements over baseline strategies. NeuralTS, in particular, leads to substantial enhancements in personalized news headline generation, achieving up to a 62.9% improvement in terms of best ROUGE scores and up to 2.76% increase in LLM-agent evaluation against the baseline.
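The online select/update loop of a bandit over candidate soft-instruction embeddings can be illustrated with linear Thompson sampling. Note the paper's strongest results use NeuralTS, i.e., a neural reward model; this sketch substitutes a Bayesian linear model to keep the mechanics visible, and all names are our own.

```python
import numpy as np

class LinearThompsonSampler:
    """Bayesian linear bandit: sample a reward model from the posterior and
    pick the candidate instruction embedding it scores highest."""
    def __init__(self, dim, noise=1.0, seed=0):
        self.A = np.eye(dim)          # posterior precision (standard-normal prior)
        self.b = np.zeros(dim)
        self.noise = noise
        self.rng = np.random.default_rng(seed)

    def select(self, candidates):
        cov = np.linalg.inv(self.A)
        theta = self.rng.multivariate_normal(cov @ self.b, cov)  # posterior sample
        return int(np.argmax(candidates @ theta))

    def update(self, x, reward):
        # Bayesian linear-regression update with the chosen embedding x
        # and the observed user-feedback reward.
        self.A += np.outer(x, x) / self.noise
        self.b += reward * x / self.noise
```

Each round, the sampler proposes an instruction embedding, the generated text is scored by user feedback, and the posterior update steers future selections toward embeddings that personalize well.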
https://arxiv.org/abs/2404.16115
We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed of 120 FPS, surpassing previous benchmarks. Our code is made available at this https URL .
https://arxiv.org/abs/2404.16012
Graph Neural Network (GNN)-based fake news detectors apply various methods to construct graphs, aiming to learn distinctive news embeddings for classification. Since the construction details are unknown for attackers in a black-box scenario, it is unrealistic to conduct the classical adversarial attacks that require a specific adjacency matrix. In this paper, we propose the first general black-box adversarial attack framework, i.e., General Attack via Fake Social Interaction (GAFSI), against detectors based on different graph structures. Specifically, as sharing is an important social interaction for GNN-based fake news detectors to construct the graph, we simulate sharing behaviors to fool the detectors. Firstly, we propose a fraudster selection module to select engaged users leveraging local and global information. In addition, a post injection module guides the selected users to create shared relations by sending posts. The sharing records will be added to the social context, leading to a general attack against different detectors. Experimental results on empirical datasets demonstrate the effectiveness of GAFSI.
https://arxiv.org/abs/2404.15744
This report details the development and key achievements of our latest language model designed for custom large language models. The advancements introduced include a novel Online Data Scheduler that supports flexible training data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer to enhance stability and performance. Moreover, our robust training framework incorporates advanced monitoring and rapid recovery features to ensure optimal efficiency. Our Wonton 7B model has demonstrated competitive performance on a range of multilingual and English benchmarks. Future developments will prioritize narrowing the performance gap with more extensively trained models, thereby enhancing the model's real-world efficacy and adaptability. GitHub: \url{this https URL}
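Of the listed techniques, rotary positional embeddings are easy to illustrate: the feature pairs of each position's vector are rotated by position-dependent angles, so that inner products between rotated queries and keys depend only on their relative offset. A NumPy sketch, pairing features as first-half/second-half (one common convention):

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotary positional embeddings over a (n_positions, d) array:
    pair (x1[i], x2[i]) at position p is rotated by angle p * base^(-i/half)."""
    n_pos, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.outer(np.arange(n_pos), inv_freq)     # (n_pos, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because each pair is rotated, norms are preserved and a query at position m dotted with a key at position n depends only on n - m, the property that makes RoPE attractive for attention.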
https://arxiv.org/abs/2404.15702
Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: What would happen to the GNN's decision if we were to alter the code graph into alternative structures? Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.
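The "minimal perturbation that flips the prediction" idea can be sketched as a greedy loop over edge deletions against a black-box scorer. This toy version is our own simplification: it assumes the detector exposes a scalar score, that the original prediction is positive, and it makes no claim to find the globally minimal set.

```python
import numpy as np

def drop_edge(adj, edge):
    out = adj.copy()
    i, j = edge
    out[i, j] = out[j, i] = 0
    return out

def counterfactual_edges(adj, score, max_edits=10):
    """Greedily delete the edge that most lowers the detector's score until
    the prediction flips below the 0.5 boundary; the deleted edges form the
    counterfactual explanation."""
    current = adj.copy()
    removed = []
    for _ in range(max_edits):
        if score(current) < 0.5:                       # prediction flipped
            return removed
        candidates = list(zip(*np.nonzero(np.triu(current, k=1))))
        if not candidates:
            break
        best = min(candidates, key=lambda e: score(drop_edge(current, e)))
        current = drop_edge(current, best)
        removed.append(best)
    return removed if score(current) < 0.5 else None
```

The returned edges answer the what-if question directly: removing exactly these code-graph edges would change the detector's decision, pointing at plausible root causes of the flagged vulnerability.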
https://arxiv.org/abs/2404.15687
Generative pre-trained transformers (GPTs) are a type of large language model that is unusually adept at producing novel, coherent natural language. In this study, the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1 is examined. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The GPT models are prompted to re-write each function using a modified version of the localGPT framework and langchain, which provide word-embedding context from the full source code and header files to the model, resulting in over 130,000 function-re-write GPT output text blocks, approximately 40,000 of which could be parsed as C code and subsequently compiled. The generated code is analyzed for compilability, correctness of the algorithm, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants carry a high implementation security risk: they are correct for some test vectors but incorrect for others. Additionally, many function implementations were not correct with respect to the reference SHA-1 algorithm but produced hashes that have some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out-of-bounds accesses, use of uninitialised values, and compiler optimization instability. Compiler optimization settings and SHA-256 hash checksums of the compiled binaries are used to cluster implementations that are equivalent but may not have identical syntax; using this clustering, over 100,000 novel and correct versions of the SHA-1 codebase were generated in which each component C function of the reference implementation differs from the original code.
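The correctness grading can be sketched with Python's `hashlib`: run a candidate implementation over test vectors and classify it as correct, partially correct (the high-risk case the study highlights), or incorrect. The helper below is our illustration of that grading, not the study's actual harness.

```python
import hashlib

def classify_candidate(candidate, vectors):
    """Grade a candidate SHA-1 implementation against hashlib's reference.

    candidate: callable taking bytes and returning a hex digest string.
    'partially correct' means it agrees on some test vectors but not all,
    the silently dangerous failure mode for generated crypto code.
    """
    results = [candidate(v) == hashlib.sha1(v).hexdigest() for v in vectors]
    if all(results):
        return "correct"
    if any(results):
        return "partially correct"
    return "incorrect"
```

A partially correct variant passes a naive smoke test on short inputs yet corrupts longer ones, which is exactly why per-vector comparison against a trusted reference matters.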
https://arxiv.org/abs/2404.15681