Representation-based Siamese networks have gained popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer seamless integration with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
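As a rough illustration of the FA idea, the sketch below applies a squeeze-and-excitation gate over the embedding dimension of a token sequence. The layer sizes, reduction ratio `r`, and placement are assumptions for illustration, not the paper's exact configuration.

```python
# A minimal sketch of a squeeze-and-excitation style Feature Attention block,
# assuming inputs of shape (batch, seq_len, dim).
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    def __init__(self, dim: int, r: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(dim, dim // r),   # squeeze: bottleneck over features
            nn.ReLU(),
            nn.Linear(dim // r, dim),   # excite: back to feature dimension
            nn.Sigmoid(),               # per-feature gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pooling over the sequence gives one descriptor
        # per embedding feature ("squeeze").
        s = x.mean(dim=1)               # (batch, dim)
        gates = self.excite(s)          # (batch, dim)
        # Re-weight each feature channel of every token ("excitation").
        return x * gates.unsqueeze(1)

x = torch.randn(8, 32, 256)
print(FeatureAttention(256)(x).shape)  # torch.Size([8, 32, 256])
```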
https://arxiv.org/abs/2404.16776
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that fully considers intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency across facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a single vector based on the class label. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as on finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst-class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and into building transparent AI systems without compromising performance.
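A minimal sketch of the core idea: score each class by its best-matching inferred attribute vector rather than a single label vector. The attribute lists, the stand-in text encoder, and the max-aggregation are illustrative assumptions (the paper's aggregation trade-off is controlled by a hyperparameter).

```python
import torch
import torch.nn.functional as F

def encode_text(texts):
    # Stand-in for a real CLIP-style text encoder (hypothetical).
    return F.normalize(torch.randn(len(texts), 512), dim=-1)

classes = {
    "pear": ["a whole pear", "diced pear in a bowl", "a pear on a table"],
    "apple": ["a whole apple", "sliced apple", "an apple in a basket"],
}

def classify(image_feat):
    best = (float("-inf"), None, None)
    for name, attrs in classes.items():
        sims = encode_text(attrs) @ image_feat   # one score per attribute
        score, idx = sims.max(dim=0)
        if score.item() > best[0]:
            # The winning attribute doubles as a faithful explanation.
            best = (score.item(), name, attrs[idx])
    return best[1], best[2]

image_feat = F.normalize(torch.randn(512), dim=-1)
print(classify(image_feat))  # (predicted class, matched attribute)
```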
https://arxiv.org/abs/2404.16717
We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations between the draft and verification stages. We run experiments on different Llama model sizes across different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task.
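The sketch below illustrates the two training signals together: a per-layer dropout rate that grows with depth, and an early exit loss computed through one shared head at every layer. The linear dropout schedule, loss weighting, and toy model sizes are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

num_layers, dim, vocab = 6, 64, 100
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    for _ in range(num_layers)
)
shared_head = nn.Linear(dim, vocab)  # all layers share the same exit

def layer_dropout_rate(layer: int, p_max: float = 0.2) -> float:
    # Lower dropout rates for earlier layers, higher for later ones.
    return p_max * layer / (num_layers - 1)

def forward_with_early_exit_loss(x, targets):
    ce = nn.CrossEntropyLoss()
    total = 0.0
    for layer, block in enumerate(blocks):
        # Layer dropout: occasionally skip the whole block during training.
        if torch.rand(()) > layer_dropout_rate(layer):
            x = block(x)
        logits = shared_head(x)  # early exit loss at every layer
        total = total + ce(logits.flatten(0, 1), targets.flatten())
    return total / num_layers

x = torch.randn(2, 8, dim)
targets = torch.randint(0, vocab, (2, 8))
print(forward_with_early_exit_loss(x, targets))
```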
https://arxiv.org/abs/2404.16710
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as responsible for poor generalization between synthetic and real data. However, by using an image-feature-based L2 distillation loss, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with that of a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
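A minimal sketch of the feature-based L2 distillation objective: match the student's image embeddings to the frozen teacher's, instead of using a contrastive loss over synthetic image-text pairs. The encoders here are stand-in modules, and the normalization choice is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # frozen stand-in
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
for p in teacher.parameters():
    p.requires_grad_(False)

def l2_distill_loss(images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        t = F.normalize(teacher(images), dim=-1)  # target image embeddings
    s = F.normalize(student(images), dim=-1)
    # Plain L2 between (normalized) image features.
    return F.mse_loss(s, t)

print(l2_distill_loss(torch.randn(4, 3, 32, 32)))
```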
https://arxiv.org/abs/2404.16637
Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) it reduces the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference due to its smaller model scale and more efficient vision encoding. Our code and model are available at this https URL.
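A rough sketch of similarity-based vision token merging: the two most similar tokens are repeatedly averaged until a target count is reached. TinyChart's actual module merges tokens gradually across layers; this single-shot greedy loop (which picks pairs by the first batch item) only illustrates the core idea.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (batch, n, dim); merge until only `keep` tokens remain."""
    tokens = tokens.clone()
    while tokens.shape[1] > keep:
        t = F.normalize(tokens, dim=-1)
        sim = t @ t.transpose(1, 2)                  # cosine similarities
        sim.diagonal(dim1=1, dim2=2).fill_(-1.0)     # ignore self-pairs
        idx = sim[0].argmax()                        # most similar pair
        i, j = divmod(idx.item(), tokens.shape[1])
        merged = (tokens[:, i] + tokens[:, j]) / 2   # average the pair
        rest = [k for k in range(tokens.shape[1]) if k not in (i, j)]
        tokens = torch.cat([tokens[:, rest], merged.unsqueeze(1)], dim=1)
    return tokens

print(merge_tokens(torch.randn(1, 16, 8), keep=8).shape)  # (1, 8, 8)
```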
https://arxiv.org/abs/2404.16635
Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency (energy-latency cost), they can exhaust computational resources and harm the availability of a service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that the energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality-non-specific losses are proposed: a loss to delay the end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, improving diversity is important to encourage longer responses by increasing complexity, which inspires the following modality-specific losses. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
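A minimal sketch of the two modality-non-specific objectives: suppressing the EOS probability at every step and raising per-token entropy. Here `logits` stands in for the victim model's outputs over a generated sequence, and the combination weight is an assumption.

```python
import torch
import torch.nn.functional as F

def eos_delay_loss(logits: torch.Tensor, eos_id: int) -> torch.Tensor:
    # Minimize the probability assigned to EOS at every step.
    probs = logits.softmax(dim=-1)
    return probs[..., eos_id].mean()

def uncertainty_loss(logits: torch.Tensor) -> torch.Tensor:
    # Maximize entropy over each generated token (negated for minimization).
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)
    return -entropy.mean()

logits = torch.randn(1, 20, 32000, requires_grad=True)
loss = eos_delay_loss(logits, eos_id=2) + 0.1 * uncertainty_loss(logits)
loss.backward()  # gradients would drive the imperceptible input perturbation
print(loss.item())
```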
https://arxiv.org/abs/2404.16557
The recent Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using an MLP to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computationally optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations for each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 times, and save up to 76% of parameters, while maintaining competitive performance. The code is available at this https URL.
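A minimal sketch of the decoupled decoding: a high-dimensional latent MLP runs once per LR feature vector to produce modulations, and a small low-dimensional render MLP, modulated by them, evaluates any number of HR query coordinates cheaply. The dimensions and the scale-and-shift modulation form are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 2 * 16))
render_in = nn.Linear(2, 16)    # lifts a 2D query coordinate into render space
render_out = nn.Linear(16, 3)   # RGB output

def render(lr_feature: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    # Shared latent decoding: computed once per LR cell, reused for all queries.
    scale, shift = latent_mlp(lr_feature).chunk(2, dim=-1)
    h = render_in(coords)                    # (num_queries, 16)
    h = torch.relu(scale * h + shift)        # modulated low-dimensional MLP
    return render_out(h)

lr_feature = torch.randn(64)     # one LR feature vector
coords = torch.rand(4096, 2)     # arbitrary-resolution query coordinates
print(render(lr_feature, coords).shape)  # (4096, 3)
```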
https://arxiv.org/abs/2404.16451
Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to activate only a subset of parameters in training and inference, Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models feature complex designs such as routing frames via a supplementary embedding network, improving the multilingual ability of the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve a Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
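A minimal sketch of the "embarrassingly simple" substitution: a top-1 routed MoE layer that drops in where a Conformer/Transformer FFN would sit, with no auxiliary balancing loss. The sizes, number of experts, and top-1 routing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])            # (tokens, dim)
        weights = self.router(flat).softmax(dim=-1)
        top_w, top_i = weights.max(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(flat[mask])
        return out.reshape_as(x)

print(MoEFFN(64, 256)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```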
https://arxiv.org/abs/2404.16407
In the realm of software development, testing is crucial for ensuring software quality and adherence to requirements. However, it can be time-consuming and resource-intensive, especially when dealing with large and complex software systems. Test case prioritization (TCP) is a vital strategy for enhancing testing efficiency by identifying the most critical test cases for early execution. This paper introduces a novel fuzzy logic-based approach to automate TCP, using fuzzy linguistic variables and expert-derived fuzzy rules to establish a link between test case characteristics and their prioritization. Our methodology utilizes two fuzzy variables, failure rate and execution time, alongside two crisp parameters: Prerequisite Test Case and Recently Updated Flag. Our findings demonstrate the proposed system's capacity to rank test cases effectively through experimental validation on a real-world software system. The results affirm the practical applicability of our approach in optimizing TCP and reducing the resource intensity of software testing.
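A minimal sketch of such a fuzzy prioritizer, assuming simple triangular-style membership functions for the two fuzzy variables and a tiny illustrative rule base; the memberships, rules, defuzzification, and crisp-parameter boosts are stand-ins, not the paper's exact design.

```python
def high(x: float) -> float:
    return max(0.0, min(1.0, (x - 0.3) / 0.4))   # membership in "high"

def low(x: float) -> float:
    return 1.0 - high(x)

def priority(failure_rate: float, exec_time: float,
             is_prerequisite: bool, recently_updated: bool) -> float:
    # Expert rules: high failure rate and low execution time raise priority.
    rules = [
        (min(high(failure_rate), low(exec_time)), 1.0),   # -> very high
        (min(high(failure_rate), high(exec_time)), 0.7),  # -> high
        (min(low(failure_rate), low(exec_time)), 0.4),    # -> medium
        (min(low(failure_rate), high(exec_time)), 0.1),   # -> low
    ]
    # Weighted-average defuzzification of the fired rules.
    score = sum(w * v for w, v in rules) / (sum(w for w, _ in rules) or 1.0)
    # Crisp parameters act as hard boosts.
    if is_prerequisite:
        score += 0.2
    if recently_updated:
        score += 0.1
    return score

tests = [("t1", 0.8, 0.2, False, True), ("t2", 0.1, 0.9, True, False)]
ranked = sorted(tests, key=lambda t: priority(*t[1:]), reverse=True)
print([name for name, *_ in ranked])
```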
https://arxiv.org/abs/2404.16395
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance of GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at this https URL.
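A toy sketch of constructing a "list items one by one" training sample, assuming the alphanumeric tags have already been drawn on the image and the object under each tag is known; the field names and prompt wording are hypothetical, not the paper's data schema.

```python
tags = {1: "a red mug on the desk", 2: "a laptop", 3: "a potted plant"}

def build_som_sample(image_path: str, tags: dict) -> dict:
    # Ask the model to enumerate tags in alphanumeric order.
    prompt = "List the items in the image one by one, following the tag order."
    answer = "\n".join(f"{tag_id}: {desc}" for tag_id, desc in sorted(tags.items()))
    return {"image": image_path, "conversation": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]}

print(build_som_sample("kitchen.jpg", tags))
```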
https://arxiv.org/abs/2404.16375
Ensuring the safety alignment of Large Language Models (LLMs) is crucial for generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to "jailbreaking" attacks, where carefully crafted prompts elicit them to produce toxic content. One category of jailbreak attacks reformulates the task as an adversarial attack by eliciting the LLM to generate an affirmative response. However, the typical attack in this category, GCG, has a very limited attack success rate. In this study, to better study jailbreak attacks, we introduce the DSN (Don't Say No) attack, which prompts LLMs not only to generate affirmative responses but also, via a novel augmented objective, to suppress refusals. Another challenge in jailbreak attacks lies in evaluation, as it is difficult to directly and accurately assess the harmfulness of an attack. Existing evaluations such as refusal keyword matching have their own limitations, yielding numerous false positive and false negative instances. To overcome this challenge, we propose an ensemble evaluation pipeline incorporating Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potency of DSN and the effectiveness of ensemble evaluation compared to baseline methods.
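A minimal sketch of a DSN-style objective on top of a GCG-style affirmative target: alongside maximizing the likelihood of an affirmative prefix, it penalizes the likelihood of refusal tokens. The token ids, the positions at which refusals are suppressed, and the mixing weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dsn_loss(logits, affirm_ids, refusal_ids, beta=0.5):
    """logits: (seq, vocab) over the model's response positions."""
    logp = F.log_softmax(logits, dim=-1)
    steps = torch.arange(len(affirm_ids))
    affirm = -logp[steps, affirm_ids].mean()   # elicit the affirmative prefix
    refuse = logp[steps, refusal_ids].mean()   # suppress refusal tokens
    return affirm + beta * refuse

logits = torch.randn(5, 32000, requires_grad=True)
loss = dsn_loss(logits, torch.tensor([1, 2, 3, 4, 5]),
                torch.tensor([9, 8, 7, 6, 5]))
loss.backward()  # gradients would guide the adversarial prompt search
print(loss.item())
```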
https://arxiv.org/abs/2404.16369
Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities, in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples, used as the weight of the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, it improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
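A minimal sketch of the training-free path: build a feature cache from confident unlabeled samples, then fuse similarity-weighted cached pseudo-labels with the ordinary zero-shot prediction through a residual connection. Sample selection is simplified to confidence only (the prototype score and semantic-level part of MSM are omitted), and the threshold and fusion weight `alpha` are assumptions.

```python
import torch
import torch.nn.functional as F

def build_cache(features, zero_shot_logits, conf_thresh=0.5):
    # Keep only samples whose pseudo-label confidence is high.
    probs = zero_shot_logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    keep = conf > conf_thresh
    return features[keep], F.one_hot(pseudo[keep], probs.shape[-1]).float()

def predict(test_feat, cache_feats, cache_labels, zero_shot_logits, alpha=1.0):
    # Feature-level similarity weights the cached (pseudo-)labels.
    sim = F.normalize(test_feat, dim=-1) @ F.normalize(cache_feats, dim=-1).T
    cache_logits = sim.softmax(dim=-1) @ cache_labels
    # Residual connection onto the ordinary zero-shot prediction.
    return zero_shot_logits.softmax(dim=-1) + alpha * cache_logits

feats = torch.randn(100, 512)        # stand-in unlabeled image features
zs = torch.randn(100, 10)            # stand-in zero-shot logits
cf, cl = build_cache(feats, zs)
print(predict(torch.randn(5, 512), cf, cl, torch.randn(5, 10)).shape)
```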
https://arxiv.org/abs/2404.16339
Besides humans and machines, Artificial Intelligence (AI) models have emerged as another important audience of programming languages as we enter the era of large language models (LLMs). LLMs can now excel at coding competitions and even program like developers to address various tasks, such as math calculation. Yet, the grammar and layout of existing programs are designed for humans. In particular, abundant grammar tokens and formatting tokens are included to make code more readable to humans. While beneficial, such a human-centric design imposes an unnecessary computational burden on LLMs, where each token, whether consumed or generated, consumes computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which aims to represent code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formats and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain Abstract Syntax Tree (AST) structures identical to those of standard Python, allowing execution via a modified AST parser. In addition, we explore methods to enable existing LLMs to proficiently understand and use SimPy while keeping the changes imperceptible to human developers. Compared with the original Python, SimPy not only reduces token usage by 13.5% and 10.4% for CodeLlama and GPT-4, respectively, but also achieves equivalent, even improved, performance over models trained on Python code.
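A toy illustration of the kind of savings an AI-oriented grammar targets: stripping comments and blank lines from Python source with the standard tokenize module, then re-emitting it with minimal spacing. SimPy itself goes much further (revising the grammar while preserving the AST); this only shows the direction of the compression.

```python
import io
import tokenize

def strip_formatting(source: str) -> str:
    drop = (tokenize.COMMENT, tokenize.NL)  # comments and blank-line tokens
    tokens = [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in drop
    ]
    # 2-tuples trigger untokenize's compact mode: minimal whitespace, same AST.
    return tokenize.untokenize(tokens)

code = "x = 1  # the answer\n\ny = x + 1\n"
print(strip_formatting(code))
```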
https://arxiv.org/abs/2404.16333
As one of the emerging challenges in Automated Machine Learning, Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, benchmarks have yet to be developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored pipeline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building upon this pipeline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assessed the CitySeg/MOP test suite with various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source code is available at this https URL.
https://arxiv.org/abs/2404.16266
Open-source simulation tools play a crucial role in helping neuromorphic application engineers and hardware architects investigate performance bottlenecks and explore design optimizations before committing to silicon. Reconfigurable Architecture for Neuromorphic Computing (RANC) is one such tool, offering the ability to execute pre-trained Spiking Neural Network (SNN) models within a unified ecosystem through both software-based simulation and FPGA-based emulation. With its flexible and highly parameterized design, RANC has been utilized by the community to study implementation bottlenecks, tune architectural parameters, modify neuron behavior based on application insights, and study the trade space between hardware performance and network accuracy. In designing architectures for neuromorphic computing, there is an incredibly large number of configuration parameters, such as the number and precision of weights per neuron, neuron and axon counts per core, network topology, and neuron behavior. To accelerate such studies and provide users with streamlined, productive design space exploration, in this paper we introduce a GPU-based implementation of RANC. We summarize our parallelization approach and quantify the speedups achieved with GPU-based tick-accurate simulations across various use cases. We demonstrate up to 780x speedup over the serial version of the RANC simulator on a 512-core neuromorphic MNIST inference application. We believe that the RANC ecosystem now provides a much more feasible avenue for exploring different optimizations to accelerate SNNs and for performing richer studies, by enabling rapid convergence to optimized neuromorphic architectures.
https://arxiv.org/abs/2404.16208
Human matting is a foundational task in image and video processing, where human foreground pixels are extracted from the input. Prior works either improve accuracy with additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory and latency. While keeping inference costs constant in the multi-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. Alongside higher-quality image and video matting benchmarks, a novel multi-instance synthesis approach based on publicly available sources is introduced to increase the generalization of models in real-world scenarios.
https://arxiv.org/abs/2404.16035
The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
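A minimal sketch of the inference-time ensembling: each expert's logits are weighted by how close the task's class-name embeddings (the metadata) are to that expert's cluster center. The cluster centers, embeddings, and temperature are stand-ins for the real trained quantities.

```python
import torch
import torch.nn.functional as F

num_experts, dim, num_classes = 4, 512, 10
cluster_centers = F.normalize(torch.randn(num_experts, dim), dim=-1)
class_name_embs = F.normalize(torch.randn(num_classes, dim), dim=-1)  # task metadata

def ensemble_logits(per_expert_logits: torch.Tensor, tau: float = 0.05):
    # Weight each expert by how close the task's classes are to its cluster.
    corr = (class_name_embs @ cluster_centers.T).mean(dim=0)  # (num_experts,)
    weights = (corr / tau).softmax(dim=-1)
    return (weights[:, None, None] * per_expert_logits).sum(dim=0)

expert_logits = torch.randn(num_experts, 8, num_classes)  # one set per expert
print(ensemble_logits(expert_logits).shape)               # (8, 10)
```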
https://arxiv.org/abs/2404.16030
Magnetohydrodynamics (MHD) plays a pivotal role in describing the dynamics of plasma and conductive fluids, essential for understanding phenomena such as the structure and evolution of stars and galaxies, and for modeling plasma motion in nuclear fusion through the ideal MHD equations. Solving these hyperbolic PDEs requires sophisticated numerical methods, which present computational challenges due to complex structures and high costs. Recent advances introduce neural operators such as the Fourier Neural Operator (FNO) as surrogate models for traditional numerical analysis. This study explores a modified Flux Fourier Neural Operator model that approximates the numerical flux of ideal MHD, offering a novel approach that outperforms existing neural operator models by enabling continuous inference, generalization outside sampled distributions, and faster computation compared to classical numerical schemes.
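For context, a minimal sketch of the 1D spectral convolution at the heart of an FNO-style layer: transform to frequency space, keep and mix only the lowest `modes` frequencies with a learned complex weight, and transform back. The channel counts and mode truncation are illustrative, and this is the generic FNO building block rather than the paper's specific flux model.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, grid)
        x_ft = torch.fft.rfft(x)                       # to frequency space
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, : self.modes] = torch.einsum(
            "bix,iox->box", x_ft[:, :, : self.modes], self.weight
        )                                              # mix low modes only
        return torch.fft.irfft(out_ft, n=x.shape[-1])  # back to grid space

print(SpectralConv1d(8, modes=12)(torch.randn(4, 8, 64)).shape)  # (4, 8, 64)
```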
https://arxiv.org/abs/2404.16015
Large Language Models (LLMs), despite their impressive performance on a wide range of tasks, require significant GPU memory and consume substantial computational resources. In addition to model weights, the memory occupied by the KV cache increases linearly with sequence length, becoming a main bottleneck for inference. In this paper, we introduce a novel approach for optimizing the KV cache that significantly reduces its memory footprint. Through a comprehensive investigation, we find that on LLaMA2-series models, (i) the similarity between adjacent tokens' query vectors is remarkably high, and (ii) the current query's attention calculation can rely solely on the attention information of a small portion of the preceding queries. Based on these observations, we propose CORM, a KV cache eviction policy that dynamically retains important key-value pairs for inference without finetuning the model. We validate that CORM reduces the inference memory usage of the KV cache by up to 70% without noticeable performance degradation across six tasks in LongBench.
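A minimal sketch of a CORM-style eviction decision: a cached key-value pair is kept while it has received meaningful attention from any of the recent queries, and evicted otherwise. The window size and attention threshold are illustrative hyperparameters, and the random attention matrix stands in for a real model's scores.

```python
import torch

def corm_keep_mask(attn: torch.Tensor, window: int = 8, thresh: float = 0.03):
    """attn: (num_queries, num_keys) attention weights for one head."""
    recent = attn[-window:]              # only the most recent queries matter
    # Keep a key if some recent query attends to it above the threshold.
    return (recent > thresh).any(dim=0)  # (num_keys,) bool

attn = torch.rand(32, 32).softmax(dim=-1)  # stand-in attention weights
keep = corm_keep_mask(attn)
print(f"kept {int(keep.sum())} of {keep.numel()} cached key-value pairs")
```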
https://arxiv.org/abs/2404.15949