Large language models (LLMs) are rapidly emerging in Artificial Intelligence (AI) applications, especially in the fields of natural language processing and generative AI. Not limited to text generation applications, these models inherently possess the opportunity to leverage prompt engineering, where the inputs of such models can be appropriately structured to articulate a model's purpose explicitly. A prominent example of this is intent-based networking, an emerging approach for automating and maintaining network operations and management. This paper presents semantic routing to achieve enhanced performance in LLM-assisted intent-based management and orchestration of 5G core networks. This work establishes an end-to-end intent extraction framework and presents a diverse dataset of sample user intents accompanied by a thorough analysis of the effects of encoders and quantization on overall system performance. The results show that using a semantic router improves the accuracy and efficiency of the LLM deployment compared to stand-alone LLMs with prompting architectures.
https://arxiv.org/abs/2404.15869
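The semantic-routing step described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the hashing "encoder", route names, sample utterances, and threshold are all hypothetical stand-ins for the real sentence encoders and 5G intents studied in the paper.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy bag-of-words hashing embedding (a stand-in for a real sentence encoder)."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class SemanticRouter:
    """Match a user intent against pre-embedded sample utterances and pick the
    best route before any LLM is invoked; fall through when nothing matches."""
    def __init__(self, routes, threshold=0.3):
        # routes: {route_name: [sample utterances]}
        self.threshold = threshold
        self.index = [(name, embed(u)) for name, utts in routes.items() for u in utts]

    def route(self, query):
        q = embed(query)
        name, score = max(((n, cosine(q, v)) for n, v in self.index),
                          key=lambda t: t[1])
        return name if score >= self.threshold else None

router = SemanticRouter({
    "deploy_slice": ["create a new network slice", "deploy slice for video traffic"],
    "scale_upf": ["scale up the UPF", "add capacity to the user plane function"],
})
print(router.route("please deploy a slice for video traffic"))  # -> deploy_slice
```

The design point is that routing happens on cheap embedding similarity, so only in-scope intents ever reach the heavier LLM prompting pipeline.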
We present a novel approach to detecting noun abstraction within a large language model (LLM). Starting from a psychologically motivated set of noun pairs in taxonomic relationships, we instantiate surface patterns indicating hypernymy and analyze the attention matrices produced by BERT. We compare the results to two sets of counterfactuals and show that we can detect hypernymy in the abstraction mechanism, which cannot solely be related to the distributional similarity of noun pairs. Our findings are a first step towards the explainability of conceptual abstraction in LLMs.
https://arxiv.org/abs/2404.15848
Machines that can replicate human intelligence with type 2 reasoning capabilities should be able to reason at multiple levels of spatio-temporal abstractions and scales using internal world models. Devising formalisms to develop such internal world models, which accurately reflect the causal hierarchies inherent in the dynamics of the real world, is a critical research challenge in the domains of artificial intelligence and machine learning. This thesis identifies several limitations with the prevalent use of state space models (SSMs) as internal world models and proposes two new probabilistic formalisms, namely Hidden-Parameter SSMs and Multi-Time Scale SSMs, to address these drawbacks. The structure of graphical models in both formalisms facilitates scalable exact probabilistic inference using belief propagation, as well as end-to-end learning via backpropagation through time. This approach permits the development of scalable, adaptive hierarchical world models capable of representing nonstationary dynamics across multiple temporal abstractions and scales. Moreover, these probabilistic formalisms integrate the concept of uncertainty in world states, thus improving the system's capacity to emulate the stochastic nature of the real world and quantify the confidence in its predictions. The thesis also discusses how these formalisms align with related neuroscience literature on the Bayesian brain hypothesis and predictive processing. Our experiments on various real and simulated robots demonstrate that our formalisms can match, and in many cases exceed, the performance of contemporary transformer variants in making long-range future predictions. We conclude the thesis by reflecting on the limitations of our current models and suggesting directions for future research.
https://arxiv.org/abs/2404.16078
We introduce a constructive method applicable to a large number of description logics (DLs) for establishing the concept-based Beth definability property (CBP) based on sequent systems. Using the highly expressive DL RIQ as a case study, we introduce novel sequent calculi for RIQ-ontologies and show how certain interpolants can be computed from sequent calculus proofs, which permit the extraction of explicit definitions of implicitly definable concepts. To the best of our knowledge, this is the first sequent-based approach to computing interpolants and definitions within the context of DLs, as well as the first proof that RIQ enjoys the CBP. Moreover, due to the modularity of our sequent systems, our results hold for any restriction of RIQ, and are applicable to other DLs by suitable modifications.
https://arxiv.org/abs/2404.15840
Knowledge Graph Completion (KGC) has garnered massive research interest recently, and most existing methods are designed following a transductive setting where all entities are observed during training. Despite the great progress on the transductive KGC, these methods struggle to conduct reasoning on emerging KGs involving unseen entities. Thus, inductive KGC, which aims to deduce missing links among unseen entities, has become a new trend. Many existing studies transform inductive KGC as a graph classification problem by extracting enclosing subgraphs surrounding each candidate triple. Unfortunately, they still face certain challenges, such as the expensive time consumption caused by the repeat extraction of enclosing subgraphs, and the deficiency of entity-independent feature learning. To address these issues, we propose a global-local anchor representation (GLAR) learning method for inductive KGC. Unlike previous methods that utilize enclosing subgraphs, we extract a shared opening subgraph for all candidates and perform reasoning on it, enabling the model to perform reasoning more efficiently. Moreover, we design some transferable global and local anchors to learn rich entity-independent features for emerging entities. Finally, a global-local graph reasoning model is applied on the opening subgraph to rank all candidates. Extensive experiments show that our GLAR outperforms most existing state-of-the-art methods.
https://arxiv.org/abs/2404.15807
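The contrast the abstract draws between per-candidate enclosing subgraphs and one shared "opening" subgraph can be sketched with a plain breadth-first extraction. This is a hypothetical toy, not the paper's code: the adjacency dictionary, hop budget, and entity names are illustrative only.

```python
from collections import deque

def opening_subgraph(adj, head, hops=2):
    """Extract one k-hop subgraph around the head entity, reused for every
    candidate tail, instead of re-extracting an enclosing subgraph per triple."""
    seen = {head}
    frontier = deque([(head, 0)])
    edges = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in adj.get(node, ()):
            edges.add((node, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen, edges

# Tiny hypothetical KG fragment: "h" is the head of the query triple.
adj = {"h": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}
nodes, edges = opening_subgraph(adj, "h", hops=2)
# nodes within 2 hops of "h"; "e" (3 hops away) is excluded
```

Extracting this subgraph once and scoring all candidate tails inside it is what removes the repeated-extraction cost the abstract criticizes.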
Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.
https://arxiv.org/abs/2404.15790
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
https://arxiv.org/abs/2404.15785
Autonomous vehicles (AVs) heavily rely on LiDAR perception for environment understanding and navigation. LiDAR intensity provides valuable information about the reflected laser signals and plays a crucial role in enhancing the perception capabilities of AVs. However, accurately simulating LiDAR intensity remains a challenge due to the unavailability of material properties of the objects in the environment, and complex interactions between the laser beam and the environment. The proposed method aims to improve the accuracy of intensity simulation by incorporating physics-based modalities within the deep learning framework. One of the key entities that captures the interaction between the laser beam and the objects is the angle of incidence. In this work we demonstrate that the addition of the LiDAR incidence angle as a separate input to the deep neural networks significantly enhances the results. We present a comparative study between two prominent deep learning architectures: U-NET, a Convolutional Neural Network (CNN), and Pix2Pix, a Generative Adversarial Network (GAN). We implemented these two architectures for the intensity prediction task and used SemanticKITTI and VoxelScape datasets for experiments. The comparative analysis reveals that both architectures benefit from the incidence angle as an additional input. Moreover, the Pix2Pix architecture outperforms U-NET, especially when the incidence angle is incorporated.
https://arxiv.org/abs/2404.15774
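The incidence angle the abstract highlights is just the angle between the incoming ray direction and the surface normal, and can be sketched in a few lines. This is a generic geometric illustration under the assumption of unit vectors, not code from the paper:

```python
import numpy as np

def incidence_angle(ray_dirs, normals):
    """Angle (radians) between each incoming LiDAR ray and the surface normal.
    ray_dirs, normals: (N, 3) unit vectors. The absolute value makes the result
    independent of which way the normal happens to point."""
    cos_theta = np.abs(np.sum(ray_dirs * normals, axis=1))
    return np.arccos(np.clip(cos_theta, 0.0, 1.0))

# Two rays pointing straight down; one surface is flat, one is tilted 45 degrees.
rays = np.array([[0.0, 0.0, -1.0],
                 [0.0, 0.0, -1.0]])
normals = np.array([[0.0, 0.0, 1.0],
                    [0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]])
angles = incidence_angle(rays, normals)
# angles[0] is ~0 (head-on hit), angles[1] is ~pi/4
```

In the setting the abstract describes, this per-point angle would be stacked alongside the range image as an extra input channel to the U-NET or Pix2Pix network.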
The right to be forgotten (RTBF) seeks to safeguard individuals from the enduring effects of their historical actions by implementing machine-unlearning techniques. These techniques facilitate the deletion of previously acquired knowledge without requiring extensive model retraining. However, they often overlook a critical issue: bias in the unlearning process. This bias emerges from two main sources: (1) data-level bias, characterized by uneven data removal, and (2) algorithm-level bias, which leads to the contamination of the remaining dataset, thereby degrading model accuracy. In this work, we analyze the causal factors behind the unlearning process and mitigate biases at both the data and algorithmic levels. Specifically, we introduce an intervention-based approach, where the knowledge to forget is erased with a debiased dataset. Besides, we guide the forgetting procedure by leveraging counterfactual examples, as they maintain semantic data consistency without hurting performance on the remaining dataset. Experimental results demonstrate that our method outperforms existing machine unlearning baselines on evaluation metrics.
https://arxiv.org/abs/2404.15760
Graph Neural Network (GNN)-based fake news detectors apply various methods to construct graphs, aiming to learn distinctive news embeddings for classification. Since the construction details are unknown for attackers in a black-box scenario, it is unrealistic to conduct the classical adversarial attacks that require a specific adjacency matrix. In this paper, we propose the first general black-box adversarial attack framework, i.e., General Attack via Fake Social Interaction (GAFSI), against detectors based on different graph structures. Specifically, as sharing is an important social interaction for GNN-based fake news detectors to construct the graph, we simulate sharing behaviors to fool the detectors. Firstly, we propose a fraudster selection module to select engaged users leveraging local and global information. In addition, a post injection module guides the selected users to create shared relations by sending posts. The sharing records will be added to the social context, leading to a general attack against different detectors. Experimental results on empirical datasets demonstrate the effectiveness of GAFSI.
https://arxiv.org/abs/2404.15744
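The fraudster-selection idea, scoring controlled users by a local term plus a global term and picking the top-k to send sharing posts, can be sketched as below. The scoring function, user records, and field names here are hypothetical simplifications of the module described in the abstract:

```python
def select_fraudsters(users, target_followers, k=2):
    """Pick the k controlled accounts whose simulated shares would be most
    influential: a local score (audience overlap with the target news) plus a
    global score (overall account activity)."""
    def score(u):
        local = len(users[u]["followers"] & target_followers)   # local information
        return local + users[u]["activity"]                     # global information
    return sorted(users, key=score, reverse=True)[:k]

# Hypothetical controlled accounts and the target news item's audience.
users = {
    "u1": {"followers": {"a", "b", "c"}, "activity": 0.2},
    "u2": {"followers": {"x"}, "activity": 0.9},
    "u3": {"followers": {"a", "x"}, "activity": 0.1},
}
picked = select_fraudsters(users, target_followers={"a", "b"}, k=2)
# u1 (overlap 2) and u3 (overlap 1) outrank the active-but-irrelevant u2
```

The selected users' injected posts would then be appended to the social context, which is what makes the attack agnostic to how a particular detector builds its graph.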
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
https://arxiv.org/abs/2404.15721
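The read-out mechanism described above, a set of slots where each slot is produced by a single attention head over the token encodings, can be sketched with plain NumPy. Shapes and the random initialization are illustrative assumptions; the real SPARO module uses learned parameters inside CLIP/DINO backbones:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparo_readout(tokens, Wq, Wk, Wv):
    """Toy SPARO-style read-out: each slot attends separately over the token
    encodings with its own single attention head.
    tokens: (T, d); Wq: (S, d), one learned query per slot; Wk, Wv: (d, d)."""
    K = tokens @ Wk                              # (T, d) keys
    V = tokens @ Wv                              # (T, d) values
    scores = Wq @ K.T                            # (S, T) one score row per slot
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V                              # (S, d): separately-attended slots

T, d, S = 5, 8, 3  # tokens, embedding dim, number of concept slots
tokens = rng.normal(size=(T, d))
slots = sparo_readout(tokens,
                      rng.normal(size=(S, d)),
                      rng.normal(size=(d, d)),
                      rng.normal(size=(d, d)))
# slots.shape == (3, 8): one vector per slot
```

Because each slot row comes from its own attention distribution, individual slots can be inspected or intervened on independently, which is the ability the abstract exploits to select concepts and improve downstream performance.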
Skeleton-based action recognition has gained considerable traction thanks to its utilization of succinct and robust skeletal representations. Nonetheless, current methodologies often lean towards utilizing a solitary backbone to model skeleton modality, which can be limited by inherent flaws in the network backbone. To address this and fully leverage the complementary characteristics of various network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful modeling capabilities of Transformers for global information. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches utilize GCNs and Transformers to model both 2D and 3D skeletal modalities respectively. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset by outperforming most existing methods. Our code will be publicly available at: this https URL.
https://arxiv.org/abs/2404.15719
Recently, Segment Anything Model (SAM) shows exceptional performance in generating high-quality object masks and achieving zero-shot image segmentation. However, as a versatile vision model, SAM is primarily trained with large-scale natural light images. In underwater scenes, it exhibits substantial performance degradation due to light scattering and absorption. Meanwhile, the simplicity of the SAM's decoder might lead to the loss of fine-grained object details. To address the above issues, we propose a novel feature learning framework named MAS-SAM for marine animal segmentation, which involves integrating effective adapters into the SAM's encoder and constructing a pyramidal decoder. More specifically, we first build a new SAM's encoder with effective adapters for underwater scenes. Then, we introduce a Hypermap Extraction Module (HEM) to generate multi-scale features for a comprehensive guidance. Finally, we propose a Progressive Prediction Decoder (PPD) to aggregate the multi-scale features and predict the final segmentation results. When grafted with the Fusion Attention Module (FAM), our method enables the extraction of richer marine information, from global contextual cues down to fine-grained local details. Extensive experiments on four public MAS datasets demonstrate that our MAS-SAM can obtain better results than other typical segmentation methods. The source code is available at this https URL.
https://arxiv.org/abs/2404.15700
Cooperative Adaptive Cruise Control (CACC) represents a quintessential control strategy for orchestrating vehicular platoon movement within Connected and Automated Vehicle (CAV) systems, significantly enhancing traffic efficiency and reducing energy consumption. In recent years, the data-driven methods, such as reinforcement learning (RL), have been employed to address this task due to their significant advantages in terms of efficiency and flexibility. However, the delay issue, which often arises in real-world CACC systems, is rarely taken into account by current RL-based approaches. To tackle this problem, we propose a Delay-Aware Multi-Agent Reinforcement Learning (DAMARL) framework aimed at achieving safe and stable control for CACC. We model the entire decision-making process using a Multi-Agent Delay-Aware Markov Decision Process (MADA-MDP) and develop a centralized training with decentralized execution (CTDE) MARL framework for distributed control of CACC platoons. An attention mechanism-integrated policy network is introduced to enhance the performance of CAV communication and decision-making. Additionally, a velocity optimization model-based action filter is incorporated to further ensure the stability of the platoon. Experimental results across various delay conditions and platoon sizes demonstrate that our approach consistently outperforms baseline methods in terms of platoon safety, stability and overall performance.
https://arxiv.org/abs/2404.15696
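The core delay-aware MDP idea, letting actions take effect several steps late while exposing the queue of pending actions as part of the observation so the policy can compensate, can be sketched as a thin environment wrapper. The environment interface and the one-dimensional toy dynamics below are hypothetical, chosen only to make the mechanism concrete:

```python
from collections import deque

class DelayAwareWrapper:
    """Actions take effect `delay` steps after they are issued; the pending
    queue is appended to the observation (the delay-aware augmentation)."""
    def __init__(self, env, delay, noop):
        self.env, self.delay = env, delay
        self.pending = deque([noop] * delay)  # actions already issued, not yet applied

    def step(self, action):
        if self.delay == 0:
            return self.env.step(action)
        self.pending.append(action)       # queue the new action...
        applied = self.pending.popleft()  # ...and apply the oldest one
        obs, reward = self.env.step(applied)
        return (obs, tuple(self.pending)), reward  # augmented observation

class Line1D:
    """Minimal stand-in environment: position moves by the applied action."""
    def __init__(self):
        self.x = 0
    def step(self, a):
        self.x += a
        return self.x, -abs(self.x)

env = DelayAwareWrapper(Line1D(), delay=2, noop=0)
(obs, pending), _ = env.step(1)  # action 1 is queued; an initial no-op is applied
```

Without the `pending` component, the process would be non-Markovian from the policy's point of view; appending it restores the Markov property, which is what lets standard (multi-agent) RL machinery be applied to the delayed platoon-control setting.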
Vulnerability detection is crucial for ensuring the security and reliability of software systems. Recently, Graph Neural Networks (GNNs) have emerged as a prominent code embedding approach for vulnerability detection, owing to their ability to capture the underlying semantic structure of source code. However, GNNs face significant challenges in explainability due to their inherently black-box nature. To this end, several factual reasoning-based explainers have been proposed. These explainers provide explanations for the predictions made by GNNs by analyzing the key features that contribute to the outcomes. We argue that these factual reasoning-based explanations cannot answer critical what-if questions: What would happen to the GNN's decision if we were to alter the code graph into alternative structures? Inspired by advancements of counterfactual reasoning in artificial intelligence, we propose CFExplainer, a novel counterfactual explainer for GNN-based vulnerability detection. Unlike factual reasoning-based explainers, CFExplainer seeks the minimal perturbation to the input code graph that leads to a change in the prediction, thereby addressing the what-if questions for vulnerability detection. We term this perturbation a counterfactual explanation, which can pinpoint the root causes of the detected vulnerability and furnish valuable insights for developers to undertake appropriate actions for fixing the vulnerability. Extensive experiments on four GNN-based vulnerability detection models demonstrate the effectiveness of CFExplainer over existing state-of-the-art factual reasoning-based explainers.
https://arxiv.org/abs/2404.15687
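The counterfactual objective described above, the minimal perturbation to the code graph that flips the detector's prediction, can be sketched as a brute-force search over small edge-removal sets. The stub detector and the tiny graph are hypothetical; the actual CFExplainer optimizes this over a trained GNN rather than enumerating subsets:

```python
from itertools import combinations

def counterfactual_explanation(edges, predict, max_size=2):
    """Return the smallest set of edges whose removal changes the prediction,
    i.e. a counterfactual explanation. `predict` maps an edge list to 0/1."""
    original = predict(edges)
    for k in range(1, max_size + 1):                  # smallest sets first
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if predict(kept) != original:
                return list(removed)                  # minimal perturbation found
    return None

# Stub detector: flags "vulnerable" (1) iff any data-flow edge reaches `sink`.
def detector(edges):
    return int(any(dst == "sink" for _, dst in edges))

graph = [("src", "a"), ("a", "sink"), ("a", "b")]
cf = counterfactual_explanation(graph, detector)
# cf is [("a", "sink")]: removing that single edge flips the prediction
```

The returned edge set answers the what-if question directly: it names the part of the code graph that, if altered, would make the detector no longer report the vulnerability, which is what makes it actionable for a fix.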
Affordances, a concept rooted in ecological psychology and pioneered by James J. Gibson, have emerged as a fundamental framework for understanding the dynamic relationship between individuals and their environments. Expanding beyond traditional perceptual and cognitive paradigms, affordances represent the inherent effect and action possibilities that objects offer to the agents within a given context. As a theoretical lens, affordances bridge the gap between effect and action, providing a nuanced understanding of the connections between agents' actions on entities and the effect of these actions. In this study, we propose a model that unifies object, action and effect into a single latent representation in a common latent space that is shared between all affordances that we call the affordance space. Using this affordance space, our system is able to generate effect trajectories when action and object are given and is able to generate action trajectories when effect trajectories and objects are given. In the experiments, we showed that our model does not learn the behavior of each object but it learns the affordance relations shared by the objects that we call equivalences. In addition to simulated experiments, we showed that our model can be used for direct imitation in real world cases. We also propose affordances as a base for Cross Embodiment transfer to link the actions of different robots. Finally, we introduce selective loss as a solution that allows valid outputs to be generated for indeterministic model inputs.
https://arxiv.org/abs/2404.15648
Hazy images degrade visual quality, and dehazing is a crucial prerequisite for subsequent processing tasks. Most current dehazing methods rely on neural networks and face challenges such as high computational parameter pressure and weak generalization capabilities. This paper introduces PriorNet--a novel, lightweight, and highly applicable dehazing network designed to significantly improve the clarity and visual quality of hazy images while avoiding excessive detail extraction issues. The core of PriorNet is the original Multi-Dimensional Interactive Attention (MIA) mechanism, which effectively captures a wide range of haze characteristics, substantially reducing the computational load and generalization difficulties associated with complex systems. By utilizing a uniform convolutional kernel size and incorporating skip connections, we have streamlined the feature extraction process. Simplifying the number of layers and architecture not only enhances dehazing efficiency but also facilitates easier deployment on edge devices. Extensive testing across multiple datasets has demonstrated PriorNet's exceptional performance in dehazing and clarity restoration, maintaining image detail and color fidelity in single-image dehazing tasks. Notably, with a model size of just 18Kb, PriorNet showcases superior dehazing generalization capabilities compared to other methods. Our research makes a significant contribution to advancing image dehazing technology, providing new perspectives and tools for the field and related domains, particularly emphasizing the importance of improving universality and deployability.
https://arxiv.org/abs/2404.15638
Reinforcement learning (RL) with continuous state and action spaces remains one of the most challenging problems within the field. Most current learning methods focus on integral identities such as value functions to derive an optimal strategy for the learning agent. In this paper, we instead study the dual form of the original RL formulation to propose the first differential RL framework that can handle settings with limited training samples and short-length episodes. Our approach introduces Differential Policy Optimization (DPO), a pointwise and stage-wise iteration method that optimizes policies encoded by local-movement operators. We prove a pointwise convergence estimate for DPO and provide a regret bound comparable with current theoretical works. Such pointwise estimate ensures that the learned policy matches the optimal path uniformly across different steps. We then apply DPO to a class of practical RL problems which search for optimal configurations with Lagrangian rewards. DPO is easy to implement, scalable, and shows competitive results on benchmarking experiments against several popular RL methods.
https://arxiv.org/abs/2404.15617
In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable, often fall short when faced with the complexity and dynamism of modern business data. Conversely, Artificial Intelligence (AI) models, particularly Large Language Models (LLMs), offer significant potential in pattern recognition and predictive analytics but can lack the precision necessary for specific business applications. This paper explores the efficacy of hybrid approaches that integrate the robustness of rule-based systems with the adaptive power of LLMs in generating actionable business insights.
https://arxiv.org/abs/2404.15604
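The hybrid pattern sketched in the abstract, precise rules first and an adaptive LLM as fallback, can be illustrated in a few lines. The rules, report text, and `llm_analyze` stub are all hypothetical; a real system would replace the stub with an actual model call:

```python
import re

# Precise, auditable business rules tried first.
RULES = [
    (re.compile(r"revenue\s+(dropped|fell)"), "ALERT: revenue decline detected"),
    (re.compile(r"churn\s+rate\s+above\s+\d+%"), "ALERT: churn threshold breached"),
]

def llm_analyze(text):
    """Placeholder for an LLM call; a real system would send `text` to a model
    and return its free-form insight."""
    return f"LLM insight (stub) for: {text[:40]}"

def hybrid_insight(report):
    """Rule-based pass for known, high-precision patterns; fall back to the
    adaptive LLM for everything the rules don't cover."""
    for pattern, insight in RULES:
        if pattern.search(report):
            return insight
    return llm_analyze(report)

print(hybrid_insight("Q3 revenue dropped 12% vs Q2"))  # rule fires, no LLM call
```

The ordering is the point of the design: deterministic rules keep the answers reproducible and cheap where the business semantics are known exactly, while the LLM absorbs the long tail of queries the rules could never enumerate.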
Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction. ImplicitAVE, sourced from the MAVE dataset, is carefully curated and expanded to include implicit AVE and multimodality, resulting in a refined dataset of 68k training and 1.6k testing data across five domains. We also explore the application of multimodal large language models (MLLMs) to implicit AVE, establishing a comprehensive benchmark for MLLMs on the ImplicitAVE dataset. Six recent MLLMs with eleven variants are evaluated across diverse settings, revealing that implicit value extraction remains a challenging task for MLLMs. The contributions of this work include the development and release of ImplicitAVE, and the exploration and benchmarking of various MLLMs for implicit AVE, providing valuable insights and potential future research directions. Dataset and code are available at this https URL
https://arxiv.org/abs/2404.15592