Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.
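The paper's actual repository and API are not reproduced here; the following is a hypothetical sketch of what a consistent storage format and loading interface for offline MARL datasets could look like. All names (`MARLDataset`, `sample_batch`) and the `.npz` field layout are invented for illustration.

```python
# Minimal sketch of a standardised offline MARL dataset record and loader.
# The file layout and class names are assumptions, not the authors' API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Batch:
    observations: np.ndarray  # (batch, agents, obs_dim)
    actions: np.ndarray       # (batch, agents)
    rewards: np.ndarray       # (batch, agents)
    terminals: np.ndarray     # (batch,)

class MARLDataset:
    def __init__(self, path: str):
        data = np.load(path)  # assumes an .npz archive with the keys below
        self.obs = data["observations"]
        self.act = data["actions"]
        self.rew = data["rewards"]
        self.term = data["terminals"]

    def sample_batch(self, batch_size: int, rng=np.random) -> Batch:
        idx = rng.randint(0, len(self.obs), size=batch_size)
        return Batch(self.obs[idx], self.act[idx], self.rew[idx], self.term[idx])
```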
https://arxiv.org/abs/2409.12001
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms results in diminished development efficacy and impedes the establishment of trust. As a pioneering effort on this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes a module's feature maps against Ground Truth within a unified semantic Representation Space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, the Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
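As a rough illustration of the scoring step only, the sketch below computes a cosine-similarity score between a module's encoded feature map and the encoded ground truth. The encoder is a stand-in; the paper's two-stage Alignment AutoEncoder is not reproduced here.

```python
# Hedged sketch: score similarity between encoded feature maps and encoded
# ground truth in a shared representation space (encoder not shown).
import torch
import torch.nn.functional as F

def similarity_score(feat_enc: torch.Tensor, gt_enc: torch.Tensor) -> torch.Tensor:
    """feat_enc, gt_enc: (batch, dim) embeddings from a shared encoder."""
    return F.cosine_similarity(feat_enc, gt_enc, dim=-1).mean()

# toy usage with random embeddings standing in for encoded feature maps
feat = torch.randn(8, 256)
gt = torch.randn(8, 256)
print(similarity_score(feat, gt).item())
```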
https://arxiv.org/abs/2409.11969
Extract-then-Abstract is a naturally coherent paradigm to conduct abstractive summarization with the help of salient information identified by the extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salients to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs Extractive and Abstractive summarization tasks within a single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salients, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs achieves performance superior to the baselines on the extractive task and performs comparably to, or even better than, the vanilla models on the abstractive task.
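A minimal sketch of the highlight idea, under the assumption of single-head attention and illustrative tensor shapes: in cross-attention, the usual encoder padding mask is swapped for a saliency mask so the decoder can only attend to tokens picked by the extractor.

```python
# Parameter-free highlight sketch: mask non-salient source tokens in
# cross-attention. Shapes and the token indices are illustrative.
import torch
import torch.nn.functional as F

def cross_attention(q, k, v, saliency_mask):
    """q: (tgt_len, d); k, v: (src_len, d); saliency_mask: (src_len,) bool."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)              # (tgt_len, src_len)
    scores = scores.masked_fill(~saliency_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q, k, v = torch.randn(5, 64), torch.randn(10, 64), torch.randn(10, 64)
salient = torch.zeros(10, dtype=torch.bool)
salient[[1, 3, 4]] = True                                # tokens chosen by the extractor
out = cross_attention(q, k, v, salient)
```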
https://arxiv.org/abs/2409.11827
The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.
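In the spirit of the temporal part of the scheme, a toy augmentation is sketched below: rescaling event timestamps changes the apparent motion speed. The event layout (x, y, t, polarity) and the scale range are assumptions, and the paper's MSTI/SSEM/TSEM modules are more involved than this.

```python
# Illustrative temporal augmentation for event streams: stretch or compress
# the time axis to diversify apparent motion speed.
import numpy as np

def temporal_rescale(events: np.ndarray, scales=(0.5, 1.0, 2.0), rng=np.random) -> np.ndarray:
    """events: (N, 4) array of (x, y, t, polarity); returns a time-rescaled copy."""
    out = events.copy()
    out[:, 2] *= rng.choice(scales)   # rescale timestamps only
    return out

events = np.stack([np.random.randint(0, 128, 1000),
                   np.random.randint(0, 128, 1000),
                   np.sort(np.random.rand(1000)),
                   np.random.choice([-1, 1], 1000)], axis=1).astype(np.float32)
aug = temporal_rescale(events)
```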
https://arxiv.org/abs/2409.11813
In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU$^2$Net, a lightweight referring image segmentation model designed for fine-grained boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot's visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grained referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.
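To make the perception-to-control link concrete, here is a deliberately simple stand-in (not the paper's controller): a proportional image-based servo law that drives the centroid of a segmentation mask toward a target pixel. The gain and velocity mapping are assumptions.

```python
# Toy eye-in-hand servo sketch: turn a referring segmentation mask into a
# 2D image-plane velocity command via its centroid error.
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def servo_velocity(mask: np.ndarray, target_px: np.ndarray, gain: float = 0.01) -> np.ndarray:
    """Return a proportional image-plane velocity from the centroid error."""
    return -gain * (centroid(mask) - target_px)

mask = np.zeros((240, 320), dtype=bool)
mask[100:140, 200:260] = True                  # stand-in for a referring segmentation mask
v = servo_velocity(mask, target_px=np.array([160.0, 120.0]))
```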
https://arxiv.org/abs/2409.11518
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially for challenging tracking conditions such as object deformation and blurring. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent-frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at this https URL.
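As a loose stand-in for temporal embedding smoothing (not STCMOT's actual module), the sketch below keeps a per-track ReID embedding as an exponential moving average over adjacent frames; the momentum and unit-norm choice are assumptions.

```python
# EMA-style temporal smoothing of a track's ReID embedding across frames.
import numpy as np

def update_track_embedding(prev: np.ndarray, current: np.ndarray, momentum: float = 0.9) -> np.ndarray:
    emb = momentum * prev + (1.0 - momentum) * current
    return emb / (np.linalg.norm(emb) + 1e-8)    # keep unit norm for cosine matching

prev = np.random.randn(128); prev /= np.linalg.norm(prev)
cur = np.random.randn(128); cur /= np.linalg.norm(cur)
smoothed = update_track_embedding(prev, cur)
```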
https://arxiv.org/abs/2409.11234
The recent advancements in artificial intelligence (AI), with the release of several large models offering only query access, make a strong case for explaining deep models in a post-hoc, gradient-free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that attempts to generate a saliency-based explanation in a model-agnostic, gradient-free setting. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio), in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion-based measures, and subjective human evaluation) and benchmark it against 9 different methods. In these evaluations, DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.
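A toy sketch of the joint objective, under stated assumptions (tiny networks, a random stand-in for the query-only black box, an invented perturbation scale): a student mimics black-box outputs on locally perturbed inputs while a mask network produces the multiplier mask, and both are updated from the same distillation loss.

```python
# Hedged DAX-style joint optimisation sketch; all sizes are illustrative.
import torch
import torch.nn as nn

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
mask_net = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
opt = torch.optim.Adam(list(student.parameters()) + list(mask_net.parameters()), lr=1e-3)

def black_box(x):                      # stand-in for input-output access to the real model
    return torch.randn(x.shape[0], 10)

x = torch.rand(16, 3, 32, 32)
for _ in range(3):                     # a few toy steps
    x_pert = x + 0.05 * torch.randn_like(x)     # local perturbations
    masked = mask_net(x_pert) * x_pert          # multiplier mask highlights salient regions
    loss = nn.functional.mse_loss(student(masked), black_box(x_pert))
    opt.zero_grad(); loss.backward(); opt.step()
```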
https://arxiv.org/abs/2409.11123
The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques using a large amount of image data. The primary limitation to extending the existing SotA ISR works to real-world instances is their computational and time complexity. In this paper, contrary to the existing methods, we present a novel and computationally efficient ISR algorithm that is independent of any image dataset for learning the ISR task. The proposed algorithm reformulates the ISR task from generating the Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, exploiting the identity relation between the degradation and inverse degradation models. The proposed approach relies neither on an ISR dataset nor on a single input low-resolution (LR) image (unlike self-supervised methods such as ZSSR) to model the ISR task. Hence we term our model Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources, by at least an order of magnitude, and demonstrates competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework sidesteps retraining the model and remains the same for varying scale factors like X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
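The identity-learning idea can be sketched with no image data at all: learn an inverse kernel whose convolution with a known degradation kernel approximates a delta (identity) kernel. The kernel sizes, box-blur degradation, and optimiser settings below are illustrative assumptions, not the paper's configuration.

```python
# Minimal identity-learning sketch: degradation ∘ inverse ≈ identity kernel.
import torch
import torch.nn.functional as F

deg = torch.ones(1, 1, 5, 5) / 25.0                           # toy degradation kernel (box blur)
inv = torch.zeros(1, 1, 9, 9, requires_grad=True)             # inverse kernel to be learned
target = torch.zeros(1, 1, 13, 13); target[0, 0, 6, 6] = 1.0  # delta = identity kernel

opt = torch.optim.Adam([inv], lr=1e-2)
for _ in range(500):
    composed = F.conv2d(inv, deg, padding=4)                  # full composition of the two kernels
    loss = F.mse_loss(composed, target)
    opt.zero_grad(); loss.backward(); opt.step()
```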
https://arxiv.org/abs/2409.12165
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to adaptively identify key frames and prune less informative tokens, effectively mitigating hallucinations and increasing inference throughput without compromising performance. We conduct comprehensive experiments on the DRAMA and LingoQA benchmarks, demonstrating the effectiveness of VTS in achieving up to a 33% improvement in inference throughput and a 28% reduction in memory usage compared to the baseline without compromising performance.
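For intuition only, a heuristic sketch of redundancy-based token pruning is given below: drop tokens of the current frame that are nearly identical to the co-located tokens of the previous frame. The threshold and keep policy are assumptions; VTS itself uses a learned CNN proposal model rather than this rule.

```python
# Heuristic frame-redundancy token pruning sketch (not the VTS model).
import torch
import torch.nn.functional as F

def prune_redundant_tokens(prev_tokens, cur_tokens, threshold=0.95):
    """prev_tokens, cur_tokens: (num_tokens, dim). Returns kept tokens and their indices."""
    sim = F.cosine_similarity(prev_tokens, cur_tokens, dim=-1)   # per-token similarity
    keep = sim < threshold                                       # keep only tokens that changed
    return cur_tokens[keep], keep.nonzero(as_tuple=True)[0]

prev = torch.randn(196, 768)
cur = prev.clone(); cur[:20] += torch.randn(20, 768)             # only 20 tokens actually change
kept, idx = prune_redundant_tokens(prev, cur)
```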
https://arxiv.org/abs/2409.11182
This paper investigates gender bias in Large Language Model (LLM)-generated teacher evaluations in a higher education setting, focusing on evaluations produced by GPT-4 across six academic subjects. By applying a comprehensive analytical framework that includes Odds Ratio (OR) analysis, the Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis, this paper identifies patterns of gender-associated language reflecting societal stereotypes. Specifically, words related to approachability and support were used more frequently for female instructors, while words related to entertainment were predominantly used for male instructors, aligning with the concepts of communal and agentic behaviors. The study also found moderate to strong associations between male-salient adjectives and male names, though career and family words did not distinctly capture gender biases. These findings align with prior research on societal norms and stereotypes, reinforcing the notion that LLM-generated text reflects existing biases.
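To make the odds-ratio part of the framework concrete, here is a minimal sketch of how much more likely a word is to appear in evaluations of female instructors than of male instructors; the example counts are invented purely for illustration.

```python
# Odds-ratio sketch for word usage by instructor gender.
def odds_ratio(count_f: int, total_f: int, count_m: int, total_m: int) -> float:
    """OR = [count_f / (total_f - count_f)] / [count_m / (total_m - count_m)]."""
    odds_f = count_f / (total_f - count_f)
    odds_m = count_m / (total_m - count_m)
    return odds_f / odds_m

# e.g. a word appearing in 40 of 500 female-instructor texts vs 15 of 500 male-instructor texts
print(odds_ratio(40, 500, 15, 500))   # > 1 means the word skews toward female instructors
```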
https://arxiv.org/abs/2409.09652
In the wake of a fabricated explosion image at the Pentagon, an ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Secondly, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and 28.43% increase in AUC compared to existing state-of-the-art methods.
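As a simplified flavour of the first two stages (not the paper's Integral Radial Operation or Spatial Fourier Extraction), the sketch below converts RGB to luma and summarises the Fourier magnitude as a radial profile, the kind of frequency cue often used to separate generated from real images.

```python
# Simplified luma + radial Fourier-spectrum feature sketch.
import numpy as np

def rgb_to_y(rgb: np.ndarray) -> np.ndarray:
    """ITU-R BT.601 luma from an (H, W, 3) image in [0, 1]."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def radial_spectrum(y: np.ndarray, bins: int = 32) -> np.ndarray:
    mag = np.abs(np.fft.fftshift(np.fft.fft2(y)))
    h, w = y.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)
    idx = np.minimum((r / r.max() * bins).astype(int), bins - 1)
    return np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)

img = np.random.rand(128, 128, 3)
feature = radial_spectrum(rgb_to_y(img))
```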
https://arxiv.org/abs/2409.07913
Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods which are annotation-greedy and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to a ~2% improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.
https://arxiv.org/abs/2409.07801
Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient -- to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
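The EntiGraph loop can be sketched at a high level: extract salient entities, sample entity pairs, and prompt a generator to write about the connections between them. In the sketch below, `extract_entities` and `generate_text` are placeholders for an NER step and an LLM call; neither reflects the paper's actual implementation.

```python
# High-level EntiGraph-style synthesis sketch with placeholder components.
import itertools
import random

def extract_entities(document: str) -> list[str]:
    # placeholder: a real system would use an NER model or an LLM prompt
    return sorted({w.strip(".,") for w in document.split() if w[:1].isupper()})

def generate_text(prompt: str) -> str:
    return f"[synthetic passage for: {prompt}]"      # placeholder for an LLM call

def entigraph_corpus(document: str, num_samples: int = 5, seed: int = 0) -> list[str]:
    entities = extract_entities(document)
    pairs = list(itertools.combinations(entities, 2))
    random.seed(seed)
    sampled = random.sample(pairs, min(num_samples, len(pairs)))
    return [generate_text(f"Explain how {a} relates to {b} in the source document.")
            for a, b in sampled]

corpus = entigraph_corpus("Ada Lovelace worked with Charles Babbage on the Analytical Engine.")
```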
https://arxiv.org/abs/2409.07431
Saliency prediction models are constrained by the limited diversity and quantity of labeled data. Standard data augmentation techniques such as rotating and cropping alter scene composition, affecting saliency. We propose a novel data augmentation method for deep saliency prediction that edits natural images while preserving the complexity and variability of real-world scenes. Since saliency depends on high-level and low-level features, our approach involves learning both by incorporating photometric and semantic attributes such as color, contrast, brightness, and class. To that end, we introduce a saliency-guided cross-attention mechanism that enables targeted edits on the photometric properties, thereby enhancing saliency within specific image regions. Experimental results show that our data augmentation method consistently improves the performance of various saliency models. Moreover, leveraging the augmentation features for saliency prediction yields superior performance on publicly available saliency benchmarks. Our predictions align closely with human visual attention patterns in the edited images, as validated by a user study.
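For the flavour of a saliency-preserving photometric edit only, the toy sketch below brightens an image where a saliency map is high while leaving composition untouched; the paper instead learns such edits through a saliency-guided cross-attention mechanism, so this is merely the simplest stand-in.

```python
# Toy saliency-guided photometric edit: brighten salient regions only.
import numpy as np

def saliency_guided_brightness(image: np.ndarray, saliency: np.ndarray, delta: float = 0.2) -> np.ndarray:
    """image: (H, W, 3) in [0, 1]; saliency: (H, W) in [0, 1]."""
    edited = image + delta * saliency[..., None]     # weight the edit by saliency
    return np.clip(edited, 0.0, 1.0)

img = np.random.rand(64, 64, 3)
sal = np.random.rand(64, 64)
aug = saliency_guided_brightness(img, sal)
```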
https://arxiv.org/abs/2409.07307
The growing interest in cybersecurity has led to a significant increase in articles designing and implementing various Cyber Deception (CYDEC) mechanisms. This trend reflects the urgent need for new strategies to address cyber threats effectively. Since its emergence, CYDEC has established itself as an innovative defense against attackers, thanks to its proactive and reactive capabilities, finding applications in numerous real-life scenarios. Despite the considerable work devoted to CYDEC, the literature still presents significant gaps. In particular, there has not been (i) a comprehensive analysis of the main components characterizing CYDEC, (ii) a generic classification covering all types of solutions, nor (iii) a survey of the current state of the literature in various contexts. This article aims to fill these gaps through a detailed review of the main features that comprise CYDEC, developing a comprehensive classification taxonomy. In addition, the different frameworks used to generate CYDEC are reviewed, and a more comprehensive one is presented. Existing solutions in the literature using CYDEC, both without Artificial Intelligence (AI) and with AI, are studied and compared. Finally, the most salient trends of the current state of the art are discussed, offering a list of pending challenges for future research.
https://arxiv.org/abs/2409.07194
Context: Generative AI (GenAI) has emerged as a transformative tool in software engineering, with requirements engineering (RE) actively exploring its potential to revolutionize processes and outcomes. The integration of GenAI into RE presents both promising opportunities and significant challenges that necessitate systematic analysis and evaluation. Objective: This paper presents a comprehensive systematic literature review (SLR) analyzing state-of-the-art applications and innovative proposals leveraging GenAI in RE. It surveys studies focusing on the utilization of GenAI to enhance RE processes while identifying key challenges and opportunities in this rapidly evolving field. Method: A rigorous SLR methodology was used to analyze 27 carefully selected primary studies in-depth. The review examined research questions pertaining to the application of GenAI across various RE phases, the models and techniques used, and the challenges encountered in implementation and adoption. Results: The most salient findings include i) a predominant focus on the early stages of RE, particularly the elicitation and analysis of requirements, indicating potential for expansion into later phases; ii) the dominance of large language models, especially the GPT series, highlighting the need for diverse AI approaches; and iii) persistent challenges in domain-specific applications and the interpretability of AI-generated outputs, underscoring areas requiring further research and development. Conclusions: The results highlight the critical need for comprehensive evaluation frameworks, improved human-AI collaboration models, and thorough consideration of ethical implications in GenAI-assisted RE. Future research should prioritize extending GenAI applications across the entire RE lifecycle, enhancing domain-specific capabilities, and developing strategies for responsible AI integration in RE practices.
https://arxiv.org/abs/2409.06741
Deep learning (DL) models have shown significant potential in Alzheimer's Disease (AD) classification. However, understanding and interpreting these models remains challenging, which hinders the adoption of these models in clinical practice. Techniques such as saliency maps have been proven effective in providing visual and empirical clues about how these models work, but there still remains a gap in understanding which specific brain regions DL models focus on and whether these brain regions are pathologically associated with AD. To bridge this gap, in this study, we developed a quantitative disease-focusing strategy to first enhance the interpretability of DL models using saliency maps and brain segmentations; then we propose a disease-focus (DF) score that quantifies how much a DL model focuses on brain areas relevant to AD pathology based on clinically known MRI-based pathological regions of AD. Using this strategy, we compared several state-of-the-art DL models, including a baseline 3D ResNet model, a pretrained MedicalNet model, and a MedicalNet with data augmentation to classify patients with AD vs. cognitively normal patients using MRI data; then we evaluated these models in terms of their abilities to focus on disease-relevant regions. Our results show interesting disease-focusing patterns with different models, particularly characteristic patterns with the pretrained models and data augmentation, and also provide insight into their classification performance. These results suggest that the approach we developed for quantitatively assessing the abilities of DL models to focus on disease-relevant regions may help improve interpretability of these models for AD classification and facilitate their adoption for AD diagnosis in clinical practice. The code is publicly available at this https URL.
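One way such a score can be sketched (the paper's exact DF-score definition may differ) is as the fraction of a model's saliency mass that falls inside clinically defined AD-relevant regions, e.g. a hippocampus mask obtained from a brain segmentation.

```python
# Hedged sketch of a disease-focus style score: saliency mass inside a
# pathology mask divided by total saliency mass.
import numpy as np

def disease_focus_score(saliency: np.ndarray, pathology_mask: np.ndarray) -> float:
    """saliency: non-negative 3D saliency map; pathology_mask: boolean 3D mask."""
    total = saliency.sum()
    return float(saliency[pathology_mask].sum() / total) if total > 0 else 0.0

sal = np.random.rand(64, 64, 64)
mask = np.zeros((64, 64, 64), dtype=bool); mask[20:40, 20:40, 20:40] = True
print(disease_focus_score(sal, mask))
```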
https://arxiv.org/abs/2409.04888
This paper introduces Top-GAP, a novel regularization technique that enhances the explainability and robustness of convolutional neural networks. By constraining the spatial size of the learned feature representation, our method forces the network to focus on the most salient image regions, effectively reducing background influence. Using adversarial attacks and the Effective Receptive Field, we show that Top-GAP directs more attention towards object pixels rather than the background. This leads to enhanced interpretability and robustness. We achieve over 50% robust accuracy on CIFAR-10 with PGD $\epsilon=\frac{8}{255}$ and 20 iterations while maintaining the original clean accuracy. Furthermore, we see accuracy gains of up to 5% under distribution shifts. Our approach also yields more precise object localization, as evidenced by up to a 25% improvement in Intersection over Union (IoU) compared to methods like GradCAM and Recipro-CAM.
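One plausible way to constrain the spatial size of the pooled representation, sketched under assumptions (the paper's exact pooling rule and choice of k may differ), is a top-k global average pooling layer that averages only the k largest activations per channel.

```python
# Top-k global average pooling sketch in the spirit of Top-GAP.
import torch
import torch.nn as nn

class TopKGlobalAvgPool(nn.Module):
    def __init__(self, k: int = 16):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, channels, H, W) -> (batch, channels)."""
        flat = x.flatten(2)                                    # (batch, channels, H*W)
        topk = flat.topk(min(self.k, flat.shape[-1]), dim=-1).values
        return topk.mean(dim=-1)                               # average only the k largest activations

pool = TopKGlobalAvgPool(k=16)
features = torch.randn(4, 512, 7, 7)
pooled = pool(features)                                        # (4, 512)
```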
https://arxiv.org/abs/2409.04819
Scribble-supervised salient object detection (SSSOD) builds the ability to segment attractive objects from their surroundings under the supervision of sparse scribble labels. For better segmentation, depth and thermal infrared modalities serve as supplements to RGB images in complex scenes. Existing methods specifically design various feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image inputs respectively, leading to a flood of similar models. Since the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and interactive prompting capabilities, we propose a SAM-based SSSOD family, named SSFam, for combined inputs with different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge, which cooperates with modal-agnostic information extracted from the frozen SAM encoder for a better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between training with scribble prompts and testing without prompts, for stronger decoding ability. Our model demonstrates remarkable performance across combinations of different modalities, refreshes the highest level of scribble-supervised methods, and comes close to fully supervised methods. this https URL
https://arxiv.org/abs/2409.04817
While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance at detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs' generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.
https://arxiv.org/abs/2409.03961