In the domain of Mobility Data Science, the intricate task of interpreting models trained on trajectory data, and elucidating the spatio-temporal movement of entities, has persistently posed significant challenges. Conventional XAI techniques, although brimming with potential, frequently overlook the distinct structure and nuances inherent within trajectory data. Observing this deficiency, we introduced a comprehensive framework that harmonizes pivotal XAI techniques: LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), Saliency maps, attention mechanisms, direct trajectory visualization, and Permutation Feature Importance (PFI). Unlike conventional strategies that deploy these methods singularly, our unified approach capitalizes on the collective efficacy of these techniques, yielding deeper and more granular insights for models reliant on trajectory data. In crafting this synthesis, we effectively address the multifaceted essence of trajectories, achieving not only amplified interpretability but also a nuanced, contextually rich comprehension of model decisions. To validate and enhance our framework, we undertook a survey to gauge preferences and reception among various user demographics. Our findings underscored a dichotomy: professionals with academic orientations, particularly those in roles like Data Scientist, IT Expert, and ML Engineer, showcased a profound, technical understanding and often exhibited a predilection for amalgamated methods for interpretability. Conversely, end-users or individuals less acquainted with AI and Data Science showcased simpler inclinations, such as bar plots indicating timestep significance or visual depictions pinpointing pivotal segments of a vessel's trajectory.
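As a concrete illustration of one component of such a framework, the sketch below applies permutation feature importance to a trajectory classifier. The (trajectories, timesteps, features) input layout and the `model`/`metric` objects are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def permutation_feature_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Model-agnostic PFI for a trajectory model: the drop in a score (e.g.
    accuracy) when one feature channel (speed, heading, latitude, ...) is
    shuffled across trajectories. X is assumed to have shape
    (n_trajectories, n_timesteps, n_features)."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[-1])
    for f in range(X.shape[-1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            X_perm[:, :, f] = X[perm, :, f]     # break this feature's association with y
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[f] = np.mean(drops)
    return importances
```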
https://arxiv.org/abs/2312.00380
Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy for achieving multimodal semantic segmentation, which is training-costly due to the massive parameter updates in feature extraction and fusion. To address this issue, we propose a surprisingly simple yet effective dual-prompt learning network (dubbed DPLNet) for training-efficient multimodal (e.g., RGB-D/T) semantic segmentation. The core of DPLNet is to directly adapt a frozen pre-trained RGB model to multimodal semantic segmentation, reducing parameter updates. For this purpose, we present two prompt learning modules, comprising multimodal prompt generator (MPG) and multimodal feature adapter (MFA). MPG works to fuse the features from different modalities in a compact manner and is inserted from shallow to deep stages to generate the multi-level multimodal prompts that are injected into the frozen backbone, while MFA adapts the prompted multimodal features in the frozen backbone for better multimodal semantic segmentation. Since both the MPG and MFA are lightweight, only a few trainable parameters (3.88M, 4.4% of the pre-trained backbone parameters) are introduced for multimodal feature fusion and learning. Using a simple decoder (3.27M parameters), DPLNet achieves new state-of-the-art performance or is on a par with other complex approaches on four RGB-D/T semantic segmentation datasets while satisfying parameter efficiency. Moreover, we show that DPLNet is general and applicable to other multimodal tasks such as salient object detection and video semantic segmentation. Without special design, DPLNet outperforms many complicated models. Our code will be available at this http URL.
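A minimal sketch of what prompt modules in this spirit could look like, assuming an MPG-style fusion block and an MFA-style bottleneck adapter attached to a frozen backbone stage; the channel sizes, activations, and additive prompt injection are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultimodalPromptGenerator(nn.Module):
    """Sketch of an MPG-like module: fuses RGB and auxiliary-modality features
    into a compact prompt that is added to the frozen backbone's features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, rgb_feat, aux_feat):
        return self.fuse(torch.cat([rgb_feat, aux_feat], dim=1))

class MultimodalFeatureAdapter(nn.Module):
    """Sketch of an MFA-like bottleneck adapter applied to prompted features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.down, self.up = nn.Conv2d(channels, hidden, 1), nn.Conv2d(hidden, channels, 1)
        self.act = nn.GELU()

    def forward(self, feat):
        return feat + self.up(self.act(self.down(feat)))

# Usage: only the prompt modules are trainable; the backbone stays frozen.
rgb_feat = torch.randn(2, 64, 32, 32)   # features from a frozen RGB backbone stage
aux_feat = torch.randn(2, 64, 32, 32)   # features from the depth/thermal input
mpg, mfa = MultimodalPromptGenerator(64), MultimodalFeatureAdapter(64)
prompted = rgb_feat + mpg(rgb_feat, aux_feat)   # prompt injected into the frozen stage
adapted = mfa(prompted)
```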
https://arxiv.org/abs/2312.00360
Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true: more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with only a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.
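A rough sketch of what a many-to-many poisoning loop might look like under these assumptions; `trigger_for` is a hypothetical stand-in for the paper's learned, class-conditioned triggers, and the image layout (float HWC in [0, 1]) is assumed.

```python
import numpy as np

def trigger_for(target, size=8, channels=3):
    """Hypothetical stand-in for a learned trigger: a deterministic pattern
    derived from the target class, so triggers share salient structure."""
    rng = np.random.default_rng(target)
    return rng.uniform(0.0, 1.0, size=(size, size, channels))

def poison_dataset(images, labels, num_classes, poison_rate=0.0015, seed=0):
    """Sketch of a many-to-many poisoning loop: ~0.15% of samples receive a
    class-conditioned trigger patch and are relabeled to the chosen target."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        target = int(rng.integers(num_classes))      # any target class
        patch = trigger_for(target)
        h, w, _ = patch.shape
        images[i, -h:, -w:, :] = patch               # stamp the trigger in a corner
        labels[i] = target
    return images, labels
```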
https://arxiv.org/abs/2312.00157
Deep convolutional neural networks have been widely applied in salient object detection and have achieved remarkable results in this field. However, existing models suffer from information distortion caused by interpolation during up-sampling and down-sampling. In response to this drawback, this article approaches the problem from two directions in the network: features and labels. On the one hand, a novel cascaded interaction network with a guidance module named global-local aligned attention (GAA) is designed to reduce the negative impact of interpolation on the feature side. On the other hand, a deep supervision strategy based on edge erosion is proposed to reduce the negative guidance of label interpolation on the lateral outputs. Extensive experiments on five popular datasets demonstrate the superiority of our method.
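The edge-erosion idea for label-side supervision could be sketched roughly as below, assuming binary saliency labels and SciPy for the morphology; the erosion depth and the downsampling scheme are illustrative settings, not the paper's.

```python
import numpy as np
from scipy.ndimage import binary_erosion, zoom

def eroded_side_label(label, scale, erosion_iters=2):
    """Sketch of edge-erosion deep supervision: before supervising a
    low-resolution lateral output, erode the binary label so pixels near
    object boundaries (those most distorted by label interpolation) are
    excluded from the side loss."""
    eroded = binary_erosion(label > 0.5, iterations=erosion_iters)
    small = zoom(eroded.astype(np.float32), scale, order=1)   # match side-output size
    return (small > 0.5).astype(np.float32)

# Usage: supervise a 1/8-resolution side output with an eroded label
label = np.zeros((256, 256)); label[64:192, 64:192] = 1.0
side_target = eroded_side_label(label, scale=1 / 8)
```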
https://arxiv.org/abs/2311.18675
Colonoscopy screening is the gold standard procedure for assessing abnormalities in the colon and rectum, such as ulcers and cancerous polyps. Measuring the abnormal mucosal area and its 3D reconstruction can help quantify the surveyed area and objectively evaluate disease burden. However, due to the complex topology of these organs and variable physical conditions (for example, lighting, large homogeneous texture, and image modality), estimating distance from the camera (aka depth) is highly challenging. Moreover, most colonoscopic video acquisition is monocular, making depth estimation a non-trivial problem. While methods in computer vision for depth estimation have been proposed and advanced on natural scene datasets, the efficacy of these techniques has not been widely quantified on colonoscopy datasets. As the colonic mucosa has several low-texture regions that are not well pronounced, learning representations from an auxiliary task can improve salient feature extraction, allowing estimation of accurate camera depths. In this work, we propose a novel multi-task learning (MTL) approach with a shared encoder and two decoders, namely a surface normal decoder and a depth estimator decoder. Our depth estimator incorporates attention mechanisms to enhance global context awareness. We leverage the surface normal prediction to improve geometric feature extraction. Also, we apply a cross-task consistency loss between the two geometrically related tasks, surface normal and camera depth. We demonstrate an improvement of 14.17% on relative error and 10.4% on $\delta_{1}$ accuracy over the most accurate baseline state-of-the-art BTS approach. All experiments are conducted on the recently released C3VD dataset; thus, we provide a first benchmark of state-of-the-art methods.
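A hedged sketch of a cross-task consistency term between the two decoders: normals implied by the predicted depth (via simple finite differences, ignoring camera intrinsics) are compared to the predicted normals with a cosine penalty. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    """Approximate surface normals from a depth map (B, 1, H, W) via finite
    differences; a simplified stand-in that ignores camera intrinsics."""
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))          # restore width
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))          # restore height
    normals = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    return F.normalize(normals, dim=1)

def cross_task_consistency_loss(pred_depth, pred_normals):
    """Penalize disagreement between the normal decoder's output and the
    normals implied by the depth decoder's output (cosine distance)."""
    implied = normals_from_depth(pred_depth)
    cos = (F.normalize(pred_normals, dim=1) * implied).sum(dim=1)
    return (1.0 - cos).mean()

# Usage with dummy decoder outputs
depth = torch.rand(2, 1, 64, 64)
normals = F.normalize(torch.randn(2, 3, 64, 64), dim=1)
loss = cross_task_consistency_loss(depth, normals)
```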
https://arxiv.org/abs/2311.18664
Anomaly detection (AD) is a fundamental task in computer vision. It aims to identify incorrect image data patterns which deviate from the normal ones. Conventional methods generally address AD by preparing augmented negative samples to enforce self-supervised learning. However, these techniques typically do not consider semantics during augmentation, leading to the generation of unrealistic or invalid negative samples. Consequently, the feature extraction network can be hindered from embedding critical features. In this study, inspired by visual attention learning approaches, we propose CutSwap, which leverages saliency guidance to incorporate semantic cues for augmentation. Specifically, we first employ LayerCAM to extract multilevel image features as saliency maps and then perform clustering to obtain multiple centroids. To fully exploit saliency guidance, on each map, we select a pixel pair from the cluster with the highest centroid saliency to form a patch pair. Such a patch pair includes highly similar context information with dense semantic correlations. The resulting negative sample is created by swapping the locations of the patch pair. Compared to prior augmentation methods, CutSwap generates more subtle yet realistic negative samples to facilitate quality feature learning. Extensive experimental and ablative evaluations demonstrate that our method achieves state-of-the-art AD performance on two mainstream AD benchmark datasets.
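A simplified sketch of the augmentation step, assuming a precomputed (e.g., LayerCAM) saliency map is given and using scikit-learn's KMeans for the clustering; the patch size, cluster count, and pixel-pair selection rule are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cutswap_negative(image, saliency, patch=16, k=4, seed=0):
    """Sketch of saliency-guided augmentation in the spirit of CutSwap:
    cluster pixel saliencies, pick two locations from the most salient
    cluster, and swap the patches centred on them."""
    h, w = saliency.shape
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(saliency.reshape(-1, 1))
    top = int(np.argmax(km.cluster_centers_.ravel()))   # most salient cluster
    cand = np.flatnonzero(km.labels_ == top)
    rng = np.random.default_rng(seed)
    i, j = rng.choice(cand, size=2, replace=False)
    y1, x1 = np.unravel_index(i, (h, w))
    y2, x2 = np.unravel_index(j, (h, w))

    def box(y, x):
        y0 = int(np.clip(y - patch // 2, 0, h - patch))
        x0 = int(np.clip(x - patch // 2, 0, w - patch))
        return slice(y0, y0 + patch), slice(x0, x0 + patch)

    out = image.copy()
    (sy1, sx1), (sy2, sx2) = box(y1, x1), box(y2, x2)
    out[sy1, sx1], out[sy2, sx2] = image[sy2, sx2].copy(), image[sy1, sx1].copy()
    return out

# Usage with a toy image and a precomputed saliency map of the same size
img, sal = np.random.rand(64, 64, 3), np.random.rand(64, 64)
negative = cutswap_negative(img, sal)
```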
https://arxiv.org/abs/2311.18332
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridge the image and motion by utilizing the flexibility of attention operations, where coarse masks predicted from the fused features at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusing module is required and we only adopt a light decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the fewest parameters (13.7 M). Our SimulFlow also obtains competitive results on video salient object detection datasets.
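One way such mask-constrained cross-modal attention could be sketched, using PyTorch's built-in multi-head attention with a key-padding mask derived from the coarse prediction; this is an assumption-laden simplification, not the SimulFlow Attention module itself.

```python
import torch
import torch.nn as nn

class MaskConstrainedAttention(nn.Module):
    """Sketch of the idea: cross-attention between appearance and optical-flow
    tokens, with a coarse mask suppressing attention to background positions
    so noise outside the predicted object region is excluded."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, flow_tokens, coarse_mask):
        # coarse_mask: (B, N), True for positions predicted as background
        fused, _ = self.attn(query=img_tokens, key=flow_tokens, value=flow_tokens,
                             key_padding_mask=coarse_mask)
        return fused

# Usage with dummy tokens from both modalities
B, N, C = 2, 196, 64
img_tokens, flow_tokens = torch.randn(B, N, C), torch.randn(B, N, C)
coarse_mask = torch.rand(B, N) > 0.7          # True = masked-out background token
fused = MaskConstrainedAttention(C)(img_tokens, flow_tokens, coarse_mask)
```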
https://arxiv.org/abs/2311.18286
High-dimensional time series data poses challenges due to its dynamic nature, varying lengths, and presence of missing values. This kind of data requires extensive preprocessing, limiting the applicability of existing Time Series Classification and Time Series Extrinsic Regression techniques. For this reason, we propose BORF, a Bag-Of-Receptive-Fields model, which incorporates notions from time series convolution and 1D-SAX to handle univariate and multivariate time series with varying lengths and missing values. We evaluate BORF on Time Series Classification and Time Series Extrinsic Regression tasks using the full UEA and UCR repositories, demonstrating its competitive performance against state-of-the-art methods. Finally, we outline how this representation can naturally provide saliency and feature-based explanations.
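For intuition, here is a sketch of plain SAX symbolization over sliding windows, as a simplified stand-in for the 1D-SAX-based receptive fields (1D-SAX additionally encodes segment slopes); NaNs are treated as missing values.

```python
import numpy as np
from scipy.stats import norm

def sax_word(window, n_segments=4, alphabet_size=4):
    """Plain SAX symbolization of one receptive field (window): z-normalize,
    take piecewise aggregate means, then map each mean to a symbol via
    Gaussian breakpoints. NaNs (missing values) are simply ignored."""
    w = np.asarray(window, dtype=float)
    w = (w - np.nanmean(w)) / (np.nanstd(w) + 1e-8)
    paa = np.array([np.nanmean(seg) for seg in np.array_split(w, n_segments)])
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return tuple(np.searchsorted(breakpoints, paa))

# A bag of words over sliding windows of one (univariate) series
series = np.sin(np.linspace(0, 6 * np.pi, 120)) + 0.1 * np.random.randn(120)
bag = {}
for start in range(0, len(series) - 24, 8):
    word = sax_word(series[start:start + 24])
    bag[word] = bag.get(word, 0) + 1
```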
https://arxiv.org/abs/2311.18029
Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency issues of faithfulness metrics hinder reliable benchmarking of explanation methods. This paper employs an approach inspired by psychometrics, utilizing Krippendorff's alpha to quantify the benchmark reliability of post-hoc methods in image classification. The study proposes model training modifications, including feeding perturbed samples and employing focal loss, to enhance robustness and calibration. Empirical evaluations demonstrate significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods. This pioneering work establishes a foundation for more reliable evaluation practices in the realm of post-hoc explanation methods, emphasizing the importance of model robustness in the assessment process.
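A small example of the reliability computation, assuming the `krippendorff` Python package and a hypothetical table in which faithfulness metrics play the role of raters and post-hoc methods the role of units; the values shown are made up for illustration.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical benchmark: rows = faithfulness metrics ("raters"), columns =
# post-hoc explanation methods ("units"), entries = the rank each metric
# assigns to each method on a given model/dataset (np.nan if unavailable).
rankings = np.array([
    [1, 2, 3, 4, 5],       # e.g. deletion-style metric
    [2, 1, 3, 5, 4],       # e.g. insertion-style metric
    [1, 3, 2, 4, np.nan],  # e.g. sufficiency-style metric
], dtype=float)

# Alpha near 1 -> metrics agree on how the methods rank (reliable benchmark);
# alpha near 0 -> the benchmark's conclusions are unreliable.
alpha = krippendorff.alpha(reliability_data=rankings,
                           level_of_measurement="ordinal")
print(f"benchmark reliability (Krippendorff's alpha): {alpha:.3f}")
```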
https://arxiv.org/abs/2311.17876
Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.
https://arxiv.org/abs/2311.17138
Existing single-modal and multi-modal salient object detection (SOD) methods focus on designing specific architectures tailored for their respective tasks. However, developing completely different models for different tasks leads to labor and time consumption, as well as high computational and practical deployment costs. In this paper, we make the first attempt to address both single-modal and multi-modal SOD in a unified framework called UniSOD. Nevertheless, assigning appropriate strategies to modality variable inputs is challenging. To this end, UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning, which are plugged into the proposed pre-trained baseline SOD model to handle corresponding tasks, while only requiring few learnable parameters compared to training the entire model. Each modality-aware prompt is generated from a switchable prompt generation block, which performs structural switching solely relied on single-modal and multi-modal inputs. UniSOD achieves consistent performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD, which demonstrates that our method effectively and efficiently unifies single-modal and multi-modal SOD tasks.
https://arxiv.org/abs/2311.16835
Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. To this end, this paper proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.
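A heavily simplified stand-in for the sampling step, assuming URS amounts to drawing points uniformly at random from each frame's point cloud before saliency detection; the paper's actual sampler is likely more elaborate.

```python
import numpy as np

def uniform_random_sampling(points, n_samples, seed=0):
    """Minimal stand-in for URS on one volumetric-video frame: sample points
    uniformly at random without replacement, shrinking the cloud while keeping
    its spatial distribution in expectation."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(n_samples, len(points)), replace=False)
    return points[idx]

# Usage: downsample a synthetic (x, y, z, r, g, b) cloud before saliency detection
cloud = np.random.rand(200000, 6)
sampled = uniform_random_sampling(cloud, n_samples=4096)
```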
https://arxiv.org/abs/2311.16462
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.
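A rough sketch of a mutual-aware block built from two standard cross-attention layers; which direction each of the paper's names (Vision-Guided vs. Language-Guided) refers to, and the residual combination, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MutualAwareAttention(nn.Module):
    """Sketch of a mutual-aware fusion block: one branch lets visual tokens
    attend to language tokens, the other lets language tokens attend to
    visual tokens, and both refined streams are returned for mask decoding."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.vision_guided = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.language_guided = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, lang_tokens):
        vis_out, _ = self.vision_guided(vis_tokens, lang_tokens, lang_tokens)
        lang_out, _ = self.language_guided(lang_tokens, vis_tokens, vis_tokens)
        return vis_tokens + vis_out, lang_tokens + lang_out

# Usage: image tokens (e.g. from a SAM-style encoder) and text tokens of equal width
vis = torch.randn(2, 1024, 256)     # (B, HW, C) visual features
lang = torch.randn(2, 20, 256)      # (B, L, C) expression features
vis_ref, lang_ref = MutualAwareAttention(256)(vis, lang)
```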
https://arxiv.org/abs/2311.15727
Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of \mt training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release \docmucow, a challenge set for translation disambiguation based on the English-German \mucow \cite{raganato-etal-2020-evaluation} augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.
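A toy sketch of the prefix construction, using a TF-IDF-style salience score and a hypothetical `<sep>` tag as stand-ins for whatever scoring and input formatting the paper actually uses.

```python
from collections import Counter
import math
import re

def salient_prefix(related_sentences, corpus_doc_freq, n_docs, top_k=5):
    """Sketch of building a pseudo-document prefix: score words in the
    retrieved related sentences by a TF-IDF-like weight and keep the top-k
    as a prefix string for the source sentence."""
    tokens = [t.lower() for s in related_sentences for t in re.findall(r"\w+", s)]
    tf = Counter(tokens)
    def score(word):
        return tf[word] * math.log(n_docs / (1 + corpus_doc_freq.get(word, 0)))
    salient = sorted(tf, key=score, reverse=True)[:top_k]
    return " ".join(salient)

# Usage: prepend the salient words to the source sentence with a separator tag
related = ["The bank approved the loan.", "Interest rates at the bank rose."]
doc_freq = {"the": 900, "bank": 40, "loan": 25, "interest": 30, "rates": 35,
            "approved": 20, "at": 800, "rose": 50}
prefix = salient_prefix(related, doc_freq, n_docs=1000)
model_input = f"{prefix} <sep> She went to the bank."
```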
https://arxiv.org/abs/2311.15507
Understanding the factors that determine video memorability has important applications in areas such as educational technology and advertising. Towards this goal, we investigate the semantic and temporal attention mechanisms underlying video memorability. We propose a Transformer-based model with spatio-temporal attention that matches SoTA performance on video memorability prediction on a large naturalistic video dataset. More importantly, the self-attention patterns show us where the model looks to predict memorability. We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment where humans perform a video memory task. Quantitative saliency metrics show that the model attention and human gaze follow similar patterns. Furthermore, while panoptic segmentation confirms that the model and humans attend more to thing classes, stuff classes that receive increased/decreased attention tend to have higher memorability scores. We also observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.
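For reference, one of the standard quantitative saliency metrics used in such comparisons, Pearson's correlation coefficient (CC), can be computed as in the sketch below; the specific metric set used in the paper is not restated here.

```python
import numpy as np

def saliency_cc(attention_map, fixation_density):
    """Pearson correlation coefficient (CC), a standard saliency metric,
    between a model's spatial attention map and a human fixation density map
    of the same shape (both maps are standardized first)."""
    a = (attention_map - attention_map.mean()) / (attention_map.std() + 1e-8)
    f = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-8)
    return float((a * f).mean())

# Usage with toy maps; values near 1 indicate closely matching spatial patterns
cc = saliency_cc(np.random.rand(36, 64), np.random.rand(36, 64))
```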
https://arxiv.org/abs/2311.16484
Salient object detection (SOD) and camouflaged object detection (COD) are related yet distinct binary mapping tasks. These tasks involve multiple modalities, sharing commonalities and unique cues. Existing research often employs intricate task-specific specialist models, potentially leading to redundancy and suboptimal results. We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks by combining 2D prompts, such as RGB-D COD.
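A minimal sketch of what a 2D prompt bank could look like, assuming one learnable prompt per domain and per task, composed additively; the token count, dimensionality, and composition rule are illustrative assumptions rather than VSCode's actual design.

```python
import torch
import torch.nn as nn

class TwoDPromptBank(nn.Module):
    """Sketch of 2D prompt learning: separate learnable prompts along a domain
    axis (RGB, RGB-D, RGB-T, ...) and a task axis (SOD, COD), composed per
    input. Combining, say, the RGB-D domain prompt with the COD task prompt is
    what would give zero-shot RGB-D COD."""
    def __init__(self, dim, n_tokens, domains, tasks):
        super().__init__()
        self.domain_prompts = nn.ParameterDict(
            {d: nn.Parameter(torch.randn(n_tokens, dim) * 0.02) for d in domains})
        self.task_prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(n_tokens, dim) * 0.02) for t in tasks})

    def forward(self, domain, task):
        return self.domain_prompts[domain] + self.task_prompts[task]

bank = TwoDPromptBank(dim=384, n_tokens=4,
                      domains=["rgb", "rgbd", "rgbt"], tasks=["sod", "cod"])
prompt = bank("rgbd", "cod")   # composed prompt, prepended to the encoder tokens
```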
https://arxiv.org/abs/2311.15011
Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and count the number of windows of a building by deploying an Unmanned Aerial Vehicle (UAV) based remote sensing system is proposed. The proposed two-stage method automates the identification and counting of windows by developing computer vision pipelines that utilize data from UAV's onboard camera and other sensors. Quantitative and Qualitative results show the effectiveness of our proposed approach in accurately detecting and counting the windows compared to the existing method.
https://arxiv.org/abs/2311.14635
Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14$\times$ reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available at this https URL.
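A sketch of the hybrid pattern under stated assumptions: saliency is taken as |w · g| with gradients accumulated from user-class batches, block sparsity is applied along weight-matrix columns, and 2:4 sparsity is kept inside the surviving blocks. CRISP's actual block layout and scoring may differ.

```python
import numpy as np

def class_aware_saliency(weight, grad_user_classes):
    """Gradient-based saliency of each weight for the user's classes:
    |w * dL/dw|, with gradients taken on batches from user-specific classes."""
    return np.abs(weight * grad_user_classes)

def hybrid_nm_block_prune(weight, saliency, n=2, m=4, block=8, block_keep=0.5):
    """Sketch of a hybrid pattern: drop whole column blocks with the lowest
    total saliency (coarse block sparsity), then keep only the N most salient
    weights in every group of M inside the surviving blocks (fine N:M)."""
    w = weight.copy()
    cols = w.shape[1] - w.shape[1] % block
    blk_scores = saliency[:, :cols].reshape(w.shape[0], -1, block).sum(axis=(0, 2))
    keep_blocks = np.argsort(blk_scores)[::-1][: int(len(blk_scores) * block_keep)]
    mask = np.zeros_like(w, dtype=bool)
    for b in keep_blocks:
        sl = slice(b * block, (b + 1) * block)
        sal = saliency[:, sl].reshape(-1, m)
        keep = np.argsort(sal, axis=1)[:, -n:]          # top-N per group of M
        group_mask = np.zeros_like(sal, dtype=bool)
        np.put_along_axis(group_mask, keep, True, axis=1)
        mask[:, sl] = group_mask.reshape(w.shape[0], block)
    return w * mask

# Usage with a random layer and synthetic gradients from user-class batches
w, g = np.random.randn(64, 128), np.random.randn(64, 128)
pruned = hybrid_nm_block_prune(w, class_aware_saliency(w, g))
```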
https://arxiv.org/abs/2311.14272
The novel 2019 Coronavirus disease (COVID-19) global pandemic is a defining health crisis. Recent efforts have been increasingly directed towards achieving quick and accurate detection of COVID-19 across symptomatic patients to mitigate the intensity and spread of the disease. Artificial intelligence (AI) algorithms applied to chest X-ray (CXR) images have emerged as promising diagnostic tools, and previous work has demonstrated impressive classification performances. However, such methods have faced criticisms from physicians due to their black-box reasoning process and unpredictable nature. In contrast to professional radiologist diagnosis, AI systems often lack generalizability, explainability, and robustness in the clinical decision making process. In our work, we address these issues by first proposing an extensive baseline study, training and evaluating 21 convolutional neural network (CNN) models on a diverse set of 33,000+ CXR images to classify between healthy, COVID-19, and non-COVID-19 pneumonia CXRs. Our resulting models achieved a 3-way classification accuracy, recall, and precision of up to 97.03\%, 97.97\%, and 99.95\%, respectively. Next, we investigate the effectiveness of adversarial training on model robustness and explainability via Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps. We find that adversarially trained models not only significantly outperform their standard counterparts on classifying perturbed images, but also yield saliency maps that 1) better specify clinically relevant features, 2) are robust against extraneous artifacts, and 3) agree considerably more with expert radiologist findings.
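As an illustration of the adversarial-training ingredient, a single FGSM training step is sketched below; the paper may use a different attack or training schedule, and the step size and [0, 1] pixel range are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_step(model, optimizer, images, labels, epsilon=2 / 255):
    """Sketch of one adversarial-training step (FGSM): perturb the CXR batch in
    the direction of the loss gradient, then train on the perturbed images so
    the classifier (and its Grad-CAM saliency) becomes more robust."""
    model.eval()
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    adv_images = (images + epsilon * grad.sign()).clamp(0, 1).detach()  # assumes [0, 1] inputs

    model.train()
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_images), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```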
https://arxiv.org/abs/2311.14227
We present a novel approach for saliency prediction in images, leveraging parallel decoding in transformers to learn saliency solely from fixation maps. Models typically rely on continuous saliency maps to overcome the difficulty of optimizing for the discrete fixation map. We attempt to replicate the experimental setup that generates saliency datasets. Our approach treats saliency prediction as a direct set prediction problem, via a global loss that enforces unique fixation predictions through bipartite matching and a transformer encoder-decoder architecture. By utilizing a fixed set of learned fixation queries, the cross-attention reasons over the image features to directly output the fixation points, which distinguishes our model from other modern saliency predictors. Our approach, named Saliency TRansformer (SalTR), achieves metric scores on par with state-of-the-art approaches on the Salicon and MIT300 benchmarks.
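A sketch of what a DETR-style set loss for fixations could look like: Hungarian matching via SciPy, an L1 term on matched fixation positions, and a binary fixation / no-fixation classification term; the loss weighting and exact terms are assumptions, not SalTR's published loss.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def fixation_matching_loss(pred_points, pred_logits, gt_points):
    """Sketch of a set loss for fixations: Hungarian-match predicted fixation
    queries to ground-truth fixations, penalize L1 position error on matched
    queries, and classify unmatched queries as 'no fixation'."""
    with torch.no_grad():
        cost = torch.cdist(pred_points, gt_points, p=1)        # (queries, fixations)
        rows, cols = linear_sum_assignment(cost.cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    point_loss = torch.abs(pred_points[rows] - gt_points[cols]).mean()
    target = torch.zeros(pred_logits.shape[0], dtype=torch.long)
    target[rows] = 1                                            # 1 = matched fixation
    cls_loss = F.cross_entropy(pred_logits, target)
    return point_loss + cls_loss

# Usage: 32 learned queries predicting (x, y) in [0, 1] against 12 GT fixations
pred_points = torch.rand(32, 2, requires_grad=True)
pred_logits = torch.randn(32, 2, requires_grad=True)            # (no fixation, fixation)
gt_points = torch.rand(12, 2)
loss = fixation_matching_loss(pred_points, pred_logits, gt_points)
```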
https://arxiv.org/abs/2311.14073