We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
https://arxiv.org/abs/2404.12390
One challenge for neural networks in real-life applications is that they make overconfident errors when the data does not come from the original training distribution. Addressing this issue is known as Out-of-Distribution (OOD) detection. Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance. However, these methods fail to fully exploit the local information embedded in the auxiliary dataset. In this work, we propose leveraging the information embedded in the gradient of the loss function during training so that the network not only learns a desired OOD score for each sample but also exhibits similar behavior in a local neighborhood around each sample. We also develop a novel energy-based sampling method that exposes the network to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% in our ImageNet experiment. We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.
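As a rough illustration of the two ingredients named above, the sketch below (PyTorch; the function names, the random-perturbation neighborhood, and the hardest-sample heuristic are assumptions for this example, not the paper's exact formulation) shows an energy-based OOD score, an energy-driven selection of informative auxiliary outliers, and a penalty that asks the score to stay consistent in a small neighborhood around each sample.

```python
import torch
import torch.nn.functional as F

def energy_score(logits, temperature=1.0):
    # Standard energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).
    # Lower energy ~ more in-distribution; higher energy ~ more OOD-like.
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

def select_informative_outliers(model, aux_loader, k, device="cpu"):
    # Illustrative energy-based sampling: keep the k auxiliary samples the
    # current model finds hardest, i.e. the ones that look most in-distribution.
    scores, samples = [], []
    model.eval()
    with torch.no_grad():
        for x, _ in aux_loader:
            scores.append(energy_score(model(x.to(device))).cpu())
            samples.append(x)
    scores, samples = torch.cat(scores), torch.cat(samples)
    hardest = torch.argsort(scores)[:k]
    return samples[hardest]

def local_consistency_penalty(model, x_out, epsilon=0.01):
    # Illustrative neighborhood term: the OOD score should vary little under a
    # small input perturbation, standing in for the gradient-based local objective.
    x_pert = x_out + epsilon * torch.randn_like(x_out)
    return F.mse_loss(energy_score(model(x_pert)), energy_score(model(x_out)))
```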
https://arxiv.org/abs/2404.12368
This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models (BERT, T5, and GPT-2), which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
https://arxiv.org/abs/2404.12291
In this study, we introduce DeepLocalization, an innovative framework for the real-time localization of actions, tailored explicitly to monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving, a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle the nuances of driving activities, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it vastly applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.
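As a loose sketch of the temporal-localization half of such a pipeline, the snippet below runs a generic change-point detector over per-frame feature embeddings to propose action segments; the paper uses Graph-Based Change-Point Detection, so the `ruptures` detector here is only a stand-in, and the prompt string is purely illustrative.

```python
import numpy as np
import ruptures as rpt  # generic change-point detection library, used as a stand-in

def propose_action_segments(frame_features, penalty=10.0):
    """frame_features: (T, D) array of per-frame embeddings from a video backbone.
    Returns (start, end) frame-index pairs delimited by detected change points."""
    breakpoints = rpt.Pelt(model="rbf").fit(frame_features).predict(pen=penalty)
    starts = [0] + breakpoints[:-1]
    return list(zip(starts, breakpoints))

segments = propose_action_segments(np.random.randn(600, 256))
# Each segment could then be classified by the Video-LLM with a prompt such as:
# "Frames {start}-{end} show the driver. Which distracted-driving activity is occurring?"
```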
https://arxiv.org/abs/2404.12258
Anomaly detection and localization in images is a growing field in computer vision. In this area, a seemingly understudied problem is anomaly clustering, i.e., identifying and grouping different types of anomalies in a fully unsupervised manner. In this work, we propose a novel method for clustering anomalies in largely stationary images (textures) in a blind setting. That is, the input consists of normal and anomalous images without distinction and without labels. What contributes to the difficulty of the task is that anomalous regions are often small and may present only subtle changes in appearance, which can be easily overshadowed by the genuine variance in the texture. Moreover, each anomaly type may have a complex appearance distribution. We introduce a novel scheme for solving this task using a combination of blind anomaly localization and contrastive learning. By identifying the anomalous regions with high fidelity, we can restrict our focus to those regions of interest; then, contrastive learning is employed to increase the separability of different anomaly types and reduce the intra-class variation. Our experiments show that the proposed solution yields significantly better results compared to prior work, setting a new state of the art. Project page: this https URL.
https://arxiv.org/abs/2404.12246
The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.
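The snippet below sketches how the two studied factors (prompt verbosity and added context) and the confidence-based ranking might look in practice. The `llm_yes_probability` helper is hypothetical, standing in for any API that returns the model's probability of answering "yes"; the two prompt templates are illustrative, not the paper's.

```python
TERSE = "Is this claim worth fact-checking? Answer yes or no.\nClaim: {claim}"
VERBOSE = (
    "A claim is check-worthy if it is factual, of public interest, and potentially "
    "harmful if false. Given this criterion, is the claim below worth fact-checking? "
    "Answer yes or no.\nClaim: {claim}"
)

def rank_by_check_worthiness(claims, llm_yes_probability, template=VERBOSE, context=None):
    """llm_yes_probability(prompt) -> float is a hypothetical helper returning the
    model's confidence in a 'yes' answer; claims are ranked by that confidence."""
    scored = []
    for claim in claims:
        prompt = template.format(claim=claim)
        if context is not None:  # optionally prepend surrounding sentences / thread
            prompt = f"Context: {context[claim]}\n" + prompt
        scored.append((llm_yes_probability(prompt), claim))
    return sorted(scored, reverse=True)  # highest confidence = most check-worthy
```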
https://arxiv.org/abs/2404.12174
Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs to accurately detect stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.
https://arxiv.org/abs/2404.12171
Multimodal Large Language Models (MLLMs) have shown outstanding capabilities in many areas of multimodal reasoning. We therefore use the reasoning ability of MLLMs for environment description and scene understanding in complex transportation environments. In this paper, we propose AccidentBlip2, a multimodal large language model that can predict in real time whether an accident risk will occur. Our approach extracts features from the temporal scene of the six-view surround-view graphs and performs temporal inference with the temporal Blip framework through the vision transformer. We then feed the generated temporal tokens into the MLLMs for inference to determine whether an accident will occur. Since AccidentBlip2 does not rely on any BEV images or LiDAR, the number of inference parameters and the inference cost of MLLMs can be significantly reduced, and training does not incur a large overhead. AccidentBlip2 outperforms existing solutions on the DeepAccident dataset and can also serve as a reference solution for end-to-end automated driving accident prediction.
https://arxiv.org/abs/2404.12149
This paper addresses the problem of detecting time series outliers, focusing on systems with repetitive behavior, such as industrial robots operating on production lines. Notable challenges arise from the fact that a task performed multiple times may exhibit different duration in each repetition and that the time series reported by the sensors are irregularly sampled because of data gaps. The anomaly detection approach presented in this paper consists of three stages. The first stage identifies the repetitive cycles in the lengthy time series and segments them into individual time series corresponding to one task cycle, while accounting for possible temporal distortions. The second stage computes a prototype for the cycles using a GPU-based barycenter algorithm, specifically tailored for very large time series. The third stage uses the prototype to detect abnormal cycles by computing an anomaly score for each cycle. The overall approach, named WarpEd Time Series ANomaly Detection (WETSAND), makes use of the Dynamic Time Warping algorithm and its variants because they are suited to the distorted nature of the time series. The experiments show that WETSAND scales to large signals, computes human-friendly prototypes, works with very little data, and outperforms some general purpose anomaly detection approaches such as autoencoders.
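A minimal sketch of stages two and three, assuming the cycles have already been segmented: tslearn's CPU DTW barycenter stands in for the paper's GPU-based barycenter, and the simple threshold at the end is only an example of how the anomaly scores could be consumed.

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging
from tslearn.metrics import dtw

def cycle_anomaly_scores(cycles):
    """cycles: list of arrays, one per segmented task cycle.
    Returns each cycle's DTW distance to a common prototype."""
    prototype = dtw_barycenter_averaging(cycles)          # stage 2: DTW prototype
    return np.array([dtw(c, prototype) for c in cycles])  # stage 3: per-cycle score

cycles = [np.sin(np.linspace(0, 6.28, 200)) + 0.05 * np.random.randn(200) for _ in range(10)]
scores = cycle_anomaly_scores(cycles)
anomalous = scores > scores.mean() + 3 * scores.std()     # example thresholding rule
```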
https://arxiv.org/abs/2404.12134
This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks without sacrificing clean accuracy.
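The sketch below illustrates the core idea in PyTorch: a PGD-style adaptive attack whose loss simultaneously tries to fool the classifier and to look benign to the detector, producing examples that can be mixed back into the detector's training set. The hyperparameters and the binary "benign = 0 / attack = 1" convention are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def joint_adaptive_attack(classifier, detector, x, y, eps=8/255, alpha=2/255, steps=10):
    """PGD-style sketch: perturb x to change the classifier's prediction while
    pushing the detector toward the 'benign' label (0)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = (F.cross_entropy(classifier(x_adv), y)                     # fool the classifier
                - F.cross_entropy(detector(x_adv), torch.zeros_like(y)))  # ...while looking benign
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

# During retraining, such examples (detector label 1 = "attack") are mixed with clean
# samples (label 0), so the detector adapts to attacks optimized to evade it.
```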
https://arxiv.org/abs/2404.12120
Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation in various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention than single pixels. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformer (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked-attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (this https URL).
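For readers unfamiliar with mask-classification decoding, the sketch below shows the generic inference step that turns per-query mask proposals and class scores into a pixel-wise change map; it follows the usual mask-prediction decoder recipe rather than MaskCD's exact heads, and the two-class convention is an assumption.

```python
import torch

def masks_to_change_map(mask_logits, class_logits, changed_class=1):
    """mask_logits:  (Q, H, W) per-query mask predictions
       class_logits: (Q, C)    per-query class predictions (e.g. C=2: unchanged/changed)
       Returns an (H, W) soft map of 'changed' evidence."""
    mask_probs = mask_logits.sigmoid()          # where each query fires
    class_probs = class_logits.softmax(dim=-1)  # what each query represents
    # Weight each query's mask by its probability of being 'changed', then sum.
    change_map = torch.einsum("q,qhw->hw", class_probs[:, changed_class], mask_probs)
    return change_map.clamp(0, 1)

change_mask = masks_to_change_map(torch.randn(20, 256, 256), torch.randn(20, 2)) > 0.5
```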
https://arxiv.org/abs/2404.12081
Micro-expressions (MEs) are involuntary movements revealing people's hidden feelings, and they have attracted considerable interest for their objectivity in emotion detection. However, despite its wide applicability in various scenarios, micro-expression recognition (MER) remains a challenging problem in real life for three reasons: (i) data level: lack of data and imbalanced classes; (ii) feature level: subtle, rapidly changing, and complex features of MEs; and (iii) decision-making level: impact of individual differences. To address these issues, we propose a dual-branch meta-auxiliary learning method, called LightmanNet, for fast and robust micro-expression recognition. Specifically, LightmanNet learns general MER knowledge from limited data through a dual-branch bi-level optimization process: (i) In the first level, it obtains task-specific MER knowledge by learning in two branches, where the first branch learns MER features via primary MER tasks, while the other branch guides the model to obtain discriminative features via auxiliary tasks, i.e., image alignment between micro-expressions and macro-expressions, given their resemblance in both spatial and temporal behavioral patterns. The two branches jointly constrain the model to learn meaningful task-specific MER knowledge while avoiding noise or superficial connections between MEs and emotions that may damage its generalization ability. (ii) In the second level, LightmanNet further refines the learned task-specific knowledge, improving model generalization and efficiency. Extensive experiments on various benchmark datasets demonstrate the superior robustness and efficiency of LightmanNet.
https://arxiv.org/abs/2404.12024
Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.
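As a rough illustration of the underlying intuition (CLIP's dense features already carry affordance signal), the sketch below probes raw patch-text similarity for an arbitrary action prompt using the Hugging Face CLIP implementation. This is a simplification: AffordanceCLIP trains a small number of additional parameters, whereas here the similarity map is read off directly, and the model name and prompt are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def affordance_heatmap(image: Image.Image, action_prompt: str):
    inputs = processor(text=[action_prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Dense visual tokens (CLS dropped), projected into the shared image-text space.
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patches = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])
        text = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    patches = patches / patches.norm(dim=-1, keepdim=True)
    text = text / text.norm(dim=-1, keepdim=True)
    sim = (patches @ text.T).squeeze(-1)   # cosine similarity per patch
    side = int(sim.shape[-1] ** 0.5)       # 7x7 patch grid for ViT-B/32 at 224 px
    return sim.reshape(side, side)         # coarse per-patch affordance map

heatmap = affordance_heatmap(Image.new("RGB", (224, 224)), "something to cut with")
```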
https://arxiv.org/abs/2404.12015
Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we show that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior through the clustering results of a particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$, which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves performance comparable to the best-performing open-vocabulary object detectors that use base annotations. Code is released at this https URL
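The sketch below conveys the gist of exploiting CLIP's intermediate-layer clustering as an instance prior: patch tokens are clustered, and each cluster centroid becomes a point prompt that could be passed to SAM's point-prompt interface to discover one object mask per group. The layer choice, the k-means grouping, and the helper names are assumptions, not Zip's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def boundary_prior_point_prompts(patch_features, grid_hw, image_hw, n_groups=8):
    """patch_features: (N, D) tokens from an intermediate CLIP layer, N = grid_h * grid_w.
    Clusters the tokens and returns one (x, y) image coordinate per cluster centre,
    suitable as point prompts for a promptable segmenter such as SAM."""
    grid_h, grid_w = grid_hw
    img_h, img_w = image_hw
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(patch_features)
    ys, xs = np.divmod(np.arange(len(labels)), grid_w)  # patch-grid coordinates
    prompts = []
    for g in range(n_groups):
        member = labels == g
        if member.any():
            cy, cx = ys[member].mean(), xs[member].mean()
            prompts.append((cx / grid_w * img_w, cy / grid_h * img_h))
    return np.array(prompts)

points = boundary_prior_point_prompts(np.random.randn(14 * 14, 768), (14, 14), (640, 480))
```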
https://arxiv.org/abs/2404.11957
Interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector and then fed to an action recognition model for extracting video features and learning the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Particularly, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, thus avoiding the heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we outperform state-of-the-art methods on both datasets.
https://arxiv.org/abs/2404.11903
This paper focuses on explaining changes over time in globally-sourced, annual temporal data, with the specific objective of identifying pivotal factors that contribute to these temporal shifts. Leveraging such analytical frameworks can yield transformative impacts, including the informed refinement of public policy and the identification of key drivers affecting a country's economic evolution. We employ Local Interpretable Model-agnostic Explanations (LIME) to shed light on national happiness indices, economic freedom, and population metrics, spanning variable time frames. Acknowledging the presence of missing values, we employ three imputation approaches to generate robust multivariate time-series datasets apt for LIME's input requirements. Our methodology's efficacy is substantiated through a series of empirical evaluations involving multiple datasets. These evaluations include comparative analyses against random feature selection, correlation with real-world events as elucidated by LIME, and validation through Individual Conditional Expectation (ICE) plots, a state-of-the-art technique proficient in feature importance detection.
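A minimal sketch of the pipeline described above: interpolate the gaps, fit a simple regressor on the yearly indicators, and ask LIME which features drove the latest prediction. The indicator names, the toy data, and the choice of linear interpolation and a random forest are placeholders for the paper's datasets, imputation variants, and models.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

# Toy yearly indicators for one country, with missing values.
df = pd.DataFrame({
    "economic_freedom":  [7.1, np.nan, 7.4, 7.6],
    "population_growth": [1.2, 1.1, np.nan, 0.9],
    "happiness":         [5.4, 5.5, 5.7, 5.8],
}, index=[2018, 2019, 2020, 2021])

# One of several possible imputation choices: linear interpolation over time.
df = df.interpolate(method="linear", limit_direction="both")

X = df[["economic_freedom", "population_growth"]].values
y = df["happiness"].values
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=["economic_freedom", "population_growth"],
                                 mode="regression", discretize_continuous=False)
explanation = explainer.explain_instance(X[-1], model.predict, num_features=2)
print(explanation.as_list())  # per-feature contributions to the latest year's prediction
```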
https://arxiv.org/abs/2404.11874
Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. Consequently, the effectiveness of developed BEV encoders crucially depends on the operators used to aggregate temporal information and on the latent representation spaces used. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
https://arxiv.org/abs/2404.11803
LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
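As a concrete illustration of the supervised contrastive component, here is a minimal SupCon-style loss in which positives are same-class object embeddings regardless of their source domain; it omits the multimodal fusion and 3D detection heads and is not the paper's full CLIX$^\text{3D}$ objective.

```python
import torch
import torch.nn.functional as F

def cross_domain_supcon_loss(features, labels, temperature=0.1):
    """features: (N, D) object embeddings pooled from several source domains.
    labels: (N,) class ids. Same-class pairs from any domain are pulled together,
    different classes pushed apart (SupCon-style)."""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature
    self_mask = torch.eye(len(z), device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]).float() * (1 - self_mask)
    logits = sim - 1e9 * self_mask                          # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_sample = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -per_sample.mean()

loss = cross_domain_supcon_loss(torch.randn(32, 128), torch.randint(0, 5, (32,)))
```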
https://arxiv.org/abs/2404.11764
Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the first comprehensive framework for the automatic detection of communal violence markers in online Bangla content, accompanied by the largest collection (13K raw sentences) of social media interactions that fall under the definition of four major violence classes and their 16 coarse expressions. Our workflow introduces a 7-step expert annotation process incorporating insights from social scientists, linguists, and psychologists. By presenting data statistics and benchmarking performance on this dataset, we have determined that, aside from the category of Non-communal violence, Religio-communal violence is particularly pervasive in Bangla text. Moreover, we have substantiated the effectiveness of fine-tuning language models for identifying violent comments by conducting preliminary benchmarking on the state-of-the-art Bangla deep learning model.
https://arxiv.org/abs/2404.11752
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information about geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework that considers spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flipping, rotation, and scene flow. For spatial augmentations, we find that, depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields the best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show that our pre-training method for 3D object detection outperforms existing equivariant and invariant approaches in many settings.
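For readers unfamiliar with equivariant (as opposed to invariant) pre-training losses, here is a minimal sketch for a single spatial transformation: features of the rotated input are asked to match the rotated features of the original input, i.e. f(T(x)) ≈ T(f(x)). The per-point MLP encoder and the rotation-only setting are placeholders; the paper combines several spatial transformations with a scene flow-based temporal objective.

```python
import math
import torch
import torch.nn.functional as F

def equivariance_loss(encoder, points, rotation):
    """points: (N, 3) point cloud; rotation: (3, 3) rotation matrix.
    encoder maps (N, 3) points to (N, 3) geometric features (e.g. predicted offsets).
    Equivariance: encoding the rotated cloud should equal rotating the encoding."""
    f_of_rotated = encoder(points @ rotation.T)   # f(T(x))
    rotated_f = encoder(points) @ rotation.T      # T(f(x))
    return F.mse_loss(f_of_rotated, rotated_f)

# Placeholder encoder standing in for a real 3D backbone.
encoder = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
c, s = math.cos(0.3), math.sin(0.3)
R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
loss = equivariance_loss(encoder, torch.randn(1024, 3), R)
```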
https://arxiv.org/abs/2404.11737