Audiovisual emotion recognition (ER) in videos has immense potential over unimodal approaches because it effectively leverages the inter- and intra-modal dependencies between the visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. The framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to relying on a single modality alone. The proposed model uses separate backbones to capture intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
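For intuition, here is a minimal sketch of the fusion idea in PyTorch: each modality's token sequence queries the other through cross-attention, and the pooled results are concatenated for prediction. The dimensions, pooling, and output head below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion sketch: each modality attends to the other."""
    def __init__(self, dim=512, heads=8, num_outputs=2):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_outputs)  # e.g., valence/arousal

    def forward(self, audio, visual):
        # audio: (B, Ta, dim) from an audio backbone; visual: (B, Tv, dim)
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)
        fused = torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
        return self.head(fused)

out = CrossModalFusion()(torch.randn(2, 50, 512), torch.randn(2, 16, 512))
```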
https://arxiv.org/abs/2403.10488
Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow on the 1/16 resolution, capturing large displacement, which is then refined on the 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
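Conceptually, the global-matching step computes an all-pairs cost volume between coarse feature maps and reads off an initial flow as a soft-argmax over target coordinates. The sketch below is a simplified stand-in with assumed shapes, not NeuFlow's exact module; the 1/8-resolution CNN refinement is omitted.

```python
import torch

def global_match(f1, f2):
    """Initial flow at a coarse (e.g., 1/16) resolution via global matching.

    f1, f2: (B, C, H, W) feature maps of the two frames.
    Returns a dense flow field of shape (B, 2, H, W)."""
    B, C, H, W = f1.shape
    q = f1.flatten(2).transpose(1, 2)                  # (B, HW, C)
    k = f2.flatten(2)                                  # (B, C, HW)
    corr = torch.matmul(q, k) / C ** 0.5               # all-pairs cost volume
    prob = corr.softmax(dim=-1)                        # matching distribution
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], -1).float().view(1, H * W, 2).to(f1)
    match = torch.matmul(prob, grid)                   # expected target coords
    return (match - grid).transpose(1, 2).reshape(B, 2, H, W)
```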
https://arxiv.org/abs/2403.10425
Large-scale applications of Visual Place Recognition (VPR) require computationally efficient approaches. Further, a well-balanced combination of data-based and training-free approaches can decrease the required amount of training data and effort and can reduce the influence of distribution shifts between the training and application phases. This paper proposes a runtime- and data-efficient hierarchical VPR pipeline that extends existing approaches and presents novel ideas. There are three main contributions: First, we propose Local Positional Graphs (LPG), a training-free and runtime-efficient approach to encode spatial context information of local image features. LPG can be combined with existing local feature detectors and descriptors and considerably improves the image-matching quality compared to existing techniques in our experiments. Second, we present Attentive Local SPED (ATLAS), an extension of our previous local features approach with an attention module that improves the feature quality while maintaining high data efficiency. The influence of the proposed modifications is evaluated in an extensive ablation study. Third, we present a hierarchical pipeline that exploits hyperdimensional computing to use the same local features as holistic HDC-descriptors for fast candidate selection and for candidate reranking. We combine all contributions in a runtime- and data-efficient VPR pipeline that shows benefits over the state-of-the-art method Patch-NetVLAD on a large collection of standard place recognition datasets, with 15% better VPR accuracy, 54x faster feature comparison speed, and 55x less descriptor storage occupancy, making our method promising for real-world high-performance large-scale VPR in changing environments. Code will be made available with publication of this paper.
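To illustrate the third contribution, the sketch below bundles local descriptors into a single holistic hyperdimensional (HDC) descriptor: each descriptor is bipolarized, bound to a random hypervector for its spatial grid cell, and the bound vectors are summed. The random projection, grid size, and dimensionality are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def hdc_holistic(descriptors, positions, dim=4096, grid=8, seed=0):
    """Bundle local features into one holistic HDC descriptor (sketch).

    descriptors: (N, D) local descriptors; positions: (N, 2) in [0, 1]."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((descriptors.shape[1], dim))    # D -> dim
    pos_hv = rng.choice([-1.0, 1.0], size=(grid, grid, dim))   # cell hypervectors
    cells = np.minimum((positions * grid).astype(int), grid - 1)
    feat_hv = np.sign(descriptors @ proj)                      # bipolarize
    bound = feat_hv * pos_hv[cells[:, 0], cells[:, 1]]         # bind to position
    return np.sign(bound.sum(axis=0))                          # bundle

# fast candidate selection: compare holistic descriptors by normalized dot product
similarity = lambda a, b: float(a @ b) / len(a)
```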
https://arxiv.org/abs/2403.10283
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts in deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but come with high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to preserve spatial and temporal dependencies. The transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and remarkably simple, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and a semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. The GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve the model's generalization ability. Extensive experiments on intra-dataset, cross-dataset, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to the state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at this https URL.
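A minimal version of the layout transform can be written in a few lines: mask the same spatial positions in every frame, resize frames to sub-frames, and tile them into one thumbnail image that any 2D backbone can consume. The grid size, output resolution, and mask ratio below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def thumbnail_layout(clip, rows=2, cols=4, out=224, mask_ratio=0.1):
    """Rearrange a clip (B, T, C, H, W) into thumbnails (sketch); T == rows*cols."""
    B, T, C, H, W = clip.shape
    assert T == rows * cols
    keep = (torch.rand(1, 1, 1, H, W, device=clip.device) > mask_ratio).float()
    clip = clip * keep                                  # same positions in each frame
    sub_h, sub_w = out // rows, out // cols
    sub = F.interpolate(clip.flatten(0, 1), size=(sub_h, sub_w),
                        mode="bilinear", align_corners=False)
    sub = sub.view(B, rows, cols, C, sub_h, sub_w)
    return sub.permute(0, 3, 1, 4, 2, 5).reshape(B, C, rows * sub_h, cols * sub_w)
```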
https://arxiv.org/abs/2403.10261
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single-domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets, and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges through joint learning of a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
https://arxiv.org/abs/2403.10245
In the wake of the global spread of monkeypox, accurate disease recognition has become crucial. This study introduces an improved SE-InceptionV3 model, embedding the SENet module and incorporating L2 regularization into the InceptionV3 framework to enhance monkeypox disease detection. Utilizing the Kaggle monkeypox dataset, which includes images of monkeypox and similar skin conditions, our model demonstrates a noteworthy accuracy of 96.71% on the test set, outperforming conventional methods and deep learning models. The SENet module's channel attention mechanism significantly elevates feature representation, while L2 regularization ensures robust generalization. Extensive experiments validate the model's superiority in precision, recall, and F1 score, highlighting its effectiveness in differentiating monkeypox lesions in diverse and complex cases. The study not only provides insights into the application of advanced CNN architectures in medical diagnostics but also opens avenues for further research in model optimization and hyperparameter tuning for enhanced disease recognition. The code is available at this https URL.
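The SENet module referred to here is the standard squeeze-and-excitation block, which reweights channels with a small gating MLP; a sketch follows, with the common reduction ratio of 16 assumed rather than taken from the paper. L2 regularization is typically realized through the optimizer's weight-decay term.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (as inserted into InceptionV3)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze, then excitation weights
        return x * w[:, :, None, None]          # channel-wise reweighting

# L2 regularization via weight decay (coefficient is an illustrative value):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```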
https://arxiv.org/abs/2403.10087
Most existing one-shot skeleton-based action recognition methods focus on raw low-level information (e.g., joint locations), and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose CrossGLG, which leverages text descriptions generated by large language models (LLMs) that contain high-level human knowledge to guide feature learning in a global-local-global way. Particularly, during training, we design two prompts to gain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry issue between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel class inference without any text input, making the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms the existing SOTA methods by large margins, while the inference cost (model size) is only 2.8% of that of the previous SOTA. CrossGLG can also serve as a plug-and-play module that substantially enhances the performance of different SOTA skeleton encoders at a negligible cost during inference. The source code will be released soon.
https://arxiv.org/abs/2403.10082
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model on scene recognition benchmarks and the paradigm of text block detection, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even Large Language Models (LLMs).
https://arxiv.org/abs/2403.10047
This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model which has the potential to be widely applied and expanded in the fields of robotics and mobile devices. By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements of neural network models for point cloud processing tasks compared to full-precision point cloud networks. However, achieving a fully binary point cloud Transformer network, where all parts except the task-specific modules are binary, poses challenges and bottlenecks in quantizing the activations of Q, K, V and self-attention in the attention module, as they do not adhere to simple probability distributions and can vary with the input data. Furthermore, in our network, the self-attention in the binary attention module degrades because the softmax operation produces a near-uniform distribution. The primary focus of this paper is on addressing the performance degradation caused by the use of binary point cloud Transformer modules. We propose a novel binarization mechanism called dynamic-static hybridization. Specifically, our approach combines static binarization of the overall network model with fine-granularity dynamic binarization of data-sensitive components. Furthermore, we make use of a novel hierarchical training scheme to obtain the optimal model and binarization parameters. These improvements allow the proposed binarization method to outperform binarization methods designed for convolutional neural networks when used in point cloud Transformer structures. To demonstrate the superiority of our algorithm, we conducted experiments on two different tasks: point cloud classification and place recognition.
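The static side of such schemes is usually a sign binarizer trained with a straight-through estimator, while the dynamic side adapts the binarization to the current input. The sketch below shows this generic pattern only; the shift statistic and its placement are assumptions, not FBPT's exact mechanism.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()   # pass gradient only in [-1, 1]

def dynamic_binarize(x):
    """Data-dependent binarization for sensitive activations (e.g., Q, K, V):
    re-center by a per-sample statistic before taking the sign."""
    shift = x.mean(dim=-1, keepdim=True)           # recomputed per input
    return BinarizeSTE.apply(x - shift), shift
```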
https://arxiv.org/abs/2403.09998
Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at this https URL.
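Of the three ingredients, co-teaching is the easiest to picture: two networks each select their small-loss ("likely clean") samples and train their peer on them. A generic sketch follows, with an illustrative fixed keep ratio that would normally be scheduled over training.

```python
import torch
import torch.nn.functional as F

def co_teaching_step(net_a, net_b, x, y, opt_a, opt_b, keep_ratio=0.7):
    """One co-teaching update on a batch with possibly noisy labels y."""
    loss_a = F.cross_entropy(net_a(x), y, reduction="none")
    loss_b = F.cross_entropy(net_b(x), y, reduction="none")
    k = max(1, int(keep_ratio * len(y)))
    idx_a = loss_a.argsort()[:k]        # A's small-loss picks -> train B
    idx_b = loss_b.argsort()[:k]        # B's small-loss picks -> train A
    opt_a.zero_grad(); loss_a[idx_b].mean().backward(); opt_a.step()
    opt_b.zero_grad(); loss_b[idx_a].mean().backward(); opt_b.step()
```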
https://arxiv.org/abs/2403.09975
Machine learning models have achieved significant milestones in various domains: computer vision models deliver exceptional results in object recognition, and in natural language processing, Large Language Models (LLMs) like GPT can hold a conversation with human-like proficiency. However, abstract reasoning remains a challenge for these models; whether AI can really think like a human is still an open question. Raven's Progressive Matrices (RPM) is a test designed to assess human reasoning capabilities. It presents a series of eight images as a problem set, where the participant should discover the underlying rules among these images and select, from eight possible options, the image that best completes the sequence. This task has long been used to test human reasoning abilities and IQ. Zhang et al. proposed a dataset called RAVEN which can be used to test the abstract reasoning ability of machine learning models. In this paper, we propose a Vision Transformer Contrastive Network that builds on previous work with the Contrastive Perceptual Inference network (CoPiNet), which set a new benchmark for permutation-invariant models on Raven's Progressive Matrices by incorporating contrast effects from psychology, cognition, and education, and extends this foundation by leveraging the cutting-edge Vision Transformer architecture. This integration aims to further refine the machine's ability to process and reason about spatial-temporal information from pixel-level inputs and global features on the RAVEN dataset.
https://arxiv.org/abs/2403.09962
Neural network quantization is an essential technique for deploying models on resource-constrained devices. However, its impact on model perceptual fields, particularly regarding class activation maps (CAMs), remains a significant area of investigation. In this study, we explore how quantization alters the spatial recognition ability of the perceptual field of vision models, shedding light on the alignment between CAMs and visual saliency maps across various architectures. Leveraging a dataset of 10,000 images from ImageNet, we rigorously evaluate six diverse foundational CNNs: VGG16, ResNet50, EfficientNet, MobileNet, SqueezeNet, and DenseNet. We uncover nuanced changes in CAMs and their alignment with human visual saliency maps through systematic quantization techniques applied to these models. Our findings reveal the varying sensitivities of different architectures to quantization and underscore its implications for real-world applications in terms of model performance and interpretability. The primary contribution of this work revolves around deepening our understanding of neural network quantization, providing insights crucial for deploying efficient and interpretable models in practical settings.
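A lightweight way to reproduce this kind of probe is to fake-quantize the weights of a pretrained CNN and compare class activation maps before and after. The sketch below uses simple symmetric per-tensor weight quantization and the classic CAM formulation; it is a probing aid under those assumptions, not the study's evaluation pipeline.

```python
import copy
import torch
from torchvision.models import resnet50

def quantize_weights(model, bits=8):
    """Fake-quantize conv/linear weights (symmetric, per-tensor) for probing."""
    q = copy.deepcopy(model)
    for m in q.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
            w, qmax = m.weight.data, 2 ** (bits - 1) - 1
            scale = w.abs().max() / qmax
            m.weight.data = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return q

def cam(feats, fc_weight, cls):
    """Class activation map: classifier weights over final conv features."""
    # feats: (B, C, h, w); fc_weight: (num_classes, C)
    return torch.einsum("c,bchw->bhw", fc_weight[cls], feats)

model = resnet50(weights=None).eval()
qmodel = quantize_weights(model)
# hook the last conv stage of both models, then compare their cam(...) outputs
```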
https://arxiv.org/abs/2403.09939
Human affective behavior analysis aims to delve into human expressions and behaviors to deepen our understanding of human emotions. Basic expression categories (EXPR) and Action Units (AUs) are two essential components in this analysis, which categorize emotions and break down facial movements into elemental units, respectively. Despite advancements, existing approaches in expression classification and AU detection often necessitate complex models and substantial computational resources, limiting their applicability in everyday settings. In this work, we introduce the first lightweight framework adept at efficiently tackling both expression classification and AU detection. This framework employs a frozen CLIP image encoder alongside a trainable multilayer perceptron (MLP), enhanced with Conditional Value at Risk (CVaR) for robustness and a loss landscape flattening strategy for improved generalization. Experimental results on the Aff-wild2 dataset demonstrate superior performance in comparison to the baseline while maintaining minimal computational demands, offering a practical solution for affective behavior analysis. The code is available at this https URL
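The CVaR term has a particularly compact empirical form: average only the worst alpha-fraction of per-sample losses, which focuses training on hard examples. A sketch, with alpha chosen illustratively:

```python
import torch
import torch.nn.functional as F

def cvar_cross_entropy(logits, targets, alpha=0.2):
    """CVaR objective: mean of the worst alpha-fraction of per-sample losses."""
    losses = F.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(alpha * losses.numel()))
    return torch.topk(losses, k).values.mean()
```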
https://arxiv.org/abs/2403.09915
Generative AI (GenAI) is transforming creative workflows through the capability to synthesize and manipulate images via high-level prompts. Yet creatives are poorly supported in receiving recognition or reward for the use of their content in GenAI training. To this end, we propose ProMark, a causal attribution technique to attribute a synthetically generated image to its training-data concepts such as objects, motifs, templates, artists, or styles. The concept information is proactively embedded into the input training images using imperceptible watermarks, and the diffusion models (unconditional or conditional) are trained to retain the corresponding watermarks in generated images. We show that we can embed as many as 2^16 unique watermarks into the training data, and each training image can contain more than one watermark. ProMark maintains image quality whilst outperforming correlation-based attribution. Finally, several qualitative examples are presented, providing confidence that the presence of the watermark conveys a causal relationship between the training data and the synthetic images.
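ProMark's watermarks are imperceptible and survive diffusion training; the toy sketch below only illustrates the proactive embed-then-detect idea with a fixed additive pseudorandom pattern keyed by the concept id. Real systems use learned watermark encoders/decoders, and all values here are illustrative assumptions.

```python
import torch

def embed_watermark(images, concept_ids, alpha=2 / 255):
    """Add a per-concept bipolar pattern; images: (B, C, H, W) in [0, 1]."""
    pats = []
    for cid in concept_ids.tolist():                   # id in [0, 2^16)
        g = torch.Generator().manual_seed(int(cid))    # pattern keyed by concept
        pats.append(torch.rand(images.shape[1:], generator=g).round() * 2 - 1)
    return (images + alpha * torch.stack(pats)).clamp(0, 1)

def detect(image, cid, alpha=2 / 255):
    """Correlation score against one candidate concept pattern (~1 if present)."""
    g = torch.Generator().manual_seed(int(cid))
    pat = torch.rand(image.shape, generator=g).round() * 2 - 1
    return float((image * pat).mean()) / alpha
```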
https://arxiv.org/abs/2403.09914
Tasks such as autonomous navigation, 3D reconstruction, and object recognition near the water surface are crucial in marine robotics applications. However, challenges arise due to dynamic disturbances, e.g., light reflections and refraction from the random air-water interface, irregular liquid flow, and similar factors, which can lead to potential failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, significantly complicating these tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion of an object's presence without its actual physical location. This work proposes a novel approach for segmenting real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. Through motion- and geometry-aware design choices and comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in an unseen real-world domain, achieving an IoU over 78% and an F1-score over 86% while ensuring a small computational footprint. MARVIS offers over 43 FPS (8 FPS) inference rates on a single GPU (CPU core). Our code and dataset are available at this https URL.
https://arxiv.org/abs/2403.09850
3D hand poses are an under-explored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture the objects and environments with which humans interact. To efficiently model hand-object interactions, we propose HandFormer, a novel multimodal transformer. HandFormer combines 3D hand poses at a high temporal resolution for fine-grained motion modeling with sparsely sampled RGB frames for encoding scene semantics. Observing the unique characteristics of hand poses, we temporally factorize hand modeling and represent each joint by its short-term trajectories. This factorized pose representation, combined with sparse RGB samples, is remarkably efficient and achieves high accuracy. Unimodal HandFormer with only hand poses outperforms existing skeleton-based methods with 5x fewer FLOPs. With RGB, we achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.
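The factorized representation is easy to sketch: split the pose sequence into short windows and turn each joint's trajectory within a window into one token, so a transformer attends over far fewer, more informative tokens. Window length, dimensions, and the joint-id embedding below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryEmbed(nn.Module):
    """Per-joint short-term trajectory tokens (sketch); T must divide by window."""
    def __init__(self, window=8, dim=128, joints=21):
        super().__init__()
        self.window = window
        self.embed = nn.Linear(window * 3, dim)       # xyz over a short window
        self.joint_id = nn.Embedding(joints, dim)     # which joint a token is

    def forward(self, pose):                          # pose: (B, T, J, 3)
        B, T, J, _ = pose.shape
        w = self.window
        seg = pose.view(B, T // w, w, J, 3).permute(0, 1, 3, 2, 4)
        tok = self.embed(seg.flatten(3)) + self.joint_id.weight
        return tok.flatten(1, 2)                      # (B, (T//w)*J, dim) tokens
```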
https://arxiv.org/abs/2403.09805
Weakly supervised surgical instrument segmentation with only instrument presence labels has rarely been explored in the surgical domain. To mitigate this highly under-constrained problem, we extend a two-stage weakly supervised segmentation paradigm with temporal attributes from two perspectives. From a temporal equivariance perspective, we propose a prototype-based temporal equivariance regulation loss to enhance pixel-wise consistency between adjacent features. From a semantic continuity perspective, we propose a class-aware temporal semantic continuity loss to constrain the semantic consistency between a global view of the target frame and local non-discriminative regions of an adjacent reference frame. To the best of our knowledge, WeakSurg is the first instrument-presence-only weakly supervised segmentation architecture to take temporal information into account for surgical scenarios. Extensive experiments are conducted on Cholec80, an open benchmark for phase and instrument recognition, for which we annotate instance-wise instrument labels at fixed time-steps, double-checked by a clinician with 3 years of experience. Our results show that WeakSurg compares favorably with state-of-the-art methods not only on semantic segmentation metrics but also on instance segmentation metrics.
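As a rough picture of the temporal-equivariance idea, adjacent frames' pixel features can be softly assigned to class prototypes and the assignments encouraged to agree. The sketch below is loosely inspired by that description; the soft-assignment form, temperature, and KL objective are assumptions, not WeakSurg's exact losses.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(feat_t, feat_t1, prototypes, tau=0.1):
    """feat_t, feat_t1: (B, C, H, W) features of adjacent (aligned) frames;
    prototypes: (K, C) class prototypes. Penalizes assignment drift over time."""
    def assign(f):
        f = F.normalize(f.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
        p = F.normalize(prototypes, dim=-1)                    # (K, C)
        return F.softmax(torch.matmul(f, p.t()) / tau, dim=-1) # soft assignments
    a_t, a_t1 = assign(feat_t), assign(feat_t1)
    return F.kl_div(a_t1.clamp_min(1e-8).log(), a_t, reduction="batchmean")
```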
https://arxiv.org/abs/2403.09551
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
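The partition idea can be sketched compactly: group the (frame, joint) token grid by a frame partition and a joint partition, and run self-attention only within each group, as below. SkateFormer combines four such relation types (near/distant joints x near/distant frames); the single grouping, group sizes, and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class PartitionAttention(nn.Module):
    """Attention within skeletal-temporal partitions (simplified Skate-MSA)."""
    def __init__(self, dim=64, heads=4, frame_group=4, joint_group=5):
        super().__init__()
        self.fg, self.jg = frame_group, joint_group
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, T, J, C)
        B, T, J, C = x.shape
        fg, jg = self.fg, self.jg                  # require T % fg == J % jg == 0
        x = x.view(B, T // fg, fg, J // jg, jg, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, fg * jg, C)
        y, _ = self.attn(x, x, x)                  # attention inside each partition
        y = y.view(B, T // fg, J // jg, fg, jg, C)
        return y.permute(0, 1, 3, 2, 4, 5).reshape(B, T, J, C)
```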
https://arxiv.org/abs/2403.09508
Current training pipelines in object recognition neglect hue jittering when doing data augmentation, as it not only brings appearance changes that are detrimental to classification but is also inefficient to implement in practice. In this study, we investigate the effect of hue variance in the context of video recognition and find this variance to be beneficial, since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video recognition, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns rather than static appearances. Concretely, we propose an operation, SwapMix, to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance-invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the applicability of VA to other augmentation methods. Code is available at this https URL.
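The training recipe is easy to picture with a stand-in appearance change: perturb the clip's colors, then align predictions on the original and perturbed clips so the model leans on motion. Below, a random RGB-channel permutation stands in for SwapMix (which the paper designs for efficiency), and the KL form of the alignment is an assumption.

```python
import torch
import torch.nn.functional as F

def appearance_variant(clip):
    """Cheap appearance change: permute RGB channels; clip: (B, T, C, H, W)."""
    return clip[:, :, torch.randperm(3)]

def variation_alignment_loss(model, clip):
    """Align predictions on original vs. appearance-modified clips so the
    model learns appearance-invariant, motion-driven representations."""
    p = F.softmax(model(clip).detach(), dim=-1)          # original as target
    log_q = F.log_softmax(model(appearance_variant(clip)), dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")
```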
https://arxiv.org/abs/2403.09506
With the comprehensive research conducted on various face analysis tasks, there is growing interest among researchers in developing a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent new-coming semantics. This design enhances the unification of the model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, achieving or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at this https URL.
https://arxiv.org/abs/2403.09500