This paper presents the UniMER dataset, providing the first study of Mathematical Expression Recognition (MER) in complex real-world scenarios. The UniMER dataset consists of a large-scale training set, UniMER-1M, which offers unprecedented scale and diversity with one million training instances, and a meticulously designed test set, UniMER-Test, which reflects the diverse range of formula distributions prevalent in real-world scenarios. The UniMER dataset thus enables both the training of a robust, high-accuracy MER model and a comprehensive evaluation of model performance. Moreover, we introduce the Universal Mathematical Expression Recognition Network (UniMERNet), an innovative framework designed to enhance MER in practical scenarios. UniMERNet incorporates a Length-Aware Module to process formulas of varied lengths efficiently, enabling the model to handle complex mathematical expressions with greater accuracy. In addition, UniMERNet employs our UniMER-1M data and image augmentation techniques to improve the model's robustness under different noise conditions. Our extensive experiments demonstrate that UniMERNet outperforms existing MER models, setting a new benchmark across various scenarios and ensuring superior recognition quality in real-world applications. The dataset and model are available at this https URL.
https://arxiv.org/abs/2404.15254
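The UniMERNet abstract above does not specify how the Length-Aware Module works. As a loosely related illustration of one length-aware ingredient, the sketch below (plain Python; the token counting and bucket boundaries are assumptions, not the authors' design) groups formulas by token length so training batches can be drawn from comparable lengths:

    from collections import defaultdict

    def bucket_by_length(latex_strings, boundaries=(32, 96, 256)):
        """Group LaTeX formulas into length buckets (hypothetical boundaries)."""
        buckets = defaultdict(list)
        for s in latex_strings:
            n = len(s.split())                    # crude token count
            idx = sum(n > b for b in boundaries)  # bucket index 0..3
            buckets[idx].append(s)
        return buckets

    buckets = bucket_by_length([r"a + b", r"\frac{x}{y} = \sum_{i=1}^{n} x_i"])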
Face recognition applications have grown in parallel with the size of datasets, the complexity of deep learning models, and computational power. However, while deep learning models evolve to become more capable and computational power keeps increasing, the available datasets are being retracted and removed from public access. Privacy and ethical concerns are relevant topics within these domains. Through generative artificial intelligence, researchers have put effort into developing completely synthetic datasets that can be used to train face recognition systems. Nonetheless, recent advances have not been sufficient to achieve performance comparable to state-of-the-art models trained on real data. To study the drift between the performance of models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we study the distribution of each attribute within all four datasets. Additionally, we further inspect the differences between the real and synthetic datasets over the attribute set. When comparing them through the Kullback-Leibler divergence, we find differences between real and synthetic samples. Interestingly, we verify that while real samples suffice to explain the synthetic distribution, the converse is far from true.
https://arxiv.org/abs/2404.15234
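The Kullback-Leibler comparison in the abstract above is asymmetric by definition, which is what allows real samples to explain the synthetic distribution but not vice versa. A minimal NumPy sketch of that asymmetry on toy attribute histograms (the histograms are invented for illustration):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """D_KL(p || q) between two discrete attribute distributions."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    real      = [0.25, 0.25, 0.25, 0.25]  # toy attribute histogram
    synthetic = [0.48, 0.48, 0.02, 0.02]  # collapses onto two attribute values
    # D(synthetic || real) is small relative to D(real || synthetic) when the
    # synthetic data misses modes that the real data covers.
    print(kl_divergence(synthetic, real), kl_divergence(real, synthetic))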
Human decision-making often relies on visual information from multiple perspectives or views. In contrast, machine learning-based object recognition utilizes information from a single image of the object. However, the information conveyed by a single image may not be sufficient for accurate decision-making, particularly in complex recognition problems. The utilization of multi-view 3D representations for object recognition has thus far demonstrated the most promising results for achieving state-of-the-art performance. This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks. Specifically, we focus on deep learning-based and transformer-based techniques, as they are widely utilized and have achieved state-of-the-art performance. We provide detailed information about existing deep learning-based and transformer-based multi-view 3D object recognition models, including the most commonly used 3D datasets, camera configurations and number of views, view selection strategies, pre-trained CNN architectures, fusion strategies, and recognition performance on 3D classification and 3D retrieval tasks. Additionally, we examine various computer vision applications that use multi-view classification. Finally, we highlight key findings and future directions for developing multi-view 3D object recognition methods to provide readers with a comprehensive understanding of the field.
https://arxiv.org/abs/2404.15224
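Among the fusion strategies such a review covers, classic MVCNN-style view pooling is the simplest: per-view CNN features are collapsed with an element-wise max. A minimal PyTorch sketch (shapes and view count are illustrative assumptions, not tied to any one surveyed model):

    import torch

    def max_view_pool(view_feats):
        """Element-wise max over per-view features: (B, V, D) -> (B, D)."""
        return view_feats.max(dim=1).values

    feats = torch.randn(2, 12, 512)  # 2 objects, 12 views, 512-d CNN features
    fused = max_view_pool(feats)     # one shape descriptor per object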
Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
https://arxiv.org/abs/2404.15161
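The mutual-information term described in the MiDl abstract above can be written down directly: for two modality conditions, I(prediction; modality) is the entropy of the averaged prediction minus the average entropy of the per-condition predictions. The PyTorch sketch below follows that definition plus a KL self-distillation term; it is one reading of the abstract, not the authors' code, and the weight lam is an assumption:

    import torch
    import torch.nn.functional as F

    def entropy(p, eps=1e-12):
        return -(p * (p + eps).log()).sum(-1)

    def midl_style_loss(logits_av, logits_single, logits_teacher, lam=1.0):
        """MI between prediction and modality, plus self-distillation."""
        p_av = F.softmax(logits_av, -1)
        p_s = F.softmax(logits_single, -1)
        p_bar = 0.5 * (p_av + p_s)           # marginal over the two modalities
        mi = entropy(p_bar) - 0.5 * (entropy(p_av) + entropy(p_s))
        distill = F.kl_div(F.log_softmax(logits_av, -1),
                           F.softmax(logits_teacher, -1), reduction="batchmean")
        return mi.mean() + lam * distill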
Recording and identifying faint objects through atmospheric scattering media with an optical system is fundamentally interesting and technologically important. In this work, we introduce a comprehensive model that incorporates contributions from target characteristics, atmospheric effects, the imaging system, digital processing, and visual perception to assess the ultimate perceptible limit of geometrical imaging, specifically the angular resolution at the boundary of the visible distance. The model allows us to reevaluate the effectiveness of conventional image recording, processing, and perception and to analyze the limiting factors that constrain image recognition capabilities in atmospheric media. The simulations were compared with experimental results measured in a fog chamber and in outdoor settings. The results reveal generally good agreement between analysis and experiment, pointing the way toward harnessing the physical limit of optical imaging in scattering media. An immediate application of the study is a 1.2-fold extension of the imaging range with noise reduction via multi-frame averaging, greatly enhancing the capability of optical imaging in the atmosphere.
https://arxiv.org/abs/2404.15082
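The noise-reduction mechanism behind the 1.2-fold range extension reported above is ordinary multi-frame averaging, which shrinks zero-mean noise by the square root of the frame count. A toy NumPy sketch (scene and noise level invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    clean = 0.2 * np.ones((64, 64))                           # faint target
    frames = clean + rng.normal(0.0, 0.5, size=(16, 64, 64))  # 16 noisy frames

    avg = frames.mean(axis=0)          # multi-frame averaging
    print(frames[0].std(), avg.std())  # noise std drops by about sqrt(16) = 4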
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the coin by proposing a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels for semi-supervised FER. LEAF introduces a hierarchical expression-aware aggregation strategy that operates at three levels: semantic, instance, and category. (1) At the semantic and instance levels, LEAF decouples representations into expression-agnostic and expression-relevant components, and adaptively fuses them using learnable gating weights. (2) At the category level, LEAF assigns ambiguous pseudo-labels by decoupling predictions into positive and negative parts, and employs a consistency loss to ensure agreement between two augmented views of the same image. Extensive experiments on benchmark datasets demonstrate that by unveiling and harmonizing both sides of the coin, LEAF outperforms state-of-the-art semi-supervised FER methods, effectively leveraging both labeled and unlabeled data. Moreover, the proposed expression-aware aggregation strategy can be seamlessly integrated into existing semi-supervised frameworks, leading to significant performance gains.
https://arxiv.org/abs/2404.15041
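For step (1) of the LEAF abstract above, a minimal PyTorch sketch of fusing decoupled components with learnable gating weights is given below; how LEAF actually produces the two components is not specified here, so both inputs and the gate design are assumptions:

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """Fuse expression-agnostic and expression-relevant parts with gates."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, agnostic, relevant):
            g = torch.sigmoid(self.gate(torch.cat([agnostic, relevant], dim=-1)))
            return g * relevant + (1.0 - g) * agnostic

    fused = GatedFusion(128)(torch.randn(4, 128), torch.randn(4, 128))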
This paper presents the Discriminative Part Network (DP-Net), a deep architecture with strong interpretation capabilities, which exploits a pretrained Convolutional Neural Network (CNN) combined with a part-based recognition module. The system learns and detects parts in images that are discriminative among categories, without the need for fine-tuning the CNN, making it more scalable than other part-based models. While part-based approaches naturally offer interpretable representations, we propose explanations at both the image and category levels and introduce specific constraints on the part-learning process to make the parts more discriminative.
https://arxiv.org/abs/2404.15037
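One common way to realize part-based recognition on top of a frozen CNN is sketched here in PyTorch purely as an illustration; the 1x1-convolution part detectors, part count, and classifier are assumptions rather than DP-Net's actual design:

    import torch
    import torch.nn as nn

    features = torch.randn(4, 512, 14, 14)              # frozen CNN feature maps
    part_detectors = nn.Conv2d(512, 16, kernel_size=1)  # 16 learnable part filters
    classifier = nn.Linear(16, 10)                      # 10 hypothetical categories

    part_maps = part_detectors(features)                    # (4, 16, 14, 14) heatmaps
    part_scores = part_maps.flatten(2).max(dim=-1).values   # part presence scores
    logits = classifier(part_scores)                        # interpretable prediction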
Numerous prior studies predominantly emphasize constructing relation vectors for individual neighborhood points, generating dynamic kernels for each vector, and embedding these into high-dimensional spaces to capture implicit local structures. However, we contend that such implicit high-dimensional structure modeling approaches inadequately represent the local geometric structure of point clouds due to the absence of explicit structural information. Hence, we introduce X-3D, an explicit 3D structure modeling approach. X-3D works by capturing the explicit local structural information within the input 3D space and employing it to produce dynamic kernels with shared weights for all neighborhood points within the current local region. This modeling approach introduces an effective geometric prior and significantly diminishes the disparity between the local structure of the embedding space and the original input point cloud, thereby improving the extraction of local features. Experiments show that our method can be combined with a variety of methods and achieves state-of-the-art performance on segmentation, classification, and detection tasks at lower extra computational cost: 90.7% on ScanObjectNN for classification; 79.2% on S3DIS 6-fold and 74.3% on S3DIS Area 5 for segmentation; 76.3% on ScanNetV2 for segmentation; and 64.5% and 46.9% mAP on SUN RGB-D and 69.0% and 51.1% mAP on ScanNetV2 for detection. Our code is available at this https URL.
https://arxiv.org/abs/2404.15010
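The distinctive design point of X-3D, one dynamic kernel per local region shared by all of its neighbors, can be sketched in a few lines of PyTorch. The geometry summary and kernel generator below are simplifying assumptions standing in for whatever the paper actually uses:

    import torch
    import torch.nn as nn

    B, N, K, C = 2, 128, 16, 64        # batch, points, neighbors, channels
    rel_xyz = torch.randn(B, N, K, 3)  # explicit local structure (relative coords)
    feats = torch.randn(B, N, K, C)

    kernel_gen = nn.Linear(3, C)                   # geometry -> per-region kernel
    region_geom = rel_xyz.mean(dim=2)              # (B, N, 3) region summary
    kernel = kernel_gen(region_geom).unsqueeze(2)  # (B, N, 1, C), shared over K
    out = (feats * kernel).max(dim=2).values       # (B, N, C) local features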
Explanations obtained from transformer-based architectures in the form of raw attention can be seen as class-agnostic saliency maps. Additionally, attention-based pooling serves as a form of masking in the feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross-attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models while preserving recognition performance.
https://arxiv.org/abs/2404.14996
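A cross-attention pooling block of the kind CA-Stream describes, where a learned query attends over spatial tokens and the attention weights double as a saliency map, might look as follows in PyTorch (the head count and single-query design are assumptions):

    import torch
    import torch.nn as nn

    class CrossAttentionPool(nn.Module):
        """A learned query cross-attends to the feature map, replacing GAP."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, feat_map):                       # (B, C, H, W)
            B, C, H, W = feat_map.shape
            tokens = feat_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
            q = self.query.expand(B, -1, -1)
            pooled, attn = self.attn(q, tokens, tokens)
            return pooled.squeeze(1), attn.reshape(B, H, W)  # vector + saliency

    vec, saliency = CrossAttentionPool(256)(torch.randn(2, 256, 14, 14))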
Millimeter wave radar is recently gaining traction as a promising modality for enabling pervasive and privacy-preserving gesture recognition. However, the lack of rich and fine-grained radar datasets hinders progress in developing generalized deep learning models for gesture recognition across various user postures (e.g., standing, sitting), positions, and scenes. To remedy this, we design a software pipeline that exploits the wealth of available 2D videos to generate realistic radar data, which requires addressing the challenge of simulating the diversified and fine-grained reflection properties of user gestures. To this end, we design G3R with three key components: (i) a gesture reflection point generator that expands the arm's skeleton points to form human reflection points; (ii) a signal simulation model that simulates the multipath reflection and attenuation of radar signals to output the human intensity map; and (iii) an encoder-decoder model that combines a sampling module and a fitting module to address the differences in the number and distribution of points between generated and real-world radar data, thereby generating realistic radar data. We implement and evaluate G3R using 2D videos from public data sources and self-collected real-world radar data, demonstrating its superiority over other state-of-the-art approaches for gesture recognition.
https://arxiv.org/abs/2404.14934
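Component (i) of G3R amounts to densifying sparse skeleton joints into a cloud of candidate reflection points. A toy NumPy sketch (the interpolation count and jitter are invented parameters, not G3R's):

    import numpy as np

    def expand_skeleton(joints, bones, pts_per_bone=8, jitter=0.01):
        """Interpolate along each bone and jitter to form reflection points."""
        rng = np.random.default_rng(0)
        pts = []
        for a, b in bones:
            t = np.linspace(0.0, 1.0, pts_per_bone)[:, None]
            seg = joints[a] * (1 - t) + joints[b] * t  # points along the bone
            pts.append(seg + rng.normal(0.0, jitter, seg.shape))
        return np.concatenate(pts)

    joints = np.random.rand(4, 3)  # e.g., shoulder, elbow, wrist, hand
    points = expand_skeleton(joints, [(0, 1), (1, 2), (2, 3)])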
Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
https://arxiv.org/abs/2404.14906
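Late fusion over contrastively learned vision-language embeddings, as in the SRLF-Net abstract above, reduces to a few lines: score each view's embedding against class-text embeddings and average the per-view probabilities. A PyTorch sketch with random stand-ins for the encoder outputs (the 100.0 logit scale is the usual CLIP-style temperature, assumed here):

    import torch
    import torch.nn.functional as F

    view_embs = F.normalize(torch.randn(3, 512), dim=-1)    # 3 camera views
    class_embs = F.normalize(torch.randn(16, 512), dim=-1)  # 16 activity prompts
    logits = 100.0 * view_embs @ class_embs.T               # (views, classes)
    probs = F.softmax(logits, dim=-1).mean(dim=0)           # late fusion
    pred = probs.argmax().item()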
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noise of various types, revealing their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then utilizes inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thereby obtaining more semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help with better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
https://arxiv.org/abs/2404.14890
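The voting step of DENOISER's generative part can be sketched as a weighted combination of intra-modal (text-text) and inter-modal (video-text) similarity over candidate corrections; the embeddings, candidate set, and weight alpha below are all placeholders:

    import torch
    import torch.nn.functional as F

    def vote(cand_text_embs, noisy_text_emb, video_emb, alpha=0.5):
        """Pick the candidate class name with the highest combined similarity."""
        intra = F.cosine_similarity(cand_text_embs, noisy_text_emb[None], dim=-1)
        inter = F.cosine_similarity(cand_text_embs, video_emb[None], dim=-1)
        return (alpha * intra + (1 - alpha) * inter).argmax().item()

    best = vote(torch.randn(5, 512), torch.randn(512), torch.randn(512))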
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
https://arxiv.org/abs/2404.14860
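The OA post-processing is exactly the interpolation the abstract above describes; only the mixing weight below is an assumed parameter:

    import numpy as np

    def observation_adding(enhanced, observed, alpha=0.8):
        """Interpolate enhanced and observed signals to dilute artifact errors."""
        return alpha * enhanced + (1.0 - alpha) * observed

    enhanced = np.random.randn(16000)  # 1 s of enhanced speech at 16 kHz (toy)
    observed = np.random.randn(16000)  # the corresponding noisy observation
    asr_input = observation_adding(enhanced, observed)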
A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font, and layout) to the overall design. This complexity makes the comprehension of graphic design challenging, for it requires the capability to both recognize the design elements and understand the design. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design. Our benchmark includes eight tasks in total, across both the fine-grained element level and the overall design level. At the design element level, we consider both attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. Nine MLLMs are tested, and we apply GPT-4 as the evaluator. In addition, further experiments indicate that refining prompts can enhance the performance of MLLMs. We first rewrite the prompts with different LLMs and find that performance increases when a prompt is refined by the tested model's own LLM. We then add extra task knowledge in two different ways (text descriptions and image examples), finding that adding image examples boosts performance much more than adding text.
https://arxiv.org/abs/2404.14801
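The self-refinement protocol above can be sketched with a hypothetical llm(text) -> str helper standing in for any chat-model wrapper; the rewrite instruction and round count are illustrative, not the paper's exact prompt:

    def self_refine(llm, prompt, rounds=1):
        """Ask a model to rewrite its own task prompt; llm is hypothetical."""
        for _ in range(rounds):
            prompt = llm("Rewrite the following task prompt to be clearer and "
                         "more specific, keeping its intent unchanged:\n" + prompt)
        return prompt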
The wide deployment of Face Recognition (FR) systems poses risks of privacy leakage. One countermeasure to address this issue is adversarial attacks, which deceive malicious FR searches but simultaneously interfere with the normal identity verification of trusted authorizers. In this paper, we propose the first Double Privacy Guard (DPG) scheme based on traceable adversarial watermarking. DPG employs a one-time watermark embedding to deceive unauthorized FR models and allows authorizers to perform identity verification by extracting the watermark. Specifically, we propose an information-guided adversarial attack against FR models. The encoder embeds an identity-specific watermark into the deep feature space of the carrier, guiding recognizable features of the image to deviate from the source identity. We further adopt a collaborative meta-optimization strategy compatible with sub-tasks, which regularizes the joint optimization direction of the encoder and decoder. This strategy enhances the representation of universal carrier features, mitigating multi-objective optimization conflicts in watermarking. Experiments confirm that DPG achieves significant attack success rates and traceability accuracy on state-of-the-art FR models, exhibiting remarkable robustness that outperforms existing privacy protection methods based on adversarial attacks and deep watermarking, or simple combinations of the two. Our work potentially opens up new insights into proactive protection of FR privacy.
https://arxiv.org/abs/2404.14693
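In highly simplified form, DPG's embed-then-extract structure might look like the PyTorch sketch below; every shape, module, and loss here is an assumption made for illustration, not the paper's actual architecture:

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(512 + 64, 512), nn.Tanh())  # embed watermark
    decoder = nn.Linear(512, 64)                                  # extract watermark

    face_feat = torch.randn(1, 512)  # carrier's deep features
    watermark = torch.randn(1, 64)   # identity-specific watermark
    protected = encoder(torch.cat([face_feat, watermark], dim=-1))
    recovered = decoder(protected)                  # authorizer-side extraction
    deviation = -torch.norm(protected - face_feat)  # push away from source identity
    fidelity = nn.functional.mse_loss(recovered, watermark)  # keep traceability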
With wearing masks becoming a new cultural norm, facial expression recognition (FER) that takes masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for the facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes tokens for each task with separate branches while exchanging information using a cross-attention module. Our proposed framework reduces the overall complexity, compared with using separate networks for both tasks, through this simple yet effective cross-task fusion phase. Extensive experiments demonstrate that our proposed model performs better than, or on par with, different state-of-the-art methods on both the facial expression recognition and facial mask wearing classification tasks.
https://arxiv.org/abs/2404.14606
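The cross-attention exchange in the fusion phase above can be sketched with two standard attention modules, one per task branch (dimensions, head count, and token layout are assumptions):

    import torch
    import torch.nn as nn

    attn_fer = nn.MultiheadAttention(256, 4, batch_first=True)
    attn_mask = nn.MultiheadAttention(256, 4, batch_first=True)

    fer_tokens = torch.randn(2, 49, 256)   # expression-branch tokens
    mask_tokens = torch.randn(2, 49, 256)  # mask-branch tokens

    fer_fused, _ = attn_fer(fer_tokens, mask_tokens, mask_tokens)   # FER queries mask info
    mask_fused, _ = attn_mask(mask_tokens, fer_tokens, fer_tokens)  # and vice versa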
Face recognition technology has become an integral part of modern security systems and user authentication processes. However, these systems are vulnerable to spoofing attacks and can easily be circumvented. Most prior research in face anti-spoofing (FAS) approaches it as a two-class classification task where models are trained on real samples and known spoof attacks and tested for detection performance on unknown spoof attacks. However, in practice, FAS should be treated as a one-class classification task where, while training, one cannot assume any knowledge regarding the spoof samples a priori. In this paper, we reformulate the face anti-spoofing task from a one-class perspective and propose a novel hyperbolic one-class classification framework. To train our network, we use a pseudo-negative class sampled from the Gaussian distribution with a weighted running mean and propose two novel loss functions: (1) Hyp-PC: Hyperbolic Pairwise Confusion loss, and (2) Hyp-CE: Hyperbolic Cross Entropy loss, which operate in the hyperbolic space. Additionally, we employ Euclidean feature clipping and gradient clipping to stabilize the training in the hyperbolic space. To the best of our knowledge, this is the first work extending hyperbolic embeddings for face anti-spoofing in a one-class manner. With extensive experiments on five benchmark datasets: Rose-Youtu, MSU-MFSD, CASIA-MFSD, Idiap Replay-Attack, and OULU-NPU, we demonstrate that our method significantly outperforms the state-of-the-art, achieving better spoof detection performance.
https://arxiv.org/abs/2404.14406
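The Euclidean feature clipping mentioned above is a known trick for keeping the exponential map onto the Poincaré ball numerically stable; a sketch follows (the clip radius and curvature are assumed values, and the paper's Hyp-PC/Hyp-CE losses are not reproduced here):

    import torch

    def clip_features(x, r=1.0):
        """Cap the Euclidean norm before mapping into hyperbolic space."""
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return x * torch.clamp(norm, max=r) / norm

    def expmap0(x, c=1.0):
        """Exponential map at the origin of a Poincare ball with curvature c."""
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return torch.tanh(c ** 0.5 * norm) * x / (c ** 0.5 * norm)

    z = expmap0(clip_features(5.0 * torch.randn(8, 128)))  # safely inside the ball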
To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
https://arxiv.org/abs/2404.14349
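Functional connectivity in the sense described above can be approximated by correlating unit activations across the concept examples and keeping the strongest cross-layer edges; the sketch below is one such approximation with invented sizes and threshold:

    import torch

    acts_l1 = torch.randn(32, 64)   # 32 concept images, 64 units in layer 1
    acts_l2 = torch.randn(32, 128)  # same images, 128 units in layer 2

    a = (acts_l1 - acts_l1.mean(0)) / acts_l1.std(0)
    b = (acts_l2 - acts_l2.mean(0)) / acts_l2.std(0)
    corr = a.T @ b / (len(a) - 1)         # (64, 128) cross-layer correlation
    edges = (corr.abs() > 0.5).nonzero()  # candidate circuit edges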
Heterogeneous Face Recognition (HFR) aims to expand the applicability of Face Recognition (FR) systems to challenging scenarios, enabling the matching of face images across different domains, such as matching thermal images to the visible spectrum. However, the development of HFR systems is challenging because of the significant domain gap between modalities and the lack of availability of large-scale paired multi-channel data. In this work, we leverage a pretrained face recognition model as a teacher network to learn domain-invariant network layers called Domain-Invariant Units (DIU) to reduce the domain gap. The proposed DIU can be trained effectively in a contrastive distillation framework, even with a limited amount of paired training data. This proposed approach has the potential to enhance pretrained models, making them more adaptable to a wider range of variations in data. We extensively evaluate our approach on multiple challenging benchmarks, demonstrating superior performance compared to state-of-the-art methods.
https://arxiv.org/abs/2404.14343
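A contrastive distillation objective of the kind described above, with the frozen teacher's visible-image embeddings as targets for the student's thermal-image embeddings, might be sketched as follows (the temperature and InfoNCE form are assumptions):

    import torch
    import torch.nn.functional as F

    def contrastive_distill(student_emb, teacher_emb, tau=0.07):
        """Pull paired embeddings together, push other identities apart."""
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(teacher_emb, dim=-1).detach()  # teacher is frozen
        logits = s @ t.T / tau                         # (B, B) similarities
        labels = torch.arange(len(s))                  # positives on the diagonal
        return F.cross_entropy(logits, labels)

    loss = contrastive_distill(torch.randn(8, 512), torch.randn(8, 512))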
Heterogeneous Face Recognition (HFR) focuses on matching faces from different domains, for instance, thermal to visible images, making Face Recognition (FR) systems more versatile for challenging scenarios. However, the domain gap between these domains and the limited large-scale datasets in the target HFR modalities make it challenging to develop robust HFR models from scratch. In our work, we view different modalities as distinct styles and propose a method to modulate the feature maps of the target modality to address the domain gap. We present a new Conditional Adaptive Instance Modulation (CAIM) module that seamlessly fits into existing FR networks, turning them into HFR-ready systems. The CAIM block modulates intermediate feature maps, efficiently adapting to the style of the source modality and bridging the domain gap. Our method enables end-to-end training using a small set of paired samples. We extensively evaluate the proposed approach on various challenging HFR benchmarks, showing that it outperforms state-of-the-art methods. The source code and protocols for reproducing the findings will be made publicly available.
https://arxiv.org/abs/2404.14247
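An AdaIN-style reading of the instance modulation above is sketched below: instance-normalize an intermediate feature map, then apply learned scale and shift. The conditioning mechanism that makes CAIM "conditional" is omitted, so treat this as a simplified stand-in rather than the authors' module:

    import torch
    import torch.nn as nn

    class ModulationSketch(nn.Module):
        """Instance-normalize a feature map, then re-scale and re-shift it."""
        def __init__(self, channels):
            super().__init__()
            self.norm = nn.InstanceNorm2d(channels, affine=False)
            self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
            self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

        def forward(self, x):  # (B, C, H, W) intermediate FR feature map
            return self.gamma * self.norm(x) + self.beta

    out = ModulationSketch(64)(torch.randn(2, 64, 56, 56))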