This paper introduces a new dispersed Haar-like filter for efficient face detection. The basic idea behind finding the filter is maximising between-class variance while minimising within-class variance. The proposed filters can be considered an optimal configuration of dispersed Haar-like filters, i.e., filters with disjoint black and white parts.
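The selection criterion is the classical Fisher ratio. As a minimal sketch (our own illustration, not the paper's code), a candidate dispersed filter, given as disjoint sets of white and black pixel coordinates, can be scored on face and non-face training images like this:

```python
import numpy as np

def filter_response(image, white_px, black_px):
    # Dispersed Haar-like response: sum over white pixels minus sum over
    # black pixels; the two pixel sets may be disjoint and non-contiguous.
    return image[tuple(white_px.T)].sum() - image[tuple(black_px.T)].sum()

def fisher_score(face_imgs, nonface_imgs, white_px, black_px):
    # Score a candidate filter by between-class over within-class variance.
    r_face = np.array([filter_response(im, white_px, black_px) for im in face_imgs])
    r_non = np.array([filter_response(im, white_px, black_px) for im in nonface_imgs])
    between = (r_face.mean() - r_non.mean()) ** 2
    within = r_face.var() + r_non.var()
    return between / (within + 1e-12)
```

A search over candidate pixel configurations would then keep the filters with the highest score.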
https://arxiv.org/abs/2404.10476
In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement (LLE), called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training an LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii) To address the disregard for negative samples and the inadequacy of their generation in existing methods, we incorporate physics-inspired contrastive learning for LLE and design the Bag of Curves (BoC) method to generate more reasonable negative samples that closely adhere to the underlying physical imaging principle. (iii) To overcome the reliance on semantic ground truths in existing methods, we propose an unsupervised regional segmentation module, ensuring regional brightness consistency while eliminating the dependency on semantic ground truths. Overall, the proposed PIE can effectively learn from unpaired positive/negative samples and smoothly realize non-semantic regional enhancement, which is clearly different from existing LLE efforts. Beyond the novel architecture of PIE, we explore the gains PIE brings to downstream tasks such as semantic segmentation and face detection. Training on readily available open data and extensive experiments demonstrate that our method surpasses state-of-the-art LLE models on six independent cross-scene datasets. PIE runs fast with reasonable GFLOPs at test time, making it easy to use on mobile devices.
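The abstract does not define the Bag of Curves beyond its adherence to the imaging principle; one plausible reading, sketched below under that assumption, generates negatives by applying global tone curves (here plain gamma curves, our choice) that push an image toward implausible exposures:

```python
import numpy as np

def bag_of_curves_negatives(img, gammas=(0.3, 0.5, 2.0, 4.0)):
    """Sketch of curve-based negative-sample generation: apply global tone
    curves (simple gamma curves here) to a [0,1] RGB image to create
    under-/over-exposed versions usable as contrastive negatives."""
    return [np.clip(img ** g, 0.0, 1.0) for g in gammas]
```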
https://arxiv.org/abs/2404.04586
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally we show applications in video editing and personalization.
https://arxiv.org/abs/2403.08764
This technical report presents a diffusion-model-based framework for face swapping between two portrait images. The basic framework consists of three components, i.e., IP-Adapter, ControlNet, and Stable Diffusion's inpainting pipeline, for face feature encoding, multi-conditional generation, and face inpainting, respectively. In addition, we introduce facial guidance optimization and CodeFormer-based blending to further improve the generation quality. Specifically, we adopt a recent lightweight customization method (i.e., DreamBooth-LoRA) to guarantee identity consistency by 1) using a rare identifier "sks" to represent the source identity, and 2) injecting the image features of the source portrait into each cross-attention layer in the same way as the text features. We then resort to the strong inpainting ability of Stable Diffusion, using the Canny edge map and face detection annotations of the target portrait as conditions to guide ControlNet's generation and align the source portrait with the target portrait. To further correct face alignment, we add a facial guidance loss to optimize the text embedding during sample generation.
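The described assembly maps onto the diffusers library fairly directly. Below is a minimal, hypothetical sketch of the ControlNet-conditioned inpainting step only (the IP-Adapter injection and the facial guidance loss are omitted); the model IDs, file paths, and LoRA weights are placeholders, not the author's artifacts:

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from PIL import Image

# Canny edges of the target portrait serve as the ControlNet condition.
target = np.array(Image.open("target.png").convert("RGB"))
edges = cv2.Canny(cv2.cvtColor(target, cv2.COLOR_RGB2GRAY), 100, 200)
canny = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Hypothetical DreamBooth-LoRA weights binding the rare token "sks"
# to the source identity (the path is a placeholder).
pipe.load_lora_weights("path/to/source_identity_lora")

result = pipe(prompt="a photo of sks person",
              image=Image.open("target.png"),          # target portrait
              mask_image=Image.open("face_mask.png"),  # face region to repaint
              control_image=canny,
              num_inference_steps=30).images[0]
result.save("swapped.png")
```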
https://arxiv.org/abs/2403.01108
Movement disorders are typically diagnosed by consensus-based expert evaluation of clinically acquired patient videos. However, broad sharing of patient videos poses risks to patient privacy. Face blurring can be used to de-identify videos, but this process is often manual and time-consuming. Available automated face blurring techniques are subject to excessive, inconsistent, or insufficient facial blurring - all of which can be disastrous for video assessment and patient privacy. Furthermore, assessing movement disorders in these videos is often subjective. The extraction of quantifiable kinematic features can help inform movement disorder assessment in these videos, but existing methods for doing so are prone to errors when run on pre-blurred videos. We have developed open-source software called SecurePose that achieves both reliable face blurring and automated kinematic extraction in patient videos recorded in a clinic setting using an iPad. SecurePose extracts kinematics using a pose estimation method (OpenPose), tracks and uniquely identifies all individuals in the video, identifies the patient, and performs face blurring. The software was validated on gait videos recorded during outpatient clinic visits of 116 children with cerebral palsy. The validation involved assessing the intermediate steps of kinematics extraction and comparing face blurring against manual blurring (ground truth). Moreover, when SecurePose was compared with six selected existing methods, it outperformed the other methods in automated face detection and achieved ceiling accuracy in 91.08% less time than a robust manual face blurring method. Furthermore, ten experienced researchers found SecurePose easy to learn and use, as evidenced by the System Usability Scale. These results validated the performance and usability of SecurePose for face blurring and kinematics extraction on clinically recorded gait videos.
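The face-blurring step keyed to pose keypoints can be sketched as follows (a simplification of SecurePose's approach, assuming OpenPose-style nose and neck keypoints; the sizing heuristic is ours):

```python
import cv2
import numpy as np

def blur_face(frame, nose_xy, neck_xy, scale=1.6):
    """Sketch: blur a square patch around the face, sized from the
    nose-to-neck distance given by pose keypoints (e.g., OpenPose)."""
    r = int(scale * np.linalg.norm(np.array(nose_xy) - np.array(neck_xy)))
    x, y = map(int, nose_xy)
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = max(x - r, 0), max(y - r, 0), min(x + r, w), min(y + r, h)
    frame[y0:y1, x0:x1] = cv2.GaussianBlur(frame[y0:y1, x0:x1], (51, 51), 0)
    return frame
```

Keying the blur to tracked keypoints rather than a per-frame face detector is what keeps the blurring consistent across frames.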
https://arxiv.org/abs/2402.14143
The majority of computer vision applications that handle images featuring humans use face detection as a core component. Despite much research on the topic, face detection still has unresolved issues, and its accuracy and speed can yet be improved. This review paper presents the progress made in this area as well as the substantial issues that still need to be tackled, and provides research directions that can be taken up as research projects in the field of face detection.
https://arxiv.org/abs/2402.03796
Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. These transparent glasses share the visual appearance of both the transmitted arbitrary background scenes and the reflected objects, and thus have no fixed patterns. Recent visual foundation models, which are trained on vast amounts of data, have demonstrated stunning performance in image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion. Specifically, we devise a simple glass surface segmentor named GEM, which consists only of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection adaptively identifies glass surface features and assigns them as initialized queries for the mask decoder. We also propose a synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, generated via a diffusion model at four different scales, containing 1x, 5x, 10x, and 20x the original real data size. This dataset is a feasible source for transfer learning. The scale of the synthetic data has a positive impact on transfer learning, although the improvement gradually saturates as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state-of-the-art on the GSD-S validation set (IoU +2.1%). Codes and datasets are available at: this https URL.
https://arxiv.org/abs/2401.15282
The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the speaking character identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of videos available on modern streaming services. Project page: \url{this https URL}
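The core classification step, assigning each speech segment to the character whose audio exemplars it most resembles, can be sketched as nearest-centroid matching over speech embeddings (a simplified reading; the embedding model is left abstract here):

```python
import numpy as np

def classify_segments(seg_embs, exemplar_embs):
    """Sketch: label each speech-segment embedding with the character whose
    high-precision audio exemplars are closest in cosine similarity.
    `exemplar_embs` maps character name -> array of exemplar embeddings."""
    def norm(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)
    centroids = {c: norm(e).mean(axis=0) for c, e in exemplar_embs.items()}
    names = list(centroids)
    C = norm(np.stack([centroids[n] for n in names]))
    sims = norm(np.asarray(seg_embs)) @ C.T       # (segments, characters)
    return [names[i] for i in sims.argmax(axis=1)]
```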
https://arxiv.org/abs/2401.12039
The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and time-consuming. To solve this problem, we have utilized state-of-the-art deep learning models to automate the process of vehicle entrance and parking for any organization. To ensure security, our system integrates vehicle detection, license plate verification, and face detection and recognition models to verify that the person and vehicle are registered with the organization. We trained multiple deep learning models for vehicle detection, license plate detection, and face detection and recognition, of which the YOLOv8n model outperformed all the others. Furthermore, license plate recognition is facilitated by Google's Tesseract-OCR engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.
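License plate recognition via Tesseract typically amounts to OCR on the detected plate crop; a sketch using pytesseract (the preprocessing choices are ours, not the paper's):

```python
import cv2
import pytesseract

def read_plate(plate_crop_bgr):
    """Sketch: OCR a detected license-plate crop with Tesseract.
    --psm 7 treats the crop as a single text line; the whitelist keeps
    plate-like characters only."""
    gray = cv2.cvtColor(plate_crop_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return pytesseract.image_to_string(gray, config=config).strip()
```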
https://arxiv.org/abs/2312.02699
Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. There is therefore a pressing demand for compact face detection models that can function efficiently on resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers, yet despite their expanding popularity these methods have not been well examined in the context of face detectors. In this paper, we apply filter pruning to two already small and compact face detectors, EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1-norm pruning as a baseline against which to compare the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even a small accuracy gain at low pruning rates.
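FPGM itself is simple to state: filters whose summed distance to all other filters in a layer is smallest lie near the geometric median and are deemed the most redundant. A minimal sketch:

```python
import torch

def fpgm_filter_indices(conv_weight, prune_ratio=0.3):
    """Sketch of Filter Pruning via Geometric Median: filters closest to the
    geometric median of all filters in a layer are considered replaceable
    and are selected for pruning. conv_weight: (out_ch, in_ch, k, k)."""
    flat = conv_weight.detach().reshape(conv_weight.size(0), -1)
    # Sum of pairwise distances approximates closeness to the geometric median.
    dists = torch.cdist(flat, flat).sum(dim=1)
    n_prune = int(prune_ratio * flat.size(0))
    return torch.argsort(dists)[:n_prune]  # smallest total distance = prune

# With Soft Filter Pruning, the selected filters are zeroed (not removed)
# after each epoch and may recover during subsequent training, e.g.:
# w.data[fpgm_filter_indices(w)] = 0
```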
https://arxiv.org/abs/2311.16613
Human eye gaze estimation is an important cognitive ingredient for successful human-robot interaction, enabling the robot to read and predict human behavior. We approach this problem using artificial neural networks and build a modular system that estimates gaze from separately cropped eyes, taking advantage of existing well-functioning components for face detection (RetinaFace) and head pose estimation (6DRepNet). Our proposed method does not require any special hardware or infrared filters but uses a standard notebook built-in RGB camera, as is common for appearance-based methods. Using the MetaHuman tool, we also generated a large synthetic dataset of more than 57,000 human faces and made it publicly available. Including this dataset (with eye gaze and head pose information) on top of the standard Columbia Gaze dataset when training the model led to better accuracy, with a mean average error below two degrees in eye pitch and yaw, which compares favourably to related methods. We also verified the feasibility of our model through preliminary testing in a real-world setting using the built-in 4K camera in the eye of the NICO semi-humanoid robot.
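The modular design crops each eye separately before gaze regression; a sketch of that cropping step, assuming eye-landmark coordinates as returned by a detector such as RetinaFace (patch size and layout are our assumptions):

```python
import numpy as np

def crop_eyes(frame, left_eye_xy, right_eye_xy, patch=64):
    """Sketch: cut fixed-size patches around the eye landmarks returned by
    a face detector; the patches are then fed to the gaze-estimation
    network together with the estimated head pose."""
    crops = []
    for (x, y) in (left_eye_xy, right_eye_xy):
        x, y, r = int(x), int(y), patch // 2
        crops.append(frame[max(y - r, 0):y + r, max(x - r, 0):x + r])
    return crops
```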
https://arxiv.org/abs/2311.14175
In response to the global COVID-19 pandemic, there has been a critical demand for protective measures, with face masks emerging as a primary safeguard. The approach involves a two-fold strategy: first, recognizing the presence of a face by detecting faces, and second, identifying masks on those faces. This project utilizes deep learning to create a model that can detect face masks in real-time streaming video as well as in images. Face detection, a facet of object detection, finds applications in diverse fields such as security, biometrics, and law enforcement. Various detector systems have been developed and deployed worldwide, with convolutional neural networks chosen for their superior accuracy and speed in object detection. Experimental results attest to the model's excellent accuracy on test data. The primary focus of this research is to enhance security, particularly in sensitive areas. The paper proposes a rapid image pre-processing method with masks centred on faces. Employing feature extraction and a convolutional neural network, the system classifies and detects individuals wearing masks. The research unfolds in three stages: image pre-processing, image cropping, and image classification, collectively contributing to the identification of masked faces. Continuous surveillance through webcams or CCTV cameras ensures constant monitoring, triggering a security alert if a person is detected without a mask.
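A minimal sketch of the described detect-crop-classify pipeline, using OpenCV's Haar cascade as a stand-in face detector and a hypothetical `mask_model` binary classifier (both are our assumptions, not the paper's components):

```python
import cv2

# Detect faces, crop them, classify each crop as mask / no-mask.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def masked_faces(frame_bgr, mask_model):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    results = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        crop = cv2.resize(frame_bgr[y:y + h, x:x + w], (224, 224))
        results.append(((x, y, w, h), mask_model.predict(crop)))
    return results  # raise a security alert if any prediction is "no mask"
```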
https://arxiv.org/abs/2311.10408
Two difficulties make low-light image enhancement a challenging task: first, it must consider not only luminance restoration but also image contrast, image denoising, and color distortion simultaneously; second, the effectiveness of existing low-light enhancement methods depends on paired or unpaired training data, leading to poor generalization. To solve these problems, we propose a new learning-based Retinex decomposition method for zero-shot low-light enhancement, called ZERRINNet. To this end, we first design the N-Net network, together with a noise loss term, to denoise the original low-light image by estimating its noise. Moreover, RI-Net is used to estimate the reflectance component and the illumination component; to address color distortion and contrast, we use a texture loss term and a segmented smoothing loss to constrain the reflectance and illumination components. Finally, our method is a zero-reference enhancement method that is not affected by the training data of paired or unpaired datasets, so its generalization performance is greatly improved; in the paper, we validate it effectively on a homemade real-life low-light dataset and, additionally, on advanced vision tasks such as face detection, target recognition, and instance segmentation. We conducted comparative experiments on a large number of public datasets, and the results show that the performance of our method is competitive with current state-of-the-art methods. The code is available at: this https URL
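The exact loss terms are not given in the abstract; assuming the classical Retinex model (image = reflectance times illumination, plus additive noise), a zero-reference reconstruction-plus-smoothness constraint might look like this sketch:

```python
import torch
import torch.nn.functional as F

def retinex_losses(low, R, L, noise, w_smooth=0.1):
    """Sketch of zero-reference Retinex constraints: the denoised input
    should equal reflectance * illumination, and the illumination should
    be piecewise smooth. R, L, noise stand for RI-Net / N-Net outputs;
    the weighting is an assumption."""
    recon = F.l1_loss(R * L, low - noise)            # I - n = R * L
    dx = (L[..., :, 1:] - L[..., :, :-1]).abs().mean()
    dy = (L[..., 1:, :] - L[..., :-1, :]).abs().mean()
    return recon + w_smooth * (dx + dy)              # illumination smoothness
```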
https://arxiv.org/abs/2311.02995
Despite significant research on lightweight deep neural networks (DNNs) designed for edge devices, current face detectors do not fully meet the requirements of "intelligent" CMOS image sensors (iCISs) integrated with embedded DNNs. These sensors are essential in various practical applications, such as energy-efficient mobile phones and surveillance systems with always-on capabilities. One noteworthy limitation is the absence of suitable face detectors for the always-on scenario, a crucial aspect of image-sensor-level applications. These detectors must operate directly on sensor RAW data before the image signal processor (ISP) takes over. This gap poses a significant challenge to achieving optimal performance in such scenarios, and further research and development are necessary to fully leverage the potential of iCIS applications. In this study, we aim to bridge this gap by exploring extremely low-bit lightweight face detectors, focusing on the always-on face detection scenario for mobile image sensor applications. To achieve this, our proposed model utilizes sensor-aware synthetic RAW inputs, simulating always-on face detection processed "before" the ISP chain. Our approach employs ternary (-1, 0, 1) weights for potential implementation in image sensors, resulting in a relatively simple network architecture with shallow layers and extremely low bitwidth. Our method demonstrates reasonable face detection performance and excellent efficiency in simulation studies, offering promising possibilities for practical always-on face detectors in real-world applications.
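The abstract specifies ternary (-1, 0, 1) weights but not the exact quantizer; a minimal sketch of standard threshold-based ternarization, as one plausible realization, is:

```python
import torch

def ternarize(w, t=0.05):
    """Sketch of ternary (-1, 0, 1) weight quantization: weights whose
    magnitude falls below a threshold are zeroed, the rest keep their sign.
    A per-layer scaling factor alpha preserves overall magnitude; the
    threshold fraction t is an assumption."""
    mask = w.abs() > t * w.abs().max()
    alpha = w.abs()[mask].mean() if mask.any() else w.new_tensor(0.0)
    return alpha * torch.sign(w) * mask
```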
https://arxiv.org/abs/2311.01001
Wearing a mask is one of the important measures for preventing infectious diseases, yet it is difficult to monitor people's mask wearing in public places with high traffic flow. To address this problem, this paper proposes a mask-wearing face detection model based on YOLOv5l. Firstly, Multi-Head Attentional Self-Convolution not only improves the convergence speed of the model but also enhances its detection accuracy. Secondly, the introduction of the Swin Transformer Block extracts more useful feature information, enhances the detection of small targets, and improves the overall accuracy of the model. Our designed I-CBAM module further improves target detection accuracy. In addition, enhanced feature fusion enables the model to better adapt to object detection tasks at different scales. In experiments on the MASK dataset, the proposed model achieved a 1.1% improvement in mAP(0.5) and a 1.3% improvement in mAP(0.5:0.95) compared to the YOLOv5l model. Our proposed method significantly enhances mask-wearing detection capability.
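I-CBAM is not specified beyond its name; for orientation, here is a sketch of the standard CBAM block it presumably modifies (channel attention followed by spatial attention):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of a standard CBAM block; the paper's I-CBAM is a modified
    variant whose exact changes are not described in the abstract."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                 nn.Linear(ch // r, ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))         # spatial attention
```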
https://arxiv.org/abs/2310.10245
Practical video analytics systems deployed in bandwidth-constrained environments, such as autonomous vehicles, perform computer vision tasks like face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed to modules that perform face detection, alignment, and recognition sequentially. Typically, the modules of these systems are evaluated independently using task-specific, imbalanced datasets that can skew performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations, dataset imbalances, and inconsistent annotations can lead to incorrect system performance estimates. We propose strategies to create balanced evaluation subsets of our dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides consistent, accurate, and interpretable estimates of the system's performance, which is critical for real-world applications.
https://arxiv.org/abs/2310.06945
Individual identification plays a pivotal role in ecology and ethology, notably as a tool for understanding complex social structures. However, traditional identification methods often involve invasive physical tags and can prove both disruptive for animals and time-intensive for researchers. In recent years, the integration of deep learning into research has offered new methodological perspectives through the automation of complex tasks. Harnessing object detection and recognition technologies is increasingly used by researchers to achieve identification on video footage. This study represents a preliminary exploration into the development of a non-invasive tool for face detection and individual identification of Japanese macaques (Macaca fuscata) through deep learning. The ultimate goal of this research is to use the identifications made on the dataset to automatically generate a social network representation of the studied population. The current main results are promising: (i) the creation of a Japanese macaque face detector (a Faster R-CNN model) reaching 82.2% accuracy, and (ii) the creation of an individual recognizer for the Kōjima island macaque population (a YOLOv8n model) reaching 83% accuracy. We also created a social network of the Kōjima population by traditional methods, based on co-occurrences in videos, providing a benchmark against which the automatically generated network will be assessed for reliability. These preliminary results are a testament to the potential of this innovative approach to provide the scientific community with a tool for tracking individuals and for social network studies in Japanese macaques.
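The macaque face detector is a standard Faster R-CNN; a minimal fine-tuning setup in torchvision (a sketch, assuming one "macaque face" class plus background) looks like:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Adapt a COCO-pretrained Faster R-CNN to a single "macaque face" class
# (plus background), then fine-tune it on the annotated video frames.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
```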
https://arxiv.org/abs/2310.06489
Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent: accurate FLD relies on robust face detection, and HPE is intricately associated with these keypoints. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is a real-time multi-task detection system capable of simultaneously performing joint detection of faces, facial landmarks, and head poses. The system builds upon the widely adopted YOLOv8 detection framework and extends the original object detection head with an additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we optimize and enhance various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results verify our model's ability to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
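The abstract describes the extension as an extra landmark regression branch on the detection head; a minimal sketch of such a branch, with the channel width and the number of landmarks as assumptions, is:

```python
import torch.nn as nn

class LandmarkHead(nn.Module):
    """Sketch of an extra regression branch: alongside each box prediction,
    regress K facial landmarks (x, y) per location. K=5 and the channel
    width `ch` of the detection head's feature map are both assumptions."""
    def __init__(self, ch, num_landmarks=5):
        super().__init__()
        self.conv = nn.Conv2d(ch, num_landmarks * 2, kernel_size=1)

    def forward(self, feat):          # feat: (B, ch, H, W)
        return self.conv(feat)        # (B, 2K, H, W) landmark offsets
```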
https://arxiv.org/abs/2309.11773
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competitive with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on transferability. We show that these adversarial examples can also attack other MLLMs, e.g., with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard: face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
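As a concrete illustration of transfer attacks through a white-box surrogate, here is a minimal PGD-style sketch (our own reconstruction under standard assumptions, not the authors' released code; `encoder` stands for any differentiable surrogate vision encoder):

```python
import torch

def pgd_feature_attack(encoder, image, eps=8/255, alpha=1/255, steps=100):
    """Sketch: perturb the image so the surrogate encoder's features move
    as far as possible from the clean features; the perturbed image is
    then submitted to the black-box MLLM, relying on transferability.
    `image` is a (1, 3, H, W) tensor in [0, 1]."""
    clean_feat = encoder(image).detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = (encoder(adv) - clean_feat).norm()   # maximize feature deviation
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)  # project to eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()
```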
https://arxiv.org/abs/2309.11751
Robot vision often involves a large computational load due to large images that must be processed in a short amount of time. Existing solutions often involve reducing image quality, which can negatively impact processing. Another approach is to generate regions of interest with expensive vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To achieve this, we propose a unique attention mechanism to localize speech sources and evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline is flexible and can be easily adapted for human-robot interaction, robot surveillance, video conferencing, or smart glasses.
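The mapping from a localized speech source to an image region of interest can be sketched as follows (the geometry and parameter names are our assumptions, not the paper's):

```python
import numpy as np

def roi_from_doa(doa_deg, img_w, img_h, fov_deg=90.0, roi_frac=0.3):
    """Sketch: map a sound source's direction of arrival (from a microphone
    array) to a vertical strip of the image, so the face detector only runs
    on that region. Assumes the camera's horizontal field of view is centred
    on the array's 0-degree axis."""
    cx = int(img_w * (0.5 + doa_deg / fov_deg))   # azimuth -> pixel column
    half = int(roi_frac * img_w / 2)
    x0, x1 = max(cx - half, 0), min(cx + half, img_w)
    return x0, 0, x1, img_h                       # ROI as (x0, y0, x1, y1)
```

Running the detector only on the returned strip is what yields the reduction in computational load.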
https://arxiv.org/abs/2309.08005