We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model; the resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can outperform a $\textit{describe-then-localize}$ approach with other VLMs on open-ended object detection. We highlight a novel characteristic of FlexCap: its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL.
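As a concrete illustration of the length conditioning described above, the sketch below builds training pairs that prepend an explicit length token to each region caption; the `<len=N>` token and the `build_training_example` helper are illustrative assumptions, not FlexCap's actual interface.

```python
# Hypothetical sketch of length-conditioned captioning data, assuming a
# "<len=N>" conditioning token (not FlexCap's actual vocabulary).

def build_training_example(caption: str, box: tuple) -> dict:
    """Pair a region caption with an explicit length-conditioning prefix.

    `box` is (x1, y1, x2, y2) in normalized image coordinates.
    """
    n_words = len(caption.split())
    return {
        "box": box,
        "prompt": f"<len={n_words}>",  # conditions the decoder on length
        "target": caption,
    }

# the same region can be paired with captions of different lengths,
# teaching the model to modulate information density on request
examples = [
    build_training_example("dog", (0.1, 0.2, 0.4, 0.6)),
    build_training_example("a brown dog chasing a red frisbee",
                           (0.1, 0.2, 0.4, 0.6)),
]
for ex in examples:
    print(ex["prompt"], "->", ex["target"])
```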
https://arxiv.org/abs/2403.12026
In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-7B (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.
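For readers unfamiliar with the capsule half of this hybrid, the snippet below shows the standard capsule "squash" non-linearity (Sabour et al., 2017) in PyTorch; it is a generic sketch, not the paper's implementation.

```python
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Capsule 'squash': keeps a vector's orientation while mapping its
    length into [0, 1) so it can act as an existence probability."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

# e.g. capsules pooled from a ViG backbone's node features
caps = torch.randn(8, 16, 32)           # (batch, num_capsules, capsule_dim)
print(squash(caps).norm(dim=-1).max())  # all capsule lengths < 1
```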
https://arxiv.org/abs/2403.12009
The hybrid deep models of Vision Transformer (ViT) and Convolution Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably suffers from a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then utilizes more convolution operations over such low-resolution features. Experiments on both a recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
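A minimal sketch of the two-branch decomposition described above (shapes and layer choices are assumptions, not the published HIRI-ViT code):

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Cheap convolutions on full-resolution features in parallel with
    heavier convolutions on 2x down-sampled features."""

    def __init__(self, dim: int):
        super().__init__()
        # high-resolution branch: few, lightweight (depthwise) convolutions
        self.hi = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # low-resolution branch: down-sample first, then more convolutions
        self.lo = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.hi(x) + self.up(self.lo(x))

x = torch.randn(1, 64, 56, 56)
print(TwoBranchBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```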
https://arxiv.org/abs/2403.11999
Facial Expression Recognition (FER) plays a crucial role in computer vision and finds extensive applications across various fields. This paper presents our approach for the upcoming 6th Affective Behavior Analysis in-the-Wild (ABAW) competition, scheduled to be held at CVPR2024. In the facial expression recognition task, the limited size of the FER dataset poses a challenge to the expression recognition model's generalization ability, resulting in subpar recognition performance. To address this problem, we employ a semi-supervised learning technique to generate expression category pseudo-labels for unlabeled face data. At the same time, we uniformly sample the labeled facial expression samples and implement a debiased feedback learning strategy to address the problems of category imbalance in the dataset and possible data bias in semi-supervised learning. Moreover, to further compensate for the limitations and bias of features obtained only from static images, we introduce a Temporal Encoder to learn and capture temporal relationships between neighboring expression image features. In the 6th ABAW competition, our method achieved outstanding results on the official validation set, which fully confirms the effectiveness and competitiveness of our proposed method.
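The pseudo-labeling step can be sketched as standard confidence-thresholded semi-supervised labeling; the threshold and interface below are assumptions, not the submission's exact recipe.

```python
import torch

def pseudo_label(logits: torch.Tensor, threshold: float = 0.95):
    """Keep only high-confidence predictions on unlabeled faces as
    expression-category pseudo-labels."""
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold          # discard uncertain predictions
    return labels[keep], keep

logits = torch.randn(32, 8)           # 8 expression categories
labels, keep = pseudo_label(logits)
print(f"kept {int(keep.sum())} of 32 unlabeled samples")
```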
https://arxiv.org/abs/2403.11942
Unmanned Aerial Vehicles (UAVs) are gaining popularity in civil and military applications. However, uncontrolled access to restricted areas threatens privacy and security. Thus, prevention and detection of UAVs are pivotal to guarantee confidentiality and safety. Although active scanning, mainly based on radars, is one of the most accurate technologies, it can be expensive and less versatile than passive inspections, e.g., object recognition. Dynamic vision sensors (DVS) are bio-inspired event-based vision models that leverage timestamped pixel-level brightness changes in fast-moving scenes and adapt well to low-latency object detection. This paper presents F-UAV-D (Fast Unmanned Aerial Vehicle Detector), an embedded system that enables fast-moving drone detection. In particular, we propose a setup that exploits DVS as an alternative to RGB cameras in a real-time and low-power configuration. Our approach leverages the high-dynamic range (HDR) and background suppression of DVS and, when trained with various fast-moving drones, outperforms RGB input in suboptimal ambient conditions such as low illumination and fast-moving scenes. Our results show that F-UAV-D can (i) detect drones using less than 15 W on average and (ii) perform real-time inference (i.e., <50 ms) by leveraging the CPU and GPU nodes of our edge computer.
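A common way to feed DVS output to a detector is to accumulate timestamped events into polarity-channel frames; the sketch below shows this generic representation (sensor resolution and event format are assumptions, not F-UAV-D's pipeline).

```python
import numpy as np

def events_to_frame(events: np.ndarray, shape=(260, 346)) -> np.ndarray:
    """Accumulate DVS events, given as (t, x, y, polarity) rows, into a
    two-channel count image (one channel per polarity)."""
    frame = np.zeros((2, *shape), dtype=np.float32)
    for _, x, y, p in events:
        frame[int(p), int(y), int(x)] += 1.0
    return frame

ev = np.array([[0.001, 10, 20, 1], [0.002, 11, 20, 0]])
print(events_to_frame(ev).sum())  # 2.0
```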
https://arxiv.org/abs/2403.11875
A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens, and, for a query token, self-attention is learned along its trajectory. As such, our network can focus on fine-grained spatio-temporal patterns of a specific region in motion, such as finger movements. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce computation cost and memory. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively.
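One way to read the trajectory module is attention restricted to each token's aligned positions across frames; the sketch below illustrates that reading (index bookkeeping and shapes are assumptions, not TCNet's code).

```python
import torch
import torch.nn as nn

def trajectory_attention(frames: torch.Tensor, traj: torch.Tensor,
                         attn: nn.MultiheadAttention) -> torch.Tensor:
    """Gather each token's trajectory across time and attend within it.

    frames: (T, N, D) per-frame tokens; traj: (T, N) index of the token
    that continues trajectory n in frame t.
    """
    t, n, d = frames.shape
    tokens = torch.gather(frames, 1, traj.unsqueeze(-1).expand(t, n, d))
    tokens = tokens.permute(1, 0, 2)   # (N, T, D): one sequence per trajectory
    out, _ = attn(tokens, tokens, tokens)
    return out.permute(1, 0, 2)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
frames = torch.randn(8, 49, 64)        # 8 frames of 7x7 visual tokens
traj = torch.randint(0, 49, (8, 49))   # toy trajectory indices
print(trajectory_attention(frames, traj, attn).shape)  # (8, 49, 64)
```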
https://arxiv.org/abs/2403.11818
Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success on vision transformer (ViT) adaptation by improving parameter efficiency. However, enhancing inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally expensive. In this paper, we propose Dynamic Tuning (DyT), a novel approach to improve both parameter and inference efficiency for ViT adaptation. Specifically, besides using lightweight adapter modules, we propose a token dispatcher to distinguish informative tokens from less important ones, allowing the latter to dynamically skip the original block, thereby reducing redundant computation during inference. Additionally, we explore multiple design variants to find the best practice of DyT. Finally, inspired by the mixture-of-experts (MoE) mechanism, we introduce an enhanced adapter to further boost adaptation performance. We validate DyT across various tasks, including image/video recognition and semantic segmentation. For instance, DyT achieves comparable or even superior performance to existing PEFT methods while incurring only 71%-85% of their FLOPs on the VTAB-1K benchmark.
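The token dispatcher can be sketched as a scoring head that routes only the top-scoring tokens through the block while the rest take the identity path; the hard top-k rule below is an illustrative assumption, not DyT's learned dispatcher.

```python
import torch
import torch.nn as nn

class TokenDispatcher(nn.Module):
    """Route informative tokens through an expensive block; let the rest
    skip it. Hard top-k selection here is for illustration only."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))
        idx = self.score(x).squeeze(-1).topk(k, dim=1).indices   # (b, k)
        out = x.clone()                        # skipped tokens: identity
        picked = torch.gather(x, 1, idx.unsqueeze(-1).expand(b, k, d))
        out.scatter_(1, idx.unsqueeze(-1).expand(b, k, d), block(picked))
        return out

block = nn.Sequential(nn.Linear(192, 192), nn.GELU(), nn.Linear(192, 192))
x = torch.randn(2, 197, 192)
print(TokenDispatcher(192)(x, block).shape)  # torch.Size([2, 197, 192])
```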
https://arxiv.org/abs/2403.11808
Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of a short natural language description and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts, resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR improves zero-shot recognition over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
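The pipeline reduces to three calls: an LLM turns the task description into category-specific prompts, a VLM text encoder embeds them, and the embeddings are ensembled into one classifier row per class. Below is a runnable sketch with stand-in encoders (`query_llm` and `embed_text` are hypothetical stubs, not MPVR's meta-prompts).

```python
import numpy as np

def query_llm(label: str) -> list[str]:
    """Stand-in for the LLM call (GPT / Mixtral in the paper)."""
    return [f"a photo of a {label}", f"a close-up picture of a {label}"]

def embed_text(prompts: list[str]) -> np.ndarray:
    """Stand-in for a VLM text encoder (e.g. CLIP); random here."""
    rng = np.random.default_rng(abs(hash(prompts[0])) % 2**32)
    return rng.normal(size=(len(prompts), 512))

def class_weight(label: str) -> np.ndarray:
    """Ensemble prompt embeddings into one zero-shot classifier row."""
    emb = embed_text(query_llm(label))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    w = emb.mean(axis=0)
    return w / np.linalg.norm(w)

weights = np.stack([class_weight(c) for c in ["sunflower", "daisy"]])
print(weights.shape)  # (classes, dim); scores = image_emb @ weights.T
```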
https://arxiv.org/abs/2403.11755
Extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose directly embedding information extraction capabilities into pre-trained language models using probing classifiers, enabling efficient simultaneous text generation and information extraction. For this, we introduce an approach called EMBER and show that it enables named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments using GPT-2 show that EMBER maintains high token generation rates during streaming text generation, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline using a separate NER model. Code and data are available at this https URL.
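The core mechanism, a lightweight probe reading the LM's already-computed hidden states, can be sketched as follows (probe shape and tag set are assumptions; EMBER's exact probes are not reproduced here).

```python
import torch
import torch.nn as nn

class NERProbe(nn.Module):
    """Linear probing classifier over a frozen LM's hidden states, so NER
    comes almost for free during generation: one matmul per token."""

    def __init__(self, hidden_dim: int, num_tags: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_tags)

    @torch.no_grad()
    def tag(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim), reused from generation
        return self.head(hidden_states).argmax(dim=-1)

probe = NERProbe(hidden_dim=768, num_tags=5)  # e.g. BIO-style tags, GPT-2 size
h = torch.randn(12, 768)                      # states the LM already computed
print(probe.tag(h))
```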
https://arxiv.org/abs/2403.11747
Various musculoskeletal humanoids have been developed so far. While these humanoids have the advantage of flexible and redundant bodies that mimic the human body, they are still far from being applied to real-world tasks. One of the reasons for this is the difficulty of bipedal walking with a flexible body. Thus, we developed a musculoskeletal wheeled robot, Musashi-W, by combining a wheeled base and musculoskeletal upper limbs for real-world applications. Also, we constructed its software system by combining static and dynamic body schema learning, reflex control, and visual recognition. We show that the hardware and software of Musashi-W can make the most of the advantages of the musculoskeletal upper limbs, through several tasks: cleaning taught by a human, carrying a heavy object while considering muscle addition, and setting a table through dynamic cloth manipulation with variable stiffness.
https://arxiv.org/abs/2403.11729
Despite the availability of large datasets for tasks like image classification and image-text alignment, labeled data for more complex recognition tasks, such as detection and segmentation, is less abundant. In particular, for instance segmentation, annotations are time-consuming to produce, and the distribution of instances is often highly skewed across classes. While semi-supervised teacher-student distillation methods show promise in leveraging vast amounts of unlabeled data, they suffer from miscalibration, resulting in overconfidence in frequently represented classes and underconfidence in rarer ones. Additionally, these methods encounter difficulties in efficiently learning from a limited set of examples. First, we introduce a dual strategy to enhance the teacher model's training process, substantially improving performance on few-shot learning. Second, we propose a calibration correction mechanism that enables the student model to correct the teacher's calibration errors. Using our approach, we observed marked improvements over a state-of-the-art supervised baseline on the LVIS dataset, with an increase of 2.8% in average precision (AP) and a 10.3% gain in AP for rare classes.
https://arxiv.org/abs/2403.11675
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantic-invariant, such as StyleFool. Despite the query efficiency, the naturalness of the minutiae areas still requires improvement, since StyleFool applies style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. Then, we add style-transfer-based perturbations to several regions selected based on an associative criterion of transfer-based gradient information and regional area. Fine adjustment of the perturbations follows to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool can improve both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also showcase that the scrupulous segmentation of SAM helps to improve the scalability of adversarial attacks under high-resolution data.
https://arxiv.org/abs/2403.11656
This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a far greater degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.
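Conditioning a diffusion model solely on an ID vector can be pictured as projecting the ArcFace embedding into a short sequence of pseudo-tokens for cross-attention; the sketch below illustrates that idea only (token count and dimensions are assumptions, not Arc2Face's architecture).

```python
import torch
import torch.nn as nn

class IDConditioner(nn.Module):
    """Map a 512-d ArcFace embedding to pseudo text tokens that a diffusion
    model's cross-attention can consume in place of a text prompt."""

    def __init__(self, id_dim: int = 512, token_dim: int = 768,
                 num_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(id_dim, num_tokens * token_dim)
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, id_emb: torch.Tensor) -> torch.Tensor:
        b = id_emb.shape[0]
        return self.proj(id_emb).view(b, self.num_tokens, self.token_dim)

cond = IDConditioner()(torch.randn(2, 512))
print(cond.shape)  # (2, 4, 768), fed to cross-attention instead of text
```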
https://arxiv.org/abs/2403.11641
Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain, and cross-task generalization ability while adapting to new tasks is still a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product on several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference speed.
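The parameter arithmetic is the point: a full 512-d context vector is replaced by a frozen dictionary combination plus a Kronecker product of tiny learnable factors. A minimal sketch, with illustrative shapes:

```python
import torch

def kron_context(a: torch.Tensor, b: torch.Tensor, coeffs: torch.Tensor,
                 base: torch.Tensor) -> torch.Tensor:
    """One context vector = linear combination of a frozen dictionary plus
    a learnable Kronecker-product term built from tiny factors."""
    learnable = torch.kron(a, b)        # (16,) x (32,) -> (512,)
    return coeffs @ base + learnable    # frozen part + learnable part

base = torch.randn(32, 512)             # frozen: quantized token embeddings
coeffs = torch.randn(32, requires_grad=True)
a = torch.randn(16, requires_grad=True)
b = torch.randn(32, requires_grad=True)
ctx = kron_context(a, b, coeffs, base)
print(ctx.shape, sum(p.numel() for p in (a, b, coeffs)))  # 512-d, 80 params
```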
https://arxiv.org/abs/2403.11631
For training a video-based action recognition model that accepts multi-view video, annotating frame-level labels is tedious and difficult; it is relatively easy, by contrast, to annotate sequence-level labels. Such coarse annotations are called weak labels. However, training a multi-view video-based action recognition model with weak labels for frame-level perception is challenging. In this paper, we propose a novel learning framework, in which the weak labels are first used to train a multi-view video-based base model, which is subsequently used for downstream frame-level perception tasks. The base model is trained to obtain individual latent embeddings for each view in the multi-view input. For training the model using the weak labels, we propose a novel latent loss function. We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks. The proposed framework is evaluated on the MM Office dataset by comparing several baseline algorithms. The results show that the proposed base model is effectively trained using weak labels and that the latent embeddings help the downstream models improve accuracy.
https://arxiv.org/abs/2403.11616
Continual learning can empower vision-language models to continuously acquire new knowledge without the need for access to the entire historical dataset. However, mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) the significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE adapters and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing the parameter training burden by 60%. Our code is available at this https URL.
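The DDAS router can be pictured as a distance test against per-task distribution statistics: in-distribution features go to the MoE adapters, everything else falls back to frozen CLIP for zero-shot prediction. The rule below is a guess at the simplest such test, not the paper's exact criterion.

```python
import torch

def route(x_feat: torch.Tensor, task_means: torch.Tensor,
          threshold: float) -> str:
    """Send near-distribution features to the adapters, the rest to CLIP."""
    dist = torch.cdist(x_feat.unsqueeze(0), task_means).min()
    return "moe_adapter" if dist.item() < threshold else "frozen_clip"

task_means = torch.randn(5, 512)   # one center per task seen so far
print(route(torch.randn(512), task_means, threshold=20.0))
```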
https://arxiv.org/abs/2403.11549
Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.
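The gating function for fusing image features with OCR-derived statistical features might look like the standard sigmoid-gated blend below; the abstract does not specify OANet's exact form, so this is an assumption.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learn a per-dimension gate that blends CNN image features with
    OCR-derived statistical features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat: torch.Tensor,
                ocr_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([img_feat, ocr_feat], dim=-1))
        return g * img_feat + (1 - g) * ocr_feat

fused = GatedFusion(256)(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```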
https://arxiv.org/abs/2403.11536
Due to privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection, and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Code will be released at \url{this https URL}.
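The group sparse regularization admits a compact reading: treat each LoRA pair as one group and penalize its joint Frobenius norm, so entire groups are driven exactly to zero. A sketch under that reading (the weight `alpha` is an assumption):

```python
import torch
import torch.nn as nn

def group_sparse_penalty(lora_pairs, alpha: float = 1e-2) -> torch.Tensor:
    """Group-lasso over LoRA modules: each (A, B) pair is one group,
    penalized by its joint norm so unneeded groups collapse to zero."""
    total = sum(torch.sqrt((A ** 2).sum() + (B ** 2).sum())
                for A, B in lora_pairs)
    return alpha * total

# one LoRA pair per fine-tuned FFN layer; zeroed groups can be pruned
pairs = [(nn.Parameter(torch.randn(768, 8)), nn.Parameter(torch.zeros(8, 768)))
         for _ in range(12)]
print(group_sparse_penalty(pairs))
```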
https://arxiv.org/abs/2403.11530
Recent advancements in generative AI suggest that, given a visual prompt, GPT-4V can demonstrate significant proficiency in image recognition tasks. Despite its impressive capabilities, the financial cost associated with GPT-4V's inference presents a substantial barrier to its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With a collage prompt, GPT-4V is able to perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. A graph predictor is trained to indicate the accuracy of each collage prompt, and we then propose an optimization method to navigate the search space of possible image arrangements. Experimental results across various datasets demonstrate that the cost-efficiency score of collage prompting is much higher than that of standard prompting. Additionally, a collage prompt with a learned arrangement achieves clearly better accuracy than a collage prompt with a random arrangement in GPT-4V's visual recognition.
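Building the collage itself is simple; the sketch below tiles images into a grid in a caller-supplied order (the learned arrangement and the graph predictor are not reproduced here).

```python
from PIL import Image

def make_collage(images, grid=(2, 2), tile=224) -> Image.Image:
    """Concatenate several images into one visual input for GPT-4V;
    the order of `images` is the arrangement being optimized."""
    rows, cols = grid
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    for i, im in enumerate(images[: rows * cols]):
        r, c = divmod(i, cols)
        canvas.paste(im.resize((tile, tile)), (c * tile, r * tile))
    return canvas

imgs = [Image.new("RGB", (256, 256), c)
        for c in ("red", "blue", "green", "white")]
make_collage(imgs).save("collage.png")
```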
https://arxiv.org/abs/2403.11468
Conventional approaches to facial expression recognition primarily focus on the classification of six basic facial expressions. Nevertheless, real-world situations present a wider range of complex compound expressions consisting of combinations of these basic ones, which are hard to handle due to the limited availability of comprehensive training datasets. The 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) offered unlabeled datasets containing compound expressions. In this study, we propose a zero-shot approach for recognizing compound expressions by leveraging a pretrained visual language model integrated with some traditional CNN networks.
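In the zero-shot setting, recognition reduces to cosine similarity between an image embedding and text embeddings of compound-expression prompts; a generic sketch (prompt wording and encoders are assumptions):

```python
import numpy as np

def zero_shot_scores(img_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one image embedding and prompt embeddings
    for compound classes such as "happily surprised"."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img

scores = zero_shot_scores(np.random.randn(512), np.random.randn(7, 512))
print(int(scores.argmax()))  # index of the predicted compound expression
```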
https://arxiv.org/abs/2403.11450