Being able to understand visual scenes is a prerequisite for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach to enabling reasoning over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such a challenging setting. At its core, HiKER-SGG utilizes a hierarchical knowledge graph to refine its predictions from coarse initial estimates to detailed ones. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at this https URL.
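As a loose illustration of the coarse-to-fine idea (a minimal sketch under our own assumptions, not the authors' implementation), predicate scores can be pooled over a hand-built two-level hierarchy, with the winning superclass restricting the final fine-grained choice:

```python
import torch

# Hypothetical two-level predicate hierarchy: superclass -> fine predicates.
HIERARCHY = {
    "geometric": ["above", "under", "near"],
    "semantic": ["riding", "holding", "eating"],
}
FINE = [p for ps in HIERARCHY.values() for p in ps]

def coarse_to_fine(fine_logits: torch.Tensor) -> str:
    """Pick a superclass from pooled scores, then the best child within it."""
    probs = fine_logits.softmax(-1)
    # Coarse estimate: total probability mass per superclass.
    coarse = {s: sum(float(probs[FINE.index(p)]) for p in ps)
              for s, ps in HIERARCHY.items()}
    best_super = max(coarse, key=coarse.get)
    # Detailed prediction: argmax restricted to the chosen superclass.
    return max(HIERARCHY[best_super], key=lambda p: float(probs[FINE.index(p)]))

print(coarse_to_fine(torch.randn(len(FINE))))
```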
https://arxiv.org/abs/2403.12033
We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap: its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL.
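A minimal sketch of what length-conditioned training pairs might look like; the `<len=N>` prefix token is our own illustrative convention, not FlexCap's actual vocabulary:

```python
def make_length_conditioned_example(box, caption):
    """Pair a region (bounding box) with a length-prefixed caption target.

    box: (x1, y1, x2, y2) region coordinates; caption: its description.
    """
    n_words = len(caption.split())
    return {
        "box": box,
        "prefix": f"<len={n_words}>",  # hypothetical length-conditioning token
        "target": caption,
    }

# At inference, the requested <len=N> prefix would steer caption verbosity:
short = make_length_conditioned_example((0, 0, 64, 64), "a dog")
long = make_length_conditioned_example((0, 0, 64, 64), "a small brown dog on grass")
print(short["prefix"], long["prefix"])  # <len=2> <len=6>
```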
https://arxiv.org/abs/2403.12026
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, prior works are either constrained to limited scenarios or unable to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address this inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original Stable Diffusion model to jointly predict depth and normals, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.
https://arxiv.org/abs/2403.12013
Text-driven diffusion-based video editing presents a unique challenge not encountered in the image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by the target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match the space-time self-similarities of the original and edited videos during score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.
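A minimal sketch of the self-similarity matching idea as we read it from the abstract (the feature shapes and the MSE penalty are our assumptions):

```python
import torch
import torch.nn.functional as F

def self_similarity(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, H, W, C) space-time features -> (THW, THW) cosine similarities."""
    x = F.normalize(feats.reshape(-1, feats.shape[-1]), dim=-1)
    return x @ x.t()

def ssim_matching_loss(orig: torch.Tensor, edited: torch.Tensor) -> torch.Tensor:
    # Match the edited video's space-time self-similarity to the original's.
    return F.mse_loss(self_similarity(edited), self_similarity(orig.detach()))

loss = ssim_matching_loss(torch.randn(4, 8, 8, 64), torch.randn(4, 8, 8, 64))
print(loss)
```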
https://arxiv.org/abs/2403.12002
The ability to understand and reason about the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework that addresses 3D scene understanding in a zero-shot manner. The essence of our approach is to reconceptualize the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how human beings attempt to understand 3D scenes. Building on this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) by actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarizes the underlying knowledge. A distinctive advantage of Agent3D-Zero is its novel visual prompts, which significantly unleash the VLM's ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments.
https://arxiv.org/abs/2403.11835
3D reconstruction has been widely used in autonomous navigation for mobile robotics. However, prior research can only provide the basic geometric structure, without the capability of open-world scene understanding, limiting advanced tasks like human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision. Thus, geometric reconstruction with zero-shot scene understanding, i.e., open-vocabulary 3D understanding and reconstruction, is crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework unifying 3D scene reconstruction and open-vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with an occupancy representation and distill a pre-trained open-vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, a novel semantic-aware confidence propagation (SCP) method is proposed to relieve the issue of language field representation degeneracy caused by inconsistent measurements in distilled features. Experimental results show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects.
https://arxiv.org/abs/2403.11796
Extracting hyper-relations is crucial for constructing comprehensive knowledge graphs, but few supervised methods are available for this task. To address this gap, we introduce a zero-shot prompt-based method using OpenAI's GPT-3.5 model for extracting hyper-relational knowledge from text. Comparing our model with a baseline, we achieved promising results, with a recall of 0.77. Although our precision is currently lower, a detailed analysis of the model outputs has uncovered potential pathways for future research in this area.
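A hedged sketch of what such a zero-shot extraction call could look like with the OpenAI Python client; the prompt wording and output schema below are our assumptions, not the paper's:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract hyper-relational facts from the text as JSON lists of the form "
    "[subject, relation, object, qualifier_key, qualifier_value].\n\nText: {text}"
)

def extract_hyper_relations(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,  # deterministic extraction
    )
    return resp.choices[0].message.content

print(extract_hyper_relations(
    "Einstein received the Nobel Prize in Physics in 1921 for the photoelectric effect."
))
```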
https://arxiv.org/abs/2403.11786
Prompt ensembling of Large Language Model (LLM) generated, category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, present methods rely on hand-crafting the prompts to the LLMs that generate the VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of a short natural language description and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts, resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR improves zero-shot recognition over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
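The prompt-ensembling step can be sketched with Hugging Face's CLIP implementation: each category's LLM-generated prompts (placeholders below) are embedded and averaged into a single classifier weight.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholders for the diverse, LLM-generated category-specific prompts.
prompts = {
    "dog": ["a photo of a dog", "a close-up of a furry dog outdoors"],
    "cat": ["a photo of a cat", "a sleepy cat on a windowsill"],
}

with torch.no_grad():
    weights = []
    for texts in prompts.values():
        inp = proc(text=texts, return_tensors="pt", padding=True)
        emb = model.get_text_features(**inp)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        weights.append(emb.mean(0))  # ensemble: average the prompt embeddings
    classifier = torch.stack(weights)  # (num_classes, dim)

# A normalized image embedding from get_image_features(...) would then be
# classified via: logits = image_emb @ classifier.t()
```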
https://arxiv.org/abs/2403.11755
We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training. Previous object navigation approaches apply the image-goal navigation (ImageNav) task (go to the location of an image) for pretraining, and transfer the agent to achieve object goals using a vision-language model. However, these approaches lead to issues of semantic neglect, where the model fails to learn meaningful semantic alignments. In this paper, we propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents. Specifically, a semantic-enhanced PSL agent is proposed, and a prioritized semantic training strategy is introduced to select goal images that exhibit clear semantic supervision and to relax the reward function from strict exact view matching. At inference time, a semantic expansion inference scheme is designed to preserve the same granularity level of the goal semantics as in training. Furthermore, for the popular HM3D environment, we present an Instance Navigation (InstanceNav) task that requires going to a specific object instance with detailed descriptions, as opposed to the Object Navigation (ObjectNav) task, where the goal is defined merely by the object category. Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task. Code will be released at https://anonymous.4open.science/r/PSL/.
https://arxiv.org/abs/2403.11650
This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a degree of face similarity unmatched by existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g., FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.
https://arxiv.org/abs/2403.11641
Continual learning can empower vision-language models to continuously acquire new knowledge without the need for access to the entire historical dataset. However, mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE adapters and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%. Our code is available at this https URL.
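A minimal PyTorch sketch of a Mixture-of-Experts adapter in the spirit described; the dimensions, top-1 router, and residual form are our assumptions:

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Bottleneck adapters selected by a learned router (top-1 for brevity)."""
    def __init__(self, dim: int = 768, hidden: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.router(x).softmax(-1)            # (..., n_experts)
        idx = gate.argmax(-1, keepdim=True)          # top-1 expert per token
        out = torch.stack([e(x) for e in self.experts], dim=-2)
        chosen = out.gather(-2, idx.unsqueeze(-1).expand(*idx.shape, x.shape[-1]))
        return x + chosen.squeeze(-2)                # residual adapter update

x = torch.randn(2, 16, 768)
print(MoEAdapter()(x).shape)  # torch.Size([2, 16, 768])
```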
https://arxiv.org/abs/2403.11549
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment, which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find that the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe a very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without requiring any form of training.
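A sketch of how such an in-context prompt might be assembled; the template, reference instructions, and scene facts below are illustrative, not the paper's:

```python
REFERENCES = [  # a few human-written instructions used as in-context examples
    "Exit the bedroom, turn left, and stop at the couch by the window.",
    "Walk past the kitchen island and wait near the front door.",
]

def build_prompt(scene_facts: list[str]) -> str:
    """Pack reference instructions and VQA-gathered scene facts into one LLM prompt."""
    lines = ["Example wayfinding instructions:"]
    lines += [f"- {r}" for r in REFERENCES]
    lines.append("Details about the current environment (from visual QA):")
    lines += [f"- {f}" for f in scene_facts]
    lines.append("Write one new instruction in the same style for this environment.")
    return "\n".join(lines)

print(build_prompt(["a red armchair near the stairs", "the hallway leads to a study"]))
```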
https://arxiv.org/abs/2403.11487
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
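The structured memory can be pictured as two stores, one for temporal event captions and one for object-centric tracks; this dataclass sketch is our illustration of the abstract's description, not the released code:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectTrack:
    object_id: int
    category: str
    # frame index -> bounding box (x1, y1, x2, y2)
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class VideoMemory:
    # segment (start_sec, end_sec) -> generic temporal event description
    events: dict[tuple[float, float], str] = field(default_factory=dict)
    tracks: list[ObjectTrack] = field(default_factory=list)

    def localize(self, keyword: str) -> list[tuple[float, float]]:
        """Tool: return video segments whose description mentions the keyword."""
        return [seg for seg, desc in self.events.items() if keyword in desc]

mem = VideoMemory(events={(0.0, 5.0): "a person opens the fridge"})
print(mem.localize("fridge"))  # [(0.0, 5.0)]
```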
https://arxiv.org/abs/2403.11481
Conventional approaches to facial expression recognition primarily focus on the classification of six basic facial expressions. Nevertheless, real-world situations present a wider range of complex compound expressions, consisting of combinations of these basic ones, for which comprehensive training datasets are of limited availability. The 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) offered unlabeled datasets containing compound expressions. In this study, we propose a zero-shot approach for recognizing compound expressions by leveraging a pretrained visual language model integrated with traditional CNNs.
https://arxiv.org/abs/2403.11450
Universal image restoration is a practical and promising computer vision task for real-world applications. The main challenge of this task is handling different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g., prompts) to guide the model to learn different distributions separately, termed multi-partite mapping. However, this is not suitable for universal model learning, as it ignores the shared information between different tasks. In this work, we propose an advanced selective hourglass mapping strategy based on a diffusion model, termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly, we equip the model with strong condition guidance to obtain an accurate generation direction of the diffusion model (selective). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally, which gradually maps different distributions into a shared one. In the reverse process, combined with the SDT and strong condition guidance, DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles, by only modifying the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks and 22 benchmarks in the universal and zero-shot generalization settings. Surprisingly, using only a lightweight model (only 0.89M parameters), we achieve outstanding performance. The source code and pre-trained models are available at this https URL.
https://arxiv.org/abs/2403.11157
Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing.
https://arxiv.org/abs/2403.11131
The Segment Anything Model (SAM), with its remarkable zero-shot capability, has proven to be a powerful foundation model for image segmentation, an important task in computer vision. However, the transfer of its rich semantic information to multiple different downstream tasks remains unexplored. In this paper, we propose the Task-Aware Low-Rank Adaptation (TA-LoRA) method, which enables SAM to work as a foundation model for multi-task learning. Specifically, TA-LoRA injects an update parameter tensor into each layer of the encoder in SAM and leverages a low-rank tensor decomposition method to incorporate both task-shared and task-specific information. Furthermore, we introduce a modified SAM (mSAM) for multi-task learning, in which we remove the prompt encoder of SAM and use task-specific no-mask embeddings and a mask decoder for each task. Extensive experiments conducted on benchmark datasets substantiate the efficacy of TA-LoRA in enhancing the performance of mSAM across multiple downstream tasks.
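As a rough illustration of injecting trainable low-rank updates into frozen encoder layers, here is a standard LoRA-style linear layer (the paper's task-shared/task-specific tensor decomposition is more elaborate):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```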
https://arxiv.org/abs/2403.10971
Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system's dynamics, such as the mass or dimensions of a robot, without making further simplifying assumptions about the observability of the Markovian state. Toward the goal of ZSG to unseen variation in context, we propose the contextual recurrent state-space model (cRSSM), which introduces changes to the world model of Dreamer v3 (Hafner et al., 2023). This allows the world model to incorporate context for inferring latent Markovian states from the observations and for modeling the latent dynamics. Our experiments show that such systematic incorporation of the context improves the ZSG of policies trained on the ``dreams'' of the world model. We further find, qualitatively, that our approach allows Dreamer to disentangle the latent state from context, allowing it to extrapolate its dreams to the many worlds of unseen contexts. The code for all our experiments is available at \url{this https URL}.
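A heavily simplified sketch of conditioning a recurrent state-space model on context, with the context vector entering both the transition and the reconstruction paths; this toy GRU stand-in is our own assumption, not Dreamer v3's actual RSSM:

```python
import torch
import torch.nn as nn

class ContextualRSSM(nn.Module):
    def __init__(self, obs_dim=32, ctx_dim=2, state_dim=64):
        super().__init__()
        # Context enters both the latent dynamics and the observation decoder.
        self.cell = nn.GRUCell(obs_dim + ctx_dim, state_dim)
        self.decoder = nn.Linear(state_dim + ctx_dim, obs_dim)

    def step(self, h, obs, ctx):
        h = self.cell(torch.cat([obs, ctx], -1), h)   # transition given context
        pred = self.decoder(torch.cat([h, ctx], -1))  # reconstruction given context
        return h, pred

m = ContextualRSSM()
h = torch.zeros(1, 64)
h, pred = m.step(h, torch.randn(1, 32), torch.tensor([[1.0, 0.5]]))  # e.g. mass, length
```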
https://arxiv.org/abs/2403.10967
Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground-truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and the ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview consistency and pose consistency over existing methods.
https://arxiv.org/abs/2403.10953
Generalist foundation models have ushered in newfound capabilities in the medical domain. However, the tension between the growing demand for high-quality annotated data and patient privacy continues to intensify. The utilization of medical artificial intelligence generated content (Med-AIGC) as an inexhaustible resource repository arises as a potential solution to this challenge. Here we harness 1 million open-source synthetic fundus images paired with natural language descriptions to curate an ethical language-image foundation model for retinal image analysis named VisionCLIP. VisionCLIP achieves competitive performance on three external datasets in a zero-shot fashion, compared with an existing method pre-trained on real-world data. The employment of artificially synthetic images alongside corresponding textual data for training enables the medical foundation model to successfully assimilate knowledge of disease symptomatology, thereby circumventing potential breaches of patient confidentiality.
https://arxiv.org/abs/2403.10823