In this paper, we explore the visual representations produced by a pretrained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', with dedicated components built upon a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using the standard Gaussian noise, we propose to predict the video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pre-training tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at \url{this https URL}
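To make the pipeline concrete, here is a minimal sketch, under our own assumptions, of extracting text-conditioned features from a frozen T2V UNet with video-specific noise instead of i.i.d. Gaussian noise; `NoisePredictor`, the `return_features` flag, and the `scheduler`/`unet` handles are hypothetical stand-ins rather than the released VD-IT code:

```python
import torch
import torch.nn as nn


class NoisePredictor(nn.Module):
    """Predicts video-specific noise from the clean video latent (B, C, T, H, W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.net(latent)


def extract_features(unet, scheduler, latent, text_emb, noise_predictor, t=100):
    # Video-specific noise replaces the usual torch.randn_like(latent).
    noise = noise_predictor(latent)
    timestep = torch.tensor([t], device=latent.device)
    noisy_latent = scheduler.add_noise(latent, noise, timestep)
    # One text-conditioned denoising pass through the frozen T2V UNet; the
    # hypothetical `return_features=True` stands for hooking intermediate activations.
    with torch.no_grad():
        features = unet(noisy_latent, timestep,
                        encoder_hidden_states=text_emb, return_features=True)
    return features  # fed to a segmentation head for R-VOS
```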
https://arxiv.org/abs/2403.12042
Dataset distillation aims to compress a dataset into a much smaller one so that a model trained on the distilled dataset achieves high accuracy. Current methods frame this as maximizing the distilled classification accuracy for a budget of K distilled images-per-class, where K is a positive integer. In this paper, we push the boundaries of dataset distillation, compressing the dataset into less than an image-per-class. It is important to realize that the meaningful quantity is not the number of distilled images-per-class but the number of distilled pixels-per-dataset. We therefore propose Poster Dataset Distillation (PoDD), a new approach that distills the entire original dataset into a single poster. The poster approach motivates new technical solutions for creating training images and learnable labels. Our method can achieve comparable or better performance with less than an image-per-class compared to existing methods that use one image-per-class. Specifically, our method establishes a new state-of-the-art performance on CIFAR-10, CIFAR-100, and CUB200 using as little as 0.3 images-per-class.
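A hedged sketch of the poster idea (not the official PoDD implementation): the distilled dataset is one learnable poster tensor, training images are overlapping crops of it, and each crop gets a learnable soft label; all sizes below are illustrative:

```python
import torch
import torch.nn as nn


class Poster(nn.Module):
    def __init__(self, height=96, width=320, patch=32, stride=16, num_classes=10):
        super().__init__()
        self.poster = nn.Parameter(torch.randn(3, height, width) * 0.1)
        self.patch, self.stride = patch, stride
        n_h = (height - patch) // stride + 1
        n_w = (width - patch) // stride + 1
        # One learnable soft label per overlapping crop.
        self.labels = nn.Parameter(torch.zeros(n_h * n_w, num_classes))

    def crops(self):
        p = self.poster.unsqueeze(0)  # (1, 3, H, W)
        patches = p.unfold(2, self.patch, self.stride).unfold(3, self.patch, self.stride)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, self.patch, self.patch)
        return patches, self.labels.softmax(dim=-1)


poster = Poster()
images, soft_labels = poster.crops()  # both the poster pixels and the labels are optimized
```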
https://arxiv.org/abs/2403.12040
Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on various dense correspondence tasks. We also demonstrate our effectiveness in keypoint correspondence and affordance map transfer.
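A simplified illustration of the functional-map step (our own toy version, not the paper's implementation, which would use better bases and regularizers): deep feature channels are treated as functions on each image, expressed in per-image PCA bases, and a small matrix C mapping coefficients from image A to image B is solved by least squares before recovering pointwise matches:

```python
import numpy as np


def functional_map(feat_a, feat_b, k=64):
    """feat_a, feat_b: (num_pixels, d) flattened feature grids, assuming d >= k."""
    # Per-image bases (plain SVD/PCA here, standing in for e.g. a Laplacian eigenbasis).
    phi_a, _, _ = np.linalg.svd(feat_a, full_matrices=False)
    phi_b, _, _ = np.linalg.svd(feat_b, full_matrices=False)
    phi_a, phi_b = phi_a[:, :k], phi_b[:, :k]
    a_coef = phi_a.T @ feat_a  # (k, d) coefficients of the feature channels
    b_coef = phi_b.T @ feat_b  # (k, d)
    # Solve C @ a_coef ~= b_coef in the least-squares sense; C is (k, k).
    ct, *_ = np.linalg.lstsq(a_coef.T, b_coef.T, rcond=None)
    return ct.T, phi_a, phi_b


def point_correspondence(C, phi_a, phi_b):
    # Map each pixel of A into B's functional space and take the nearest pixel of B.
    mapped = phi_a @ C.T          # (num_pixels_a, k)
    scores = mapped @ phi_b.T     # (num_pixels_a, num_pixels_b)
    return scores.argmax(axis=1)  # matched pixel index in B for every pixel in A
```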
https://arxiv.org/abs/2403.12038
Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability, and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, design an instance-aware region selection instead of a random one to obtain better textual controllability, and employ a novel strategy to inject personalized models into our CoCoCo model for better compatibility. Extensive experiments show that our model can generate high-quality video clips. Meanwhile, our model shows better motion consistency, textual controllability, and model compatibility. More details are shown in [this http URL](this http URL).
https://arxiv.org/abs/2403.12035
This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.
https://arxiv.org/abs/2403.12034
Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such a challenging setting. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at this https URL.
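As an illustration of what procedurally generated corruptions might look like, here is a toy fog effect written for demonstration, not the benchmark's exact recipe:

```python
import numpy as np


def add_fog(image: np.ndarray, severity: int = 3, seed: int = 0) -> np.ndarray:
    """image: float array in [0, 1] with shape (H, W, 3)."""
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    # Coarse random field block-upsampled with np.kron -> low-frequency fog patches.
    coarse = rng.uniform(0.0, 1.0, size=(h // 32 + 1, w // 32 + 1))
    density = np.kron(coarse, np.ones((32, 32)))[:h, :w, None]
    alpha = 0.15 * severity * density          # higher severity -> denser fog
    fogged = image * (1 - alpha) + alpha       # blend toward white
    return np.clip(fogged, 0.0, 1.0)
```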
https://arxiv.org/abs/2403.12033
Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
https://arxiv.org/abs/2403.12032
As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present ROUTERBENCH, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through ROUTERBENCH, highlighting their potential and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at this https URL.
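To make the routing problem concrete, here is a toy router (our illustration, not the ROUTERBENCH API) that picks, per prompt, the model maximizing predicted quality minus a price-of-cost term, tracing a quality/cost trade-off as the weight grows; model names and numbers below are made up:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_quality: float  # e.g., expected accuracy on this prompt, in [0, 1]
    cost: float               # e.g., dollars per call


def route(candidates: list[Candidate], lam: float) -> Candidate:
    """Pick the model with the best quality-minus-cost utility for trade-off weight lam."""
    return max(candidates, key=lambda c: c.predicted_quality - lam * c.cost)


candidates = [
    Candidate("small-llm", predicted_quality=0.62, cost=0.0004),
    Candidate("medium-llm", predicted_quality=0.74, cost=0.0030),
    Candidate("large-llm", predicted_quality=0.81, cost=0.0200),
]
for lam in (0.0, 5.0, 50.0):
    print(lam, route(candidates, lam).name)  # cheaper models win as lam grows
```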
https://arxiv.org/abs/2403.12031
Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Despite the strong performance of Pre-Trained Models (PTMs) in CIL, a critical issue persists: learning new classes often results in the overwriting of old ones. Excessive modification of the network causes forgetting, while minimal adjustments lead to an inadequate fit for new classes. As a result, it is desirable to find a way to update the model efficiently without harming former knowledge. In this paper, we propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model updating without conflict, we train a distinct lightweight adapter module for each new task, aiming to create task-specific subspaces. These adapters span a high-dimensional feature space, enabling joint decision-making across multiple subspaces. As data evolves, the expanding subspaces render the old class classifiers incompatible with new-stage spaces. Correspondingly, we design a semantic-guided prototype complement strategy that synthesizes old classes' new features without using any old class instance. Extensive experiments on seven benchmark datasets verify EASE's state-of-the-art performance. Code is available at: this https URL
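One possible reading of the semantic-guided prototype complement, sketched under our own assumptions (not the official EASE code): an old class has no data in a newly added subspace, so its prototype there is synthesized as a similarity-weighted mixture of new-class prototypes, with similarities measured in an old subspace where both are available:

```python
import torch
import torch.nn.functional as F


def complement_prototypes(old_protos_old, new_protos_old, new_protos_new, tau=0.1):
    """
    old_protos_old: (n_old, d)  old-class prototypes in an old subspace
    new_protos_old: (n_new, d)  new-class prototypes in the same old subspace
    new_protos_new: (n_new, d') new-class prototypes in the newly added subspace
    returns:        (n_old, d') synthesized old-class prototypes for the new subspace
    """
    sim = F.normalize(old_protos_old, dim=-1) @ F.normalize(new_protos_old, dim=-1).T
    weights = torch.softmax(sim / tau, dim=-1)   # (n_old, n_new), sharper for small tau
    return weights @ new_protos_new              # similarity-weighted mixture
```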
https://arxiv.org/abs/2403.12030
3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and struggle to capture the detailed appearance of the human body. In this paper, we propose a new method called \emph{Ultraman} for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, \emph{Ultraman} greatly improves the reconstruction speed and accuracy while preserving high-quality texture details. We present a new framework for human reconstruction consisting of three parts: geometric reconstruction, texture generation, and texture mapping. Firstly, a mesh reconstruction framework is used, which accurately extracts 3D human shapes from a single image. At the same time, we propose a method to generate a multi-view consistent image of the human body based on a single image. This is finally combined with a novel texture mapping method to optimize texture details and ensure color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of \emph{Ultraman} on various standard datasets. In addition, \emph{Ultraman} outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.
https://arxiv.org/abs/2403.12028
The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.
https://arxiv.org/abs/2403.12019
Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at? We propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and simulator objectives that the agents should learn and is then asked to generate a set of environment configurations (e.g., different terrains, items given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills that the agent is weak at, by providing feedback to the LLM in the form of the agent's performance. We demonstrate the usefulness of EnvGen with comprehensive experiments in Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We show qualitatively how the LLM adapts training environments to help improve RL agents' weaker skills over time. Additionally, EnvGen is substantially more efficient as it only uses a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of LLM calls. Lastly, we present detailed ablation studies for our design choices.
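A schematic of the EnvGen-style loop as the abstract describes it; `llm`, `make_env`, `train_agent`, and `evaluate` are hypothetical placeholders, not the released interface:

```python
def envgen_loop(llm, make_env, train_agent, evaluate, task_description, n_cycles=4):
    agent, feedback = None, "no results yet"
    for cycle in range(n_cycles):
        prompt = (f"Task: {task_description}\n"
                  f"Agent performance so far: {feedback}\n"
                  "Propose environment configurations (terrain, items, starting "
                  "conditions) that let the agent practice its weakest skills.")
        configs = llm(prompt)                                     # e.g., a list of dicts
        envs = [make_env(c) for c in configs] + [make_env(None)]  # keep the original env in the mix
        agent = train_agent(agent, envs)                          # small RL agent, cheap to train
        per_task_success = evaluate(agent)                        # dict: task name -> success rate
        feedback = ", ".join(f"{t}: {s:.2f}" for t, s in per_task_success.items())
    return agent
```

Note that only one LLM call per cycle is needed, which is where the abstract's efficiency claim (a handful of calls total versus thousands for an LLM agent) comes from.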
https://arxiv.org/abs/2403.12014
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, the prior works either are constrained to limited scenarios or suffer from the inability to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original stable diffusion model to jointly predict depth and normal, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.
https://arxiv.org/abs/2403.12013
3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis, we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: this https URL
https://arxiv.org/abs/2403.12011
Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike prior work that leverages images from 2D diffusion models for training, we propose a dense, consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video datasets used to train these models are abundant and diverse, leading to a reduced training-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to obtain an explicit global 3D model, and then adopts a sampling strategy that incorporates images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is this http URL.
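A high-level sketch of what 3D-aware denoising sampling could look like, assuming a diffusers-style scheduler with `step`/`add_noise`; `denoiser`, `reconstructor`, and `render` are placeholders, and the blending schedule is illustrative rather than the paper's exact procedure:

```python
import torch


def sample_multiview(scheduler, denoiser, reconstructor, render, cameras,
                     init_latents, prompt_emb, blend=0.5, lift_every=5):
    latents, gaussians = init_latents, None          # latents: (num_views, C, H, W)
    for i, t in enumerate(scheduler.timesteps):
        noise_pred = denoiser(latents, t, prompt_emb)          # joint multi-view prediction
        step = scheduler.step(noise_pred, t, latents)
        latents, x0 = step.prev_sample, step.pred_original_sample
        if i % lift_every == 0 and i + 1 < len(scheduler.timesteps):
            gaussians = reconstructor(x0, cameras)             # feed-forward global 3D model
            rendered = render(gaussians, cameras)              # mutually consistent re-renders
            x0 = blend * rendered + (1 - blend) * x0           # pull views toward the 3D model
            next_t = scheduler.timesteps[i + 1]
            latents = scheduler.add_noise(x0, torch.randn_like(x0), next_t)
    return latents, gaussians
```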
https://arxiv.org/abs/2403.12010
In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-7B (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.
https://arxiv.org/abs/2403.12009
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent works on 3D generation propose techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics as well as a user study demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.
https://arxiv.org/abs/2403.12008
Text-driven diffusion-based video editing presents a unique challenge not encountered in the image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by the target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match the space-time self-similarities of the original video and the edited video during score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.
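One plausible instantiation of the space-time self-similarity matching (our sketch, not the authors' code): flatten features of the original and edited videos over space and time, compute each video's cosine self-similarity matrix, and penalize their difference alongside the score-distillation objective:

```python
import torch
import torch.nn.functional as F


def self_similarity(features: torch.Tensor) -> torch.Tensor:
    """features: (T, H, W, C) -> (T*H*W, T*H*W) cosine self-similarity over space-time."""
    flat = F.normalize(features.reshape(-1, features.shape[-1]), dim=-1)
    return flat @ flat.T


def structure_motion_loss(orig_feats: torch.Tensor, edit_feats: torch.Tensor) -> torch.Tensor:
    # Added to the score-distillation objective; the original video's matrix is a fixed target.
    return F.mse_loss(self_similarity(edit_feats), self_similarity(orig_feats.detach()))
```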
https://arxiv.org/abs/2403.12002
Using generative Artificial Intelligence (AI), we transformed a set of 1,000 scientific papers in the area of biological materials into detailed ontological knowledge graphs, revealing their inherently scale-free nature. Using graph traversal path detection between dissimilar concepts based on combinatorial ranking of node similarity and betweenness centrality, we reveal deep insights into unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, and propose never-before-seen material designs and their behaviors. One comparison revealed detailed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. The algorithm further created an innovative hierarchical mycelium-based composite that incorporates joint synthesis of graph sampling with principles extracted from Kandinsky's Composition VII painting, where the resulting composite reflects a balance of chaos and order, with features like adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across physical, biological, and artistic spheres, revealing a nuanced ontology of immanence and material flux that resonates with postmodern philosophy, and positions these interconnections within a heterarchical framework. Our findings reveal the dynamic, context-dependent interplay of entities beyond traditional hierarchical paradigms, emphasizing the significant role of individual components and their fluctuative relationships within the system. Our predictions achieve a far higher degree of novelty, technical detail and explorative capacity than conventional generative AI methods. The approach establishes a widely useful framework for innovation by revealing hidden connections that facilitate discovery.
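For intuition, a tiny networkx example (toy graph and concepts invented for demonstration, not the paper's ontology) of ranking bridge nodes by betweenness centrality and finding a traversal path between distant concepts:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("spider silk", "hierarchical structure"), ("hierarchical structure", "toughness"),
    ("toughness", "composite design"), ("composite design", "mycelium composite"),
    ("hierarchical structure", "musical form"), ("musical form", "symphony"),
])

centrality = nx.betweenness_centrality(G)               # how often a node bridges other nodes
bridges = sorted(centrality, key=centrality.get, reverse=True)[:3]
path = nx.shortest_path(G, "spider silk", "symphony")   # traversal between dissimilar concepts

print("top bridge concepts:", bridges)
print("path:", " -> ".join(path))
```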
https://arxiv.org/abs/2403.11996
Mesh is a fundamental representation of 3D assets in various industrial applications, and is widely supported by professional software. However, due to its irregular structure, mesh creation and manipulation are often time-consuming and labor-intensive. In this paper, we propose a highly controllable generative model, GetMesh, for mesh generation and manipulation across different categories. By taking a varying number of points as the latent representation, and re-organizing them as a triplane representation, GetMesh generates meshes with rich and sharp details, outperforming both single-category and multi-category counterparts. Moreover, it also enables fine-grained control over the generation process that previous mesh generative models cannot achieve, where changing global/local mesh topologies, adding/removing mesh parts, and combining mesh parts across categories can be intuitively, efficiently, and robustly accomplished by adjusting the number, positions, or features of latent points. Project page is this https URL.
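A rough sketch of re-organizing latent points into a triplane (our illustration of the idea, not the GetMesh implementation): each latent point carries a position in [-1, 1]^3 and a feature vector, and features are splatted onto the XY, XZ, and YZ planes by rasterizing point positions into plane cells:

```python
import torch


def points_to_triplane(positions, features, resolution=64):
    """
    positions: (N, 3) in [-1, 1];  features: (N, C)
    returns three (C, resolution, resolution) feature planes (XY, XZ, YZ).
    """
    _, c = features.shape
    idx = ((positions.clamp(-1, 1) + 1) / 2 * (resolution - 1)).long()  # per-axis cell indices
    planes = []
    for a, b in ((0, 1), (0, 2), (1, 2)):                               # XY, XZ, YZ projections
        flat = torch.zeros(c, resolution * resolution)
        cell = idx[:, a] * resolution + idx[:, b]                       # flattened cell id per point
        flat.index_add_(1, cell, features.T)                            # splat (sum) features per cell
        planes.append(flat.reshape(c, resolution, resolution))
    return planes
```

Adding or removing latent points, or moving their positions, directly edits these planes, which is the kind of point-level control the abstract describes.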
https://arxiv.org/abs/2403.11990