Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
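To make the pre-training objective concrete, here is a minimal sketch of masked step modeling as described above, assuming pre-extracted per-step video features and a fixed vocabulary of weakly supervised step labels; all module names and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedStepModel(nn.Module):
    """Toy masked step modeling: predict the labels of masked-out steps
    using the whole task video as context."""
    def __init__(self, feat_dim=512, num_step_labels=10000, depth=4, heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(feat_dim, num_step_labels)

    def forward(self, step_feats, mask):
        # step_feats: (B, S, D) per-step video features; mask: (B, S) bool, True = masked out
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(step_feats), step_feats)
        x = self.encoder(x)                 # global context over the entire task
        return self.head(x)                 # (B, S, num_step_labels)

B, S, D = 2, 8, 512
feats, labels = torch.randn(B, S, D), torch.randint(0, 10000, (B, S))
mask = torch.rand(B, S) < 0.25
mask[:, 0] = True                           # guarantee at least one masked step per video
logits = MaskedStepModel()(feats, mask)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])   # supervise only the masked steps
```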
https://arxiv.org/abs/2303.13519
The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
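To illustrate the second method (not the paper's exact architecture), a trainable gated shortcut can wrap newly initialized detection layers so that, at the start of training, the output equals the frozen pretrained path and the vision-text alignment is untouched; the module below is a hedged sketch with assumed names and shapes.

```python
import torch
import torch.nn as nn

class GatedShortcut(nn.Module):
    """Add a new trainable branch on top of a pretrained feature path through a
    gate initialized at zero, so the initial output equals the pretrained features."""
    def __init__(self, channels):
        super().__init__()
        self.new_branch = nn.Sequential(          # randomly initialized detection layers
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> pure shortcut at the start

    def forward(self, pretrained_feat):
        return pretrained_feat + torch.tanh(self.gate) * self.new_branch(pretrained_feat)

feat = torch.randn(1, 256, 64, 64)                # e.g. one FPN level from a pretrained backbone
out = GatedShortcut(256)(feat)
print(torch.allclose(out, feat))                  # True before any training step
```

Because the gate starts at zero, the detector can only gradually drift away from the pretrained vision-text feature space as training proceeds.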
https://arxiv.org/abs/2303.13518
Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model.
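A hedged sketch of the distribution-matching objective described above: the fine-tuned model's prediction under the target prompt is regressed toward the frozen pretrained model's prediction under the anchor prompt (with a stop-gradient). The `student_unet`, `frozen_unet`, and `encode_prompt` callables are stand-ins, not an actual diffusion library API.

```python
import torch

def ablation_loss(student_unet, frozen_unet, encode_prompt, x_t, t,
                  target_prompt="in the style of <artist>", anchor_prompt="painting"):
    """Pull the fine-tuned model's prediction for the target concept toward the
    frozen model's prediction for an anchor concept (stop-gradient on the anchor)."""
    with torch.no_grad():
        anchor_pred = frozen_unet(x_t, t, encode_prompt(anchor_prompt))
    target_pred = student_unet(x_t, t, encode_prompt(target_prompt))
    return torch.mean((target_pred - anchor_pred) ** 2)

# Toy check with stand-in callables and random tensors (real code would plug in a
# diffusion U-Net and text encoder here).
dummy_unet = lambda x, t, c: x + c.mean()
dummy_encode = lambda prompt: torch.randn(4)
x_t, t = torch.randn(2, 3, 8, 8), torch.tensor([10, 20])
print(ablation_loss(dummy_unet, dummy_unet, dummy_encode, x_t, t).item())
```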
https://arxiv.org/abs/2303.13516
Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency--for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: this https URL.
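For context on the rendering side only (generic volume rendering, not the paper's code): features sampled along camera rays from the layout grid are decoded to density and color and alpha-composited, and the panoramic skydome fills whatever transmittance remains. A minimal compositing sketch under those assumptions:

```python
import torch

def composite_rays(density, color, deltas, sky_color):
    """Standard volume rendering with a sky term weighted by leftover transmittance.
    density: (R, S), color: (R, S, 3), deltas: (R, S), sky_color: (R, 3)."""
    alpha = 1.0 - torch.exp(-density * deltas)                  # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                     # contribution of each sample
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)
    leftover = trans[:, -1:] * (1.0 - alpha[:, -1:])            # light that reaches the skydome
    return rgb + leftover * sky_color

R, S = 4, 16
rgb = composite_rays(torch.rand(R, S), torch.rand(R, S, 3),
                     torch.full((R, S), 0.1), torch.rand(R, 3))
```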
https://arxiv.org/abs/2303.13515
We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.
https://arxiv.org/abs/2303.13514
To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use human feedback as channels to learn the desired behavior. We describe the competition and provide an overview of the top solutions. We conclude by discussing the impact of the competition and future directions for improvement.
https://arxiv.org/abs/2303.13512
In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization.
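A hedged sketch of the deterministic color-mapping idea: a small encoder inspects a downsampled copy of the image and predicts one image-adaptive color transform that is applied identically to every pixel at full resolution. The k-dimensional lift/projection and the toy encoder below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNCMSketch(nn.Module):
    """Predict one small color-mapping matrix per image and apply it per pixel."""
    def __init__(self, k=16):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, k * k))
        self.up = nn.Linear(3, k)      # lift RGB into a k-dimensional color space
        self.down = nn.Linear(k, 3)    # project back to RGB

    def forward(self, image):
        B, _, H, W = image.shape
        T = self.encoder(F.interpolate(image, size=256)).view(B, self.k, self.k)
        pixels = image.permute(0, 2, 3, 1).reshape(B, H * W, 3)
        mapped = self.down(torch.bmm(self.up(pixels), T))        # same matrix for every pixel
        return mapped.reshape(B, H, W, 3).permute(0, 3, 1, 2)

out = DNCMSketch()(torch.rand(1, 3, 512, 512))   # high-resolution input, small memory footprint
```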
https://arxiv.org/abs/2303.13511
This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark will be available at this https URL.
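The exact Reversed-Furthest-Voxel-Sampling procedure is specified in the paper; as background, plain furthest point sampling over voxel centers, which such a strategy builds on to decide which voxels to mask under uneven LiDAR density, looks like the following sketch (names and toy data are illustrative).

```python
import numpy as np

def furthest_point_sampling(centers, num_samples):
    """Plain furthest point sampling over voxel centers (N, 3): greedily pick
    indices of num_samples centers that are mutually far apart."""
    chosen = [0]
    dist = np.linalg.norm(centers - centers[0], axis=1)
    for _ in range(num_samples - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(centers - centers[nxt], axis=1))
    return np.array(chosen)

centers = np.random.rand(1000, 3) * 50.0          # toy occupied-voxel centers in meters
sampled = furthest_point_sampling(centers, 64)    # candidate voxels to keep or to mask
```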
https://arxiv.org/abs/2303.13510
DEtection TRansformer (DETR) started a trend that uses a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in the sparse point clouds are relatively small compared to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, which are rare in the image domain. Considering that instances in 3D are characterized more by their positional information, we emphasize their roles during the modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL.
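A loose sketch of what a mixed-parameterized positional embedding could look like: the same 3D position is encoded under more than one parameterization and the embeddings are fused before being added to backbone features. The choice of Cartesian plus cylindrical-polar coordinates below is an assumption for illustration, not necessarily the paper's parameterization.

```python
import torch
import torch.nn as nn

class MixedPositionalEmbedding(nn.Module):
    """Fuse embeddings of the same 3D position under two parameterizations."""
    def __init__(self, dim=128):
        super().__init__()
        self.cartesian = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.polar = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz):                       # xyz: (N, 3) point or voxel positions
        x, y, z = xyz.unbind(-1)
        cyl = torch.stack([torch.sqrt(x * x + y * y), torch.atan2(y, x), z], dim=-1)
        return self.cartesian(xyz) + self.polar(cyl)

emb = MixedPositionalEmbedding()(torch.randn(1024, 3))   # added to backbone features downstream
```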
https://arxiv.org/abs/2303.13509
We present DreamBooth3D, an approach to personalize text-to-3D generative models from as few as 3-6 casually captured images of a subject. Our approach combines recent advances in personalizing text-to-image models (DreamBooth) with text-to-3D generation (DreamFusion). We find that naively combining these methods fails to yield satisfactory subject-specific 3D assets due to personalized text-to-image models overfitting to the input viewpoints of the subject. We overcome this through a 3-stage optimization strategy where we jointly leverage the 3D consistency of neural radiance fields together with the personalization capability of text-to-image models. Our method can produce high-quality, subject-specific 3D assets with text-driven modifications such as novel poses, colors and attributes that are not seen in any of the input images of the subject.
https://arxiv.org/abs/2303.13508
The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observation suggests that current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL
https://arxiv.org/abs/2303.13505
Most video restoration networks are slow, have high computational load, and can't be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use-cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach where the last frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
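A much-reduced sketch of the recurrent idea described above: the previous frame's prediction is fed back as extra input channels when enhancing the current frame. The tiny convolutional stand-in below replaces the actual dual-branch bottleneck-mixer architecture.

```python
import torch
import torch.nn as nn

class RecurrentEnhancerSketch(nn.Module):
    """Enhance frame t conditioned on the prediction made for frame t-1."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, frames):                    # frames: (T, 3, H, W) degraded video
        prev, outputs = torch.zeros_like(frames[0]), []
        for frame in frames:
            x = torch.cat([frame, prev], dim=0).unsqueeze(0)     # current frame + last prediction
            prev = self.net(x).squeeze(0) + frame                # residual enhancement
            outputs.append(prev)                                 # reused for the next frame
        return torch.stack(outputs)

enhanced = RecurrentEnhancerSketch()(torch.rand(5, 3, 64, 64))
```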
https://arxiv.org/abs/2303.13504
This paper presents a new, provably-convergent algorithm for computing the flag-mean and flag-median of a set of points on a flag manifold under the chordal metric. The flag manifold is a mathematical space consisting of flags, which are sequences of nested subspaces of a vector space that increase in dimension. The flag manifold is a superset of a wide range of known matrix groups, including Stiefel and Grassmannian manifolds, making it a general object that is useful in a wide variety of computer vision problems. To tackle the challenge of computing first order flag statistics, we first transform the problem into one that involves auxiliary variables constrained to the Stiefel manifold. The Stiefel manifold is a space of orthogonal frames, and leveraging the numerical stability and efficiency of Stiefel-manifold optimization enables us to compute the flag-mean effectively. Through a series of experiments, we show the competence of our method in Grassmann and rotation averaging, as well as principal component analysis.
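For intuition only (this is not the paper's provably-convergent algorithm), the chordal mean in the simplest single-subspace (Grassmannian) case can be read off from an SVD of the stacked orthonormal bases; the paper's contribution is the Stiefel-constrained optimization that handles general flags and the flag-median.

```python
import numpy as np

def grassmann_flag_mean(bases, k):
    """Chordal mean of subspaces span(U_i): the span of the top-k left singular
    vectors of the horizontally stacked orthonormal bases."""
    stacked = np.hstack(bases)                    # (n, m*k) for m subspaces of dimension k
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :k]                               # orthonormal basis of the mean subspace

rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(5)]
mean_basis = grassmann_flag_mean(bases, 3)
```

The same SVD viewpoint explains why the chordal metric is convenient: the mean maximizes the total squared projection onto the data subspaces, which is exactly a dominant-singular-subspace problem.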
https://arxiv.org/abs/2303.13501
Recent progress in NeRF-based GANs has introduced a number of approaches for high-resolution and high-fidelity generative modeling of human heads with a possibility for novel view rendering. At the same time, one must solve an inverse problem to be able to re-render or modify an existing image or video. Despite the success of universal optimization-based methods for 2D GAN inversion, those methods, applied to 3D GANs, may fail to produce 3D-consistent renderings. Fast encoder-based techniques, such as those developed for StyleGAN, may also be less appealing due to the lack of identity preservation. In our work, we introduce a real-time method that bridges the gap between the two approaches by directly utilizing the tri-plane representation introduced for the EG3D generative model. In particular, we build upon a feed-forward convolutional encoder for the latent code and extend it with a fully-convolutional predictor of tri-plane numerical offsets. As shown in our work, the renderings are similar in quality to optimization-based techniques and significantly outperform the baselines for novel views. As we empirically prove, this is a consequence of directly operating in the tri-plane space, not in the GAN parameter space, while making use of an encoder-based trainable approach.
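A structural sketch of the idea, with stub submodules rather than the actual EG3D interfaces: a feed-forward encoder predicts the latent code, a fully-convolutional head predicts numerical offsets in tri-plane space, and the final tri-planes are the generator's tri-planes plus those offsets. All shapes and layer choices below are assumptions.

```python
import torch
import torch.nn as nn

class TriPlaneInverterSketch(nn.Module):
    """Predict a latent code and additive tri-plane offsets from a single image."""
    def __init__(self, latent_dim=512, plane_ch=32, plane_res=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(8))
        self.to_latent = nn.Linear(64 * 8 * 8, latent_dim)
        self.to_offsets = nn.Sequential(                        # fully-convolutional head
            nn.Conv2d(64, 3 * plane_ch, 3, padding=1),
            nn.Upsample(size=plane_res, mode="bilinear", align_corners=False))

    def forward(self, image, generator_triplanes):
        feat = self.backbone(image)                             # (B, 64, 8, 8)
        latent = self.to_latent(feat.flatten(1))                # code fed to the generator
        offsets = self.to_offsets(feat)                         # (B, 3*C, res, res)
        return latent, generator_triplanes + offsets            # corrected tri-planes

planes = torch.zeros(1, 96, 256, 256)                           # stand-in for generator tri-planes
latent, refined = TriPlaneInverterSketch()(torch.rand(1, 3, 512, 512), planes)
```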
https://arxiv.org/abs/2303.13497
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
https://arxiv.org/abs/2303.13496
Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior" - real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior. 2) The relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
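A hedged sketch of the relation-steering idea built on the preposition prior: the learned relation embedding is pulled toward a basis set of preposition token embeddings and pushed away from other word embeddings with an InfoNCE-style loss. The word-set sizes, dimensions, and temperature below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def relation_steering_loss(relation_emb, prep_embs, other_embs, temperature=0.07):
    """Multi-positive InfoNCE: preposition embeddings are positives, other words negatives.
    relation_emb: (D,), prep_embs: (P, D), other_embs: (N, D) text-encoder embeddings."""
    rel = F.normalize(relation_emb, dim=0)
    pos = F.normalize(prep_embs, dim=1) @ rel / temperature     # (P,) similarities
    neg = F.normalize(other_embs, dim=1) @ rel / temperature    # (N,) similarities
    # maximize similarity to prepositions relative to the full candidate set
    return -torch.logsumexp(pos, 0) + torch.logsumexp(torch.cat([pos, neg]), 0)

loss = relation_steering_loss(torch.randn(768, requires_grad=True),
                              torch.randn(12, 768), torch.randn(200, 768))
```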
https://arxiv.org/abs/2303.13495
Attention is the crucial cognitive ability that limits and selects what information we observe. Previous work by Bolander et al. (2016) proposes a model of attention based on dynamic epistemic logic (DEL) where agents are either fully attentive or not attentive at all. While introducing the realistic feature that inattentive agents believe nothing happens, the model does not represent the most essential aspect of attention: its selectivity. Here, we propose a generalization that allows for paying attention to subsets of atomic formulas. We introduce the corresponding logic for propositional attention, and show its axiomatization to be sound and complete. We then extend the framework to account for inattentive agents that, instead of assuming nothing happens, may default to a specific truth-value of what they failed to attend to (a sort of prior concerning the unattended atoms). This feature allows for a more cognitively plausible representation of the inattentional blindness phenomenon, where agents end up with false beliefs due to their failure to attend to conspicuous but unexpected events. Both versions of the model define attention-based learning through appropriate DEL event models based on a few clear edge principles. While the size of such event models grows exponentially with both the number of agents and the number of atoms, we introduce a new logical language for describing event models syntactically and show that using this language our event models can be represented linearly in the number of agents and atoms. Furthermore, representing our event models using this language is achieved by a straightforward formalisation of the aforementioned edge principles.
https://arxiv.org/abs/2303.13494
Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments. The use of demonstrations, on the other hand, enables agents to benefit from expert knowledge rather than having to discover the best action to take through exploration. In this survey, we discuss the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision making paradigms (for example, reinforcement learning and planning in the learned models), and how to collect the demonstrations in various scenarios. Additionally, we exemplify a practical pipeline for generating and utilizing demonstrations in the recently proposed ManiSkill robot learning benchmark.
https://arxiv.org/abs/2303.13489
Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. This modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
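To illustrate what a functional module for a high-arity relation might look like (purely illustrative; NS3D's actual program syntax and module designs are described in the paper), here is a ternary relation scorer that consumes features of a target object and two anchor objects.

```python
import torch
import torch.nn as nn

class TernaryRelationModule(nn.Module):
    """Score a relation among three objects, e.g. 'the chair *between* the table and the door'."""
    def __init__(self, obj_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * obj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, target_feats, anchor1_feats, anchor2_feats):
        # each input: (N, obj_dim) candidate object features; returns one score per candidate
        joint = torch.cat([target_feats, anchor1_feats, anchor2_feats], dim=-1)
        return self.mlp(joint).squeeze(-1)

scores = TernaryRelationModule()(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```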
https://arxiv.org/abs/2303.13483
We study the problem of object retrieval in scenarios where visual sensing is absent, object shapes are unknown beforehand and objects can move freely, like grabbing objects out of a drawer. Successful solutions require localizing free objects, identifying specific object instances, and then grasping the identified objects, only using touch feedback. Unlike vision, where cameras can observe the entire scene, touch sensors are local and only observe parts of the scene that are in contact with the manipulator. Moreover, information gathering via touch sensors necessitates applying forces on the touched surface which may disturb the scene itself. Reasoning with touch, therefore, requires careful exploration and integration of information over time -- a challenge we tackle. We present a system capable of using sparse tactile feedback from fingertip touch sensors on a dexterous hand to localize, identify and grasp novel objects without any visual feedback. Videos are available at this https URL.
https://arxiv.org/abs/2303.13482