We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills, and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches, which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy and allows following human plans for generic manipulation tasks in a zero-shot manner, with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. this https URL
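To make the factorization concrete, a minimal sketch of the zero-shot deployment loop is given below, assuming hypothetical `HumanPlanPredictor` and `PlanConditionedPolicy` interfaces (the names, signatures, and environment API are illustrative, not the authors' code): the plan predictor proposes future hand/object configurations from the current and goal images, and the plan-conditioned policy translates that plan into robot actions with no deployment-time training.

```python
# Illustrative sketch of the factorized deployment loop. Module names and
# interfaces are hypothetical stand-ins, not the paper's implementation.
import numpy as np

class HumanPlanPredictor:
    """Predicts future hand and object configurations from (current, goal) images."""
    def predict(self, current_image: np.ndarray, goal_image: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # learned from large-scale passive human videos

class PlanConditionedPolicy:
    """Maps the current observation and a human plan to a robot action."""
    def act(self, observation: np.ndarray, plan: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # learned from a small amount of in-domain robot data

def run_episode(env, plan_predictor, policy, goal_image, horizon=50):
    """Zero-shot rollout: predict a human plan, then translate it to robot actions."""
    obs = env.reset()  # assumed dict with an "image" key
    for _ in range(horizon):
        plan = plan_predictor.predict(obs["image"], goal_image)  # human plan
        action = policy.act(obs["image"], plan)                  # embodiment translation
        obs, done = env.step(action)
        if done:
            break
    return obs
```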
https://arxiv.org/abs/2312.00775
Vision-language pre-training such as CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while their lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. First, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Second, a relaxed bipartite-matching-based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
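As one reading of the softened instance-level objective, the sketch below computes a CLIP-style contrastive loss with soft targets that move a small amount of probability mass from the positive pair onto the negatives; the fixed `soften` coefficient and symmetric formulation are illustrative assumptions (the paper softens the labels progressively during training).

```python
import torch
import torch.nn.functional as F

def soft_instance_alignment_loss(img_emb, txt_emb, temperature=0.07, soften=0.1):
    """Instance-level image-text alignment with softened negative labels.

    `soften` would typically be increased on a schedule during training; here it
    is a fixed illustrative value.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    batch = logits.size(0)
    # One-hot targets relaxed by distributing `soften` mass over the negatives.
    targets = torch.full_like(logits, soften / (batch - 1))
    targets.fill_diagonal_(1.0 - soften)
    loss_i2t = torch.sum(-targets * F.log_softmax(logits, dim=1), dim=1).mean()
    loss_t2i = torch.sum(-targets * F.log_softmax(logits.t(), dim=1), dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```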
https://arxiv.org/abs/2312.00674
Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visual domains of visible (VIS) and near-infrared (NIR). While much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level, our work deviates from this trend. We observe that large neural networks, unlike their smaller counterparts, when pre-trained on large-scale homogeneous VIS data, demonstrate exceptional zero-shot performance in HFR, suggesting that the domain gap might be less pronounced than previously believed. By approaching the HFR problem as one of low-data fine-tuning, we introduce a straightforward framework: comprehensive pre-training, followed by a regularized fine-tuning strategy, that matches or surpasses the current state-of-the-art on four publicly available benchmarks. Corresponding code can be found at this https URL.
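The abstract does not specify the regularizer, so purely as an illustration, one common form of regularized fine-tuning penalizes drift from the pre-trained weights (L2-SP style); the sketch below shows that assumed variant, not necessarily the paper's choice.

```python
import torch

def l2_sp_penalty(model, pretrained_state, weight=1e-3):
    """Regularize fine-tuning by penalizing deviation from the pre-trained weights.

    L2-SP-style penalty shown purely for illustration; the paper's exact
    regularization strategy may differ.
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:
            ref = pretrained_state[name].to(param.device)
            penalty = penalty + ((param - ref) ** 2).sum()
    return weight * penalty
```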
https://arxiv.org/abs/2312.00627
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions. Despite overwhelming progress, it remains challenging for current approaches to perform well on cases with varied text expressions or with unseen visual entities, limiting further application. In this paper, we present a novel RIS approach which substantially improves generalization ability by addressing the two dilemmas mentioned above. Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capture in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherence, handling the incomplete target masks and false-positive irregular clumps that often appear on unseen visual entities. Extensive experiments are conducted in zero-shot cross-dataset settings, and the proposed approach achieves consistent gains over the state-of-the-art, e.g., 4.15%, 5.45%, and 4.64% mIoU increases on RefCOCO, RefCOCO+, and ReferIt respectively, demonstrating its effectiveness. Additionally, results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.
https://arxiv.org/abs/2312.00452
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This leads researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capability for learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that, with a simple query-image attention trick, VLMs generalize well to super images and demonstrate promising zero-shot performance against SOTA methods. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
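A minimal sketch of the super-image construction, assuming uniform temporal sampling of $N^2$ frames and a simple row-major grid layout (the paper's exact frame selection strategy may differ); the resulting grid can then be passed through the VLM image encoder once instead of $N^2$ times.

```python
import numpy as np

def make_super_image(frames, n=3):
    """Arrange video frames into an n x n grid ("super image").

    frames: sequence of H x W x C arrays. n*n frames are sampled uniformly over
    time, so one visual encoding replaces n*n, i.e. 1/n^2 of the encoding cost.
    """
    idx = np.linspace(0, len(frames) - 1, n * n).round().astype(int)
    picked = [frames[i] for i in idx]
    rows = [np.concatenate(picked[r * n:(r + 1) * n], axis=1) for r in range(n)]
    return np.concatenate(rows, axis=0)  # shape (n*H, n*W, C)
```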
https://arxiv.org/abs/2312.00414
Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK Biobank and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system and demonstrate remarkable performance across a range of tasks, including left ventricular ejection fraction regression and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can also be directed towards clinical problems of interest, yielding impressive, clinical-grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
https://arxiv.org/abs/2312.00357
Multi-object tracking in traffic videos is a crucial research area, offering immense potential for enhancing traffic monitoring accuracy and promoting road safety measures through the utilisation of advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes, which cannot well simulate the challenges encountered in complex traffic scenarios. To address this gap, we introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios. To validate the complexity and challenges presented by TrafficMOT, we conducted comprehensive empirical studies using three different settings: fully-supervised, semi-supervised, and a recent powerful zero-shot foundation model Tracking Anything Model (TAM). The experimental results highlight the inherent complexity of this dataset, emphasising its value in driving advancements in the field of traffic monitoring and multi-object tracking.
https://arxiv.org/abs/2311.18839
We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy that splits text-to-video generation into a two-stage process: text-to-image generation and image&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, which enhances preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See this https URL for video samples.
https://arxiv.org/abs/2311.18829
Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce that the one-step image generator matches the diffusion model at the distribution level by minimizing an approximate KL divergence whose gradient can be expressed as the difference between two score functions, one of the target distribution and the other of the synthetic distribution produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model can generate images at 20 FPS on modern hardware.
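In equation form (a sketch in our notation, with the noise-level weighting used in practice omitted), the gradient of the approximate KL that the one-step generator $G_\theta$ is trained to minimize is

\[
\nabla_\theta D_{\mathrm{KL}}\big(p_{\text{fake}} \,\|\, p_{\text{real}}\big) \;\approx\; \mathbb{E}_{z}\Big[\big(s_{\text{fake}}(G_\theta(z)) - s_{\text{real}}(G_\theta(z))\big)\,\frac{\partial G_\theta(z)}{\partial \theta}\Big], \qquad s_{\bullet}(x) = \nabla_x \log p_{\bullet}(x),
\]

where $s_{\text{real}}$ and $s_{\text{fake}}$ are the two score functions parameterized by the separately trained diffusion models; this gradient is combined with the regression loss on multi-step diffusion outputs described above.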
https://arxiv.org/abs/2311.18828
We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.
https://arxiv.org/abs/2311.18775
Visual-language pre-training (VLP) has achieved remarkable success on multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple, utilizing MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing" to keep the lengths of extended captions identical to the originals. In image-text retrieval, our method consistently obtains 5.6~35.0% and 16.8~46.1% improvements in R@1 under the fine-tuning and zero-shot settings, respectively. Notably, our zero-shot results are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.
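A minimal sketch of the "text shearing" step, assuming length is measured in whitespace-separated words (the actual unit, e.g. tokens, is an assumption here): the MLLM-extended caption is simply trimmed back to the original caption's length, limiting hallucination- and style-induced bias.

```python
def text_shear(original_caption: str, extended_caption: str) -> str:
    """Trim an MLLM-extended caption so its length matches the original caption.

    Length is counted in whitespace-separated words here; the paper may use a
    different unit, so treat this as an illustrative assumption.
    """
    target_len = len(original_caption.split())
    return " ".join(extended_caption.split()[:target_len])

# Example: only the first len(original) words of the extended caption are kept.
print(text_shear("a dog on grass",
                 "a small brown dog running on green grass near a fence"))
# -> "a small brown dog"
```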
https://arxiv.org/abs/2311.18765
Language model agents (LMAs) recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite the promise, their performance on real-world applications, which often involve combinations of tasks, is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing the data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%). While these results highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance further degrades under instruction compositions that change the combinational order. In contrast to the recent remarkable success of LMAs, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.
https://arxiv.org/abs/2311.18751
Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both depth-scale ambiguity in monocular perception and inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
https://arxiv.org/abs/2311.18610
With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS), an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements of the original image, a crucial aspect of image editing. Inspired by the similarities and differences between DDS and contrastive learning for unpaired image-to-image translation (CUT), we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining controllability of contents, we introduce a straightforward approach to regulate structural consistency using a CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of the LDM, in particular those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. The project page with code is available at this https URL.
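Under our reading, the CUT-style regularization amounts to a PatchNCE-type contrastive loss computed on spatially aligned self-attention features of the LDM; the patch sampling, temperature, and feature shapes below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_src, feat_tgt, num_patches=256, temperature=0.07):
    """PatchNCE-style contrastive loss between spatially aligned feature maps.

    feat_src, feat_tgt: (B, C, H, W) self-attention features from the source and
    edited branches. Patches at the same spatial location form positive pairs;
    patches at other locations in the same image serve as negatives.
    """
    b, c, h, w = feat_src.shape
    src = feat_src.flatten(2).transpose(1, 2)   # (B, H*W, C)
    tgt = feat_tgt.flatten(2).transpose(1, 2)
    idx = torch.randperm(h * w, device=feat_src.device)[:num_patches]
    src = F.normalize(src[:, idx], dim=-1)      # (B, P, C)
    tgt = F.normalize(tgt[:, idx], dim=-1)
    logits = torch.bmm(tgt, src.transpose(1, 2)) / temperature  # (B, P, P)
    labels = torch.arange(logits.size(1), device=logits.device).expand(b, -1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```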
https://arxiv.org/abs/2311.18608
Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners, while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which forms the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
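The "optimal matching flow" between frame-level representations and descriptors suggests an optimal-transport formulation; purely as an illustration, the sketch below solves an entropic optimal-transport problem with Sinkhorn iterations and uniform marginals (the paper's actual solver and marginal choices are assumptions here).

```python
import torch

def sinkhorn_matching(frame_feats, desc_feats, eps=0.05, iters=50):
    """Entropic optimal transport between T frame features and K descriptor features.

    frame_feats: (T, C), desc_feats: (K, C), both assumed L2-normalized. Returns a
    (T, K) transport plan whose entries weight descriptor relevance per frame;
    uniform marginals are an illustrative assumption.
    """
    cost = 1.0 - frame_feats @ desc_feats.t()                 # cosine distance
    K = torch.exp(-cost / eps)                                # Gibbs kernel
    t, k = K.shape
    u = torch.full((t,), 1.0 / t, device=K.device, dtype=K.dtype)
    v = torch.full((k,), 1.0 / k, device=K.device, dtype=K.dtype)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]                        # transport plan
```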
https://arxiv.org/abs/2312.00096
In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results.
https://arxiv.org/abs/2311.18491
This paper introduces innovative solutions to enhance spatial controllability in diffusion models that rely on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance, a spatial layout condition, acts as a clue in the perturbed distribution, greatly narrowing down the search space and focusing the image sampling process on adhering to the spatial layout condition. The LRDiff framework constructs an image-rendering process with multiple layers, each of which applies the vision guidance to estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches, while allowing for more coherent and contextually accurate image synthesis. The proposed method provides a more efficient and accurate means of synthesising images that align with specific spatial and contextual requirements. We demonstrate through our experiments that our method provides better results than existing techniques both quantitatively and qualitatively. We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image, and image editing.
https://arxiv.org/abs/2311.18435
Large-scale pre-trained models have demonstrated impressive performance on vision and language tasks in open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pre-trained language-image models are not confident enough when generalizing to 3D shape recognition. Consequently, this paper aims to improve that confidence with view selection and hierarchical prompts. Taking the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, a strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.
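View selection on the vision side could be sketched as follows: score each rendered view against the class prompts, keep the most confident views, and aggregate them for the final prediction. Here `encode_image` is a stand-in for a CLIP-style image encoder, and using the maximum softmax probability as the confidence measure, along with mean-pooling the selected views, are our assumptions.

```python
import torch

def select_confident_views(view_images, class_text_feats, encode_image,
                           top_k=4, temperature=0.01):
    """Pick the rendered views of a 3D shape whose predictions are most confident.

    view_images: list of preprocessed view tensors; class_text_feats: (C, D)
    L2-normalized text embeddings of the class prompts; encode_image: callable
    returning a (D,) image embedding. Confidence is the max softmax probability
    over classes (an illustrative assumption).
    """
    confidences, feats = [], []
    for img in view_images:
        f = encode_image(img)
        f = f / f.norm()
        probs = (f @ class_text_feats.t() / temperature).softmax(dim=-1)
        confidences.append(probs.max().item())
        feats.append(f)
    order = sorted(range(len(view_images)), key=lambda i: confidences[i], reverse=True)
    keep = order[:top_k]
    fused = torch.stack([feats[i] for i in keep]).mean(dim=0)  # aggregate selected views
    logits = fused @ class_text_feats.t()
    return logits.argmax().item(), keep
```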
https://arxiv.org/abs/2311.18402
Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AV), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. Recently, with the advent of large language models (LLMs), an additional desirable feature for traffic models is LLM compatibility. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.
https://arxiv.org/abs/2311.18307
Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs for fine-grained understanding, achieving significant improvements on SPEC without compromising their zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.
https://arxiv.org/abs/2312.00081