Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments. The use of demonstrations, on the other hand, enables agents to benefit from expert knowledge rather than having to discover the best action to take through exploration. In this survey, we discuss the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision-making paradigms (for example, reinforcement learning and planning in learned models), and how to collect demonstrations in various scenarios. Additionally, we illustrate a practical pipeline for generating and utilizing demonstrations in the recently proposed ManiSkill robot learning benchmark.
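One of the simplest ways to exploit demonstrations covered by surveys of this kind is behavior cloning, i.e., supervised learning of the expert's actions. Below is a minimal sketch in which randomly generated (state, action) pairs stand in for a real demonstration dataset; the network sizes and training loop are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration buffer: states and the expert actions taken in them.
states = torch.randn(1024, 16)          # 1024 demo states, 16-dim observations
actions = torch.randint(0, 4, (1024,))  # 4 discrete expert actions

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavior cloning: fit the policy to the expert actions with a supervised loss.
for epoch in range(10):
    logits = policy(states)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```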
https://arxiv.org/abs/2303.13489
The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective (e.g., CLIP), an image-to-text generative objective (e.g., PaLI), or a text-to-image generative objective (e.g., Parti). However, the three objectives can be pre-trained on the same data, image-text pairs, and they intuitively complement each other: contrasting provides global alignment capacity, while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
https://arxiv.org/abs/2303.13455
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: this https URL.
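As a rough illustration of the final step of this pipeline, here is a sketch that greedily merges over-segments whose CLIP-space embeddings are similar; the threshold and the greedy centroid rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def merge_similar_segments(seg_embs, threshold=0.9):
    """Greedily merge over-segments whose CLIP-space embeddings are similar.

    seg_embs: (N, D) tensor of per-segment embeddings.
    Returns a list mapping each segment index to a cluster id.
    """
    embs = F.normalize(seg_embs, dim=-1)
    cluster_of = [-1] * embs.size(0)
    centroids = []
    for i in range(embs.size(0)):
        if centroids:
            sims = torch.stack(centroids) @ embs[i]   # cosine sims to clusters
            best = int(torch.argmax(sims))
            if sims[best] >= threshold:
                cluster_of[i] = best
                continue
        centroids.append(embs[i])                     # start a new cluster
        cluster_of[i] = len(centroids) - 1
    return cluster_of
```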
https://arxiv.org/abs/2303.13396
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM on unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation, which enables data augmentation for self-finetuning, and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
https://arxiv.org/abs/2303.13386
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an online reference for bibliographic information on major computer science publications that indexes over 4.4 million publications by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. DBLP-QuAD is the largest scholarly question answering dataset.
https://arxiv.org/abs/2303.13351
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
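To make the grounding step concrete, here is a toy sketch of disambiguation with truncated KG embeddings: given candidate entity IDs retrieved for a predicted label, pick the one whose truncated KG embedding best matches the embedding fragment decoded by the model. All names, dimensions, and the cosine-similarity rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ground_entity(predicted_trunc_emb, candidate_ids, kg_embeddings, k=10):
    """Pick the candidate whose truncated KG embedding best matches the one
    decoded by the text-to-text model (illustrative disambiguation step).

    predicted_trunc_emb: truncated embedding produced alongside the label.
    candidate_ids: entity IDs retrieved by label lookup.
    kg_embeddings: dict mapping entity ID -> full KG embedding vector.
    """
    pred = np.asarray(predicted_trunc_emb)[:k]
    best_id, best_score = None, -np.inf
    for ent_id in candidate_ids:
        cand = np.asarray(kg_embeddings[ent_id])[:k]
        score = float(pred @ cand /
                      (np.linalg.norm(pred) * np.linalg.norm(cand) + 1e-8))
        if score > best_score:
            best_id, best_score = ent_id, score
    return best_id
```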
https://arxiv.org/abs/2303.13284
Prompt tuning is an effective way to adapt a pre-trained visual-language model (VLM) to downstream tasks using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, the specific textual knowledge generalizes worse to unseen classes because it forgets the essential general textual knowledge, which has a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. The key insight of KgCoOp is that forgetting of essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. Specifically, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding KgCoOp on top of the contrastive loss yields a prompt that is discriminative for both seen and unseen tasks. Extensive evaluation on several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, \emph{i.e.,} it achieves better performance with less training time.
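A minimal sketch of this kind of objective, assuming we already have per-class text embeddings from the learnable prompts and from fixed hand-crafted prompts; the cosine discrepancy and the weight lam are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kgcoop_loss(logits, labels, learned_text_emb, handcrafted_text_emb, lam=8.0):
    """Contrastive (cross-entropy) loss plus a penalty that keeps the learnable
    prompt's class embeddings close to the hand-crafted prompt's embeddings."""
    ce = F.cross_entropy(logits, labels)
    learned = F.normalize(learned_text_emb, dim=-1)        # (num_classes, D)
    handcrafted = F.normalize(handcrafted_text_emb, dim=-1)
    kg = (1.0 - (learned * handcrafted).sum(dim=-1)).mean()  # cosine discrepancy
    return ce + lam * kg
```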
https://arxiv.org/abs/2303.13283
Parameter-efficient transfer learning with adapters has been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottleneck layers added between transformer layers while keeping the large pretrained language model (PLM) frozen. In spite of showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retrievers or in a cross-lingual retrieval scenario, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by fine-tuning, but are memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE not only optimizes just 2\% of the training parameters, but outperforms its fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval with adapters on cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first-stage rankers. Overall, our study completes the examination of adapters for neural IR.
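For reference, a generic bottleneck adapter of the kind discussed here, as it is typically inserted after a transformer sub-layer while the backbone stays frozen. This is a sketch of the general technique, not the exact SPLADE integration; the hidden and bottleneck sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck layer with a residual connection; only these weights
    are trained while the surrounding transformer stays frozen."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection around the bottleneck transformation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```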
https://arxiv.org/abs/2303.13220
Knowledge distillation is a popular technique for transferring the knowledge from a large teacher model to a smaller student model by mimicking. However, distillation by directly aligning the feature maps between teacher and student may enforce overly strict constraints on the student and thus degrade the performance of the student model. To alleviate the above feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student with pixel-wise transformations. In this paper, we find that aligning the feature maps between teacher and student along the channel-wise dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Based on it, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements in various computer vision tasks including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), which demonstrates the effectiveness and the versatility of the proposed method. The code will be made publicly available.
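The overall recipe can be sketched as follows: a small learnable transformation maps the student's channels before the features are matched against the teacher's, and a single hyper-parameter balances distillation against the task loss. The 1x1-convolution design and the MSE matching below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelTransform(nn.Module):
    """Learnable nonlinear channel-wise transformation for student features."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=1),
        )

    def forward(self, feat_s):
        return self.mlp(feat_s)

def total_loss(feat_s, feat_t, transform, task_loss, alpha=1.0):
    """Total loss = task-specific loss + alpha * feature distillation loss."""
    kd = F.mse_loss(transform(feat_s), feat_t.detach())
    return task_loss + alpha * kd
```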
https://arxiv.org/abs/2303.13212
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented, due to the inherently biased distribution of the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
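A toy sketch of the decoupling idea: a shared relation feature feeds two separate classifiers for actional and spatial patterns, and a predicate such as sit_above is scored by recombining its two pattern scores. The mapping table and the log-probability combination below are illustrative assumptions, not the paper's exact head design.

```python
import torch
import torch.nn as nn

class DecoupledPredicateHead(nn.Module):
    """Separate classifiers for actional and spatial patterns; each predicate
    is scored by combining the scores of its two patterns."""
    def __init__(self, feat_dim, num_action, num_spatial, predicate_to_patterns):
        super().__init__()
        self.action_cls = nn.Linear(feat_dim, num_action)
        self.spatial_cls = nn.Linear(feat_dim, num_spatial)
        # predicate_to_patterns: list of (action_idx, spatial_idx) per predicate,
        # e.g. sit_above -> (index of "sit", index of "above").
        self.predicate_to_patterns = predicate_to_patterns

    def forward(self, feat):
        p_action = self.action_cls(feat).log_softmax(dim=-1)
        p_spatial = self.spatial_cls(feat).log_softmax(dim=-1)
        scores = [p_action[:, a] + p_spatial[:, s]
                  for a, s in self.predicate_to_patterns]
        return torch.stack(scores, dim=-1)   # (batch, num_predicates)
```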
https://arxiv.org/abs/2303.13209
Conventional replay-based approaches to continual learning (CL) require, for each learning phase with new data, the replay of samples representing all of the previously learned knowledge in order to avoid catastrophic forgetting. Since the amount of learned knowledge grows over time in CL problems, generative replay spends an increasing amount of time just re-learning what is already known. In this proof-of-concept study, we propose a replay-based CL strategy that we term adiabatic replay (AR), which derives its efficiency from the (reasonable) assumption that each new learning phase is adiabatic, i.e., represents only a small addition to existing knowledge. Each new learning phase triggers a sampling process that selectively replays, from the body of existing knowledge, just those samples that are similar to the new data, in contrast to replaying all of it. Complete replay is not required since AR represents the data distribution by GMMs, which are capable of selectively updating their internal representation only where data statistics have changed. As long as additions are adiabatic, the number of samples to be replayed need not depend on the amount of previously acquired knowledge at all. We verify experimentally that AR is superior to state-of-the-art deep generative replay using VAEs.
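A rough sketch of selective replay from a GMM memory, in the spirit described above: only components that the new data activates contribute replayed samples, instead of replaying the entire past. The responsibility-based selection rule and the sample counts are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def selective_replay_samples(gmm: GaussianMixture, new_data, n_replay=256):
    """Replay only from GMM components that are responsible for the new data,
    instead of replaying the whole body of past knowledge."""
    resp = gmm.predict_proba(new_data)                     # (N, K) responsibilities
    active = np.where(resp.mean(axis=0) >= 1.0 / resp.shape[1])[0]
    weights = gmm.weights_[active] / gmm.weights_[active].sum()
    counts = np.random.multinomial(n_replay, weights)      # samples per component
    samples = [np.random.multivariate_normal(gmm.means_[k],
                                              gmm.covariances_[k], size=c)
               for k, c in zip(active, counts) if c > 0]
    return np.concatenate(samples, axis=0)
```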
https://arxiv.org/abs/2303.13157
Multiple instance learning (MIL) has emerged as a popular method for classifying histopathology whole slide images (WSIs). However, existing approaches typically rely on pre-trained models from large natural image datasets, such as ImageNet, to generate instance features, which can be sub-optimal due to the significant differences between natural images and histopathology images that lead to a domain shift. In this paper, we present a novel, simple yet effective method for learning domain-specific knowledge transformation from pre-trained models to histopathology images. Our approach entails using a prompt component to assist the pre-trained model in discerning differences between the pre-trained dataset and the target histopathology dataset, resulting in improved performance of MIL models. We validate our method on two publicly available datasets, Camelyon16 and TCGA-NSCLC. Extensive experimental results demonstrate the significant performance improvement of our method for different MIL models and backbones. Upon publication of this paper, we will release the source code for our method.
https://arxiv.org/abs/2303.13122
In a Task Oriented Dialogue (TOD) system, detecting and inducing new intents are two main challenges in applying the system in the real world. In this paper, we suggest the semantic multi-view model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantics. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning knowledge from multiple domains. We introduce a novel method, PGT, which employs a Siamese network to fine-tune the model directly with a clustering method. Our model can learn how to cluster dialogue utterances by using PGT. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves Open Intent Induction performance compared to baseline systems.
https://arxiv.org/abs/2303.13099
Spiking neural networks (SNNs) have rich spatial-temporal dynamics, which are suitable for processing neuromorphic, event-based data. However, event-based datasets are usually less annotated than the static datasets used in traditional deep learning. The small data scale makes SNNs prone to overfitting and limits their performance. To enhance the generalizability of SNNs on event-based datasets, we propose a knowledge-transfer framework that leverages static images to assist in the training on neuromorphic datasets. Our method introduces a domain loss and a semantic loss to exploit both domain-invariant and unique features of these two domains, providing SNNs with more generalized knowledge for subsequent targeted training on neuromorphic data. Specifically, the domain loss aligns the feature space and aims to capture common features between static and event-based images, while the semantic loss emphasizes that the differences between samples from different categories should be as large as possible. Experimental results demonstrate that our method outperforms existing methods on all mainstream neuromorphic vision datasets. In particular, we achieve significant performance improvements of 2.7\% and 9.8\% when using only 10\% of the training data of the CIFAR10-DVS and N-Caltech 101 datasets, respectively.
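A simplified sketch of the two auxiliary losses on feature embeddings from the static and event branches; the concrete formulations below (mean matching and a pairwise margin) are illustrative stand-ins for "align the domains, separate the classes", not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def domain_loss(static_feats, event_feats):
    """Pull the two domains' feature statistics together (simple mean matching)."""
    return F.mse_loss(static_feats.mean(dim=0), event_feats.mean(dim=0))

def semantic_loss(feats, labels, margin=1.0):
    """Push samples of different categories apart by at least a margin."""
    dists = torch.cdist(feats, feats)                        # pairwise distances
    diff_class = labels.unsqueeze(0) != labels.unsqueeze(1)  # (N, N) mask
    penalties = F.relu(margin - dists)[diff_class]
    return penalties.mean() if penalties.numel() > 0 else feats.new_zeros(())
```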
https://arxiv.org/abs/2303.13077
Synthesis and reconstruction of 3D human heads have gained increasing interest in computer vision and computer graphics recently. Existing state-of-the-art 3D generative adversarial networks (GANs) for 3D human head synthesis are either limited to near-frontal views or struggle to preserve 3D consistency at large view angles. We propose PanoHead, the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in $360^\circ$ with diverse appearance and detailed geometry using only in-the-wild unstructured images for training. At its core, we lift up the representation power of recent 3D GANs and bridge the data alignment gap when training from in-the-wild images with widely distributed views. Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively addresses the front-face and back-head feature entanglement rooted in the widely adopted tri-plane formulation. Our method instills prior knowledge of 2D image segmentation into the adversarial learning of 3D neural scene structures, enabling compositable head synthesis against diverse backgrounds. Benefiting from these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long wavy and afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct full 3D heads from single input images for personalized realistic 3D avatars.
https://arxiv.org/abs/2303.13071
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.
https://arxiv.org/abs/2303.13040
Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when using coarse tokens due to the limited available computational power. In this paper, we propose an online Mono3D framework, called MonoATT, which leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes to facilitate mobile Mono3D. The core idea of MonoATT is to adaptively assign finer tokens to areas of more significance before utilizing a transformer to enhance Mono3D. To this end, we first use prior knowledge to design a scoring network for selecting the most important areas of the image, and then propose a token clustering and merging network with an attention mechanism to gradually merge tokens around the selected areas in multiple stages. Finally, a pixel-level feature map is reconstructed from heterogeneous tokens before employing a SOTA Mono3D detector as the underlying detection core. Experiment results on the real-world KITTI dataset demonstrate that MonoATT can effectively improve the Mono3D accuracy for both near and far objects and guarantee low latency. MonoATT yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.
https://arxiv.org/abs/2303.13018
Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to provide the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and non-target classes, named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that its non-target loss forces the student's non-target logits to match the teacher's, but the sums of the two sets of non-target logits differ, preventing them from being identical. NKD normalizes the non-target logits to equalize their sums. It can be generally used in KD and self-KD to better exploit the soft labels for the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the student's target logit as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels following Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at this https URL.
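A sketch of the normalization idea behind NKD as read from the abstract: the non-target probabilities of teacher and student are renormalized to sum to one before being matched, alongside a target-class term. This is an illustrative interpretation, not the official implementation or its exact weighting.

```python
import torch
import torch.nn.functional as F

def nkd_style_loss(student_logits, teacher_logits, target, temperature=1.0):
    """Target part: match the teacher's target probability.
    Non-target part: match the teacher's non-target distribution after
    normalizing both distributions to sum to one."""
    s = F.softmax(student_logits / temperature, dim=1)
    t = F.softmax(teacher_logits / temperature, dim=1)
    idx = target.unsqueeze(1)
    s_target, t_target = s.gather(1, idx), t.gather(1, idx)

    # Target-class term: cross-entropy against the teacher's soft target prob.
    target_term = -(t_target * torch.log(s_target + 1e-8)).mean()

    # Non-target terms, each renormalized so the distribution sums to 1.
    s_non = s.scatter(1, idx, 0.0)
    t_non = t.scatter(1, idx, 0.0)
    s_non = s_non / (s_non.sum(dim=1, keepdim=True) + 1e-8)
    t_non = t_non / (t_non.sum(dim=1, keepdim=True) + 1e-8)
    non_target_term = F.kl_div(torch.log(s_non + 1e-8), t_non,
                               reduction="batchmean")

    return target_term + non_target_term
```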
https://arxiv.org/abs/2303.13005
New knowledge originates from the old. The various types of elements deposited in the training history are a rich source of wealth for learning better deep models. In this survey, we comprehensively review and summarize the topic ``Historical Learning: Learning Models with Learning History'', which learns better neural models with the help of their learning history during optimization, from three detailed aspects: Historical Type (what), Functional Part (where) and Storage Form (how). To the best of our knowledge, this is the first survey that systematically studies the methodologies that make use of various historical statistics when training deep neural networks. We also discuss related topics such as recurrent/memory networks, ensemble learning, and reinforcement learning. Finally, we expose future challenges of this topic and encourage the community to consider historical learning principles when designing algorithms. The paper list related to historical learning is available at \url{this https URL}.
https://arxiv.org/abs/2303.12992
This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification and sentiment analysis of product reviews. A "cross-context" setting is enabled using test sets that are distinct from the training data. Specifically, in the news classification task, the models are developed on local news from India and tested on local news from China. In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews. This comparison is aimed at exploring the limits of the representational power of today's Natural Language Processing systems on the path to systems that generalize to real-life scenarios. The models are fine-tuned and fed into a Feed-Forward Neural Network and a Bidirectional Long Short Term Memory network. Multinomial Naive Bayes and Linear Support Vector Machine are used as traditional baselines. The results show that, in binary text classification, DistilBERT is significantly better than ELMo at generalizing to the cross-context setting. ELMo is observed to be significantly more robust to the cross-context test data than both baselines. On the other hand, the baselines performed comparably well to ELMo when the training and test data are subsets of the same corpus (no cross-context). DistilBERT is also found to be 30% smaller and 83% faster than ELMo. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also more favorable for incorporation into real-life systems, as it requires a smaller computational training budget. When generalization is not the utmost preference and the test domain is similar to the training domain, traditional ML algorithms can still be considered more economical alternatives to deep language representations.
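The cross-context protocol itself is easy to reproduce with the traditional baselines mentioned above: fit on one corpus, evaluate on a held-out corpus from a different context. Below is a minimal sketch with placeholder texts and labels (the real study uses separate protest-news corpora from India and China).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Placeholder data: train on one context, test on texts from a different one.
train_texts, train_labels = ["protest in city ...", "sports results ..."], [1, 0]
test_texts, test_labels = ["demonstration held ...", "market update ..."], [1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)      # reuse the training vocabulary

clf = MultinomialNB().fit(X_train, train_labels)
print("cross-context F1:", f1_score(test_labels, clf.predict(X_test)))
```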
https://arxiv.org/abs/2303.12936