We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models on downstream tasks in a parameter-efficient way with built-in interpretability. GIFT is a deep parameter-residual learning method that addresses two problems in fine-tuning a pretrained Transformer model: where to apply parameter-efficient fine-tuning (PEFT) so that it is extremely lightweight yet sufficiently expressive, and how to learn the PEFT so that it exploits the knowledge of the pretrained model more directly. For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to prior art that directly introduces new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. GIFT is a hyper-Transformer that takes as input the pretrained parameters of the projection layer and generates its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at this https URL
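We propose GIFT (Generative Interpretable Fine-tuning Transformers), a method for fine-tuning pretrained (often large) Transformer models on downstream tasks in a parameter-efficient way. GIFT is a deep parameter-residual learning method that addresses two problems that arise when fine-tuning a pretrained Transformer: where to apply parameter-efficient fine-tuning (PEFT), and how to learn the PEFT by directly exploiting the knowledge of the pretrained model. For the first problem, we select the final projection (linear) layer of the multi-head self-attention in the Transformer and verify its effectiveness. For the second, unlike prior art, we propose a method that learns to generate the fine-tuning parameters. GIFT is a hyper-Transformer that takes the pretrained parameters of the projection layer as input and generates its fine-tuning parameters through the proposed Parameter-to-Cluster Attention (PaCa). PaCa yields a simple clustering-based forward explainer that plays the role of semantic segmentation at test time. In experiments, GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark, where it performs significantly better than prior art. Our code is available at this https URL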
https://arxiv.org/abs/2312.00700
Table Structure Recognition (TSR) aims at transforming unstructured table images into structured formats, such as HTML sequences. One type of popular solution is using detection models to detect components of a table, such as columns and rows, then applying a rule-based post-processing method to convert detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, these studies usually pay more attention to improving the detection performance, which does not necessarily lead to better performance regarding cell-level metrics, such as TEDS. Second, some solutions over-simplify the problem and can miss some critical information. Lastly, even though some studies defined the problem to detect more components to provide as much information as other types of solutions, these studies ignore the fact this problem definition is a multi-label detection because row, projected row header and column header can share identical bounding boxes. Besides, there is often a performance gap between two-stage and transformer-based detection models regarding the structure-only TEDS, even though they have similar performance regarding the COCO metrics. Therefore, we revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects for the success of a two-stage detection model for the TSR task, including the multi-class problem definition, the aspect ratio for anchor box generation, and the feature generation of the backbone network. We applied simple methods to improve these aspects of the Cascade R-CNN model, achieved state-of-the-art performance, and improved the baseline Cascade R-CNN model by 19.32%, 11.56% and 14.77% regarding the structure-only TEDS on SciTSR, FinTabNet, and PubTables1M datasets.
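Table structure recognition (TSR) aims to convert unstructured table images into structured formats such as HTML sequences. One popular solution uses detection models to detect table components, such as columns and rows, and then applies rule-based post-processing to convert the detection results into HTML sequences. However, existing detection-based studies usually have the following limitations. First, they tend to focus on improving detection performance, which does not necessarily lead to better results on cell-level metrics such as TEDS. Second, some solutions over-simplify the problem and may miss critical information. Finally, although some studies define the problem as detecting more components in order to provide as much information as possible, they overlook that this is a multi-label detection problem, because rows, projected row headers, and column headers can share identical bounding boxes. In addition, there is a performance gap between two-stage and Transformer-based detection models on the structure-only TEDS even when their performance on the COCO metrics is similar. We therefore revisit the limitations of existing detection-based work, compare two-stage and Transformer-based detection models, and identify the key design aspects behind a successful two-stage detection model for TSR: the multi-class problem definition, the aspect ratio used for anchor box generation, and the feature generation of the backbone network. We improve these aspects with simple methods, achieve state-of-the-art performance, and improve the baseline Cascade R-CNN model on the structure-only TEDS by 19.32%, 11.56%, and 14.77% on SciTSR, FinTabNet, and PubTables1M, respectively.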
https://arxiv.org/abs/2312.00699
To train well-performing generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis we want to give insights on how training synthetic data affects each layer and to give a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part. The results also show that no major difference in performance or similarity could be seen between frozen and unfrozen backbone.
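Training neural networks that perform and generalize well requires sufficiently large and diverse datasets. Collecting data while complying with privacy legislation is becoming increasingly difficult, and annotating large datasets is both resource-heavy and time-consuming. One way to overcome these difficulties is to use synthetic data, which is inherently scalable and can be annotated automatically. However, how training on synthetic data affects the layers of a neural network remains unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images of city environments and perform a similarity analysis with Centered Kernel Alignment (CKA) to explore the layer-wise effects of training on synthetic data. The analysis captures the architecture of the detector while showing both different and similar patterns across models. With this similarity analysis, we aim to provide insight into how training on synthetic data affects each layer and a better understanding of the inner workings of complex neural networks. The results show that, between a detector trained on real data and a detector trained on synthetic data, the largest similarity is in the early layers and the largest difference is in the head. The results also show no major difference in performance or similarity between a frozen and an unfrozen backbone.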
https://arxiv.org/abs/2312.00694
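The layer-wise comparison described in the entry above relies on Centered Kernel Alignment. The following is a minimal sketch of the linear variant of CKA only; the abstract does not say which kernel is used, so the choice of linear CKA, the function names, and the toy activation matrices are all assumptions made for illustration. Each matrix holds one row per input example and one column per feature of the layer being compared.

```python
import math

def center_columns(X):
    """Subtract the column mean from every entry of X (a list of rows)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for X (n x p) and Y (n x q)."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    total = 0.0
    for a in range(p):
        for b in range(q):
            s = sum(X[i][a] * Y[i][b] for i in range(n))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    Assumes neither representation is constant across examples,
    otherwise the denominator is zero.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator

# Hypothetical activations of one layer for four inputs (illustrative values).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
# Same representation with features swapped and rescaled by 3.
layer_b = [[6.0, 3.0], [3.0, 9.0], [15.0, 0.0], [12.0, 12.0]]
# A mostly unrelated one-dimensional representation.
layer_c = [[1.0], [0.0], [0.0], [1.0]]

print(linear_cka(layer_a, layer_b))
print(linear_cka(layer_a, layer_c))
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, which is why `layer_a` and `layer_b` score as identical even though their raw values differ. It also allows layers of different widths to be compared, as with `layer_c`.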
Developing and evaluating vision science methods requires robust and efficient tools for assessing their performance in various real-world scenarios. This study presents a novel virtual reality (VR) simulation tool that simulates real-world optical methods while giving the experimenter a high degree of experimental control. The tool incorporates an experiment controller that handles multiple conditions smoothly and easily, a generic eye-tracking controller that works with most common VR eye-trackers, a configurable defocus simulator, and a generic VR questionnaire loader for assessing participants' behavior in virtual reality. This VR-based simulation tool bridges the gap between theoretical and applied research on new optical methods, corrections, and therapies. It enables vision scientists to extend their research toolkit with a robust, realistic, and fast research environment.
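Developing and evaluating vision science methods requires robust and efficient tools for assessing their performance in a variety of real-world scenarios. This study introduces a novel virtual reality (VR) simulation tool that simulates real-world optical methods while providing a high level of experimental control. The tool includes an experiment controller for handling multiple conditions smoothly, a generic eye-tracking controller compatible with most common VR eye-trackers, a configurable defocus simulator, and a generic VR questionnaire loader for assessing participants' behavior in virtual reality. The VR-based simulation tool bridges theoretical and applied research, allowing vision scientists to extend their research tools with a more realistic and faster research environment.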
https://arxiv.org/abs/2312.00692
We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, (iii) the object is imaged from two different viewpoints of two different scenes, and (iv) the object was not observed during the training phase. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from two distinct scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 39 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project website: this https URL.
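We introduce a new setting, open-vocabulary object 6D pose estimation, in which a textual prompt specifies the object of interest. In contrast to existing approaches, in our setting (1) the object of interest is specified solely through the textual prompt, (2) no object model (e.g. CAD or video sequence) is required at inference, (3) the object is imaged from two different viewpoints in two different scenes, and (4) the object was not observed during training. To operate in this setting, we introduce a novel approach that uses a Vision-Language Model to segment the object of interest and estimate its relative 6D pose. The key to our approach is a carefully designed strategy that fuses the object-level information provided by the prompt with local image features, producing a feature space that can generalize to novel concepts. We validate the approach on a new benchmark built on two popular datasets, REAL275 and Toyota-Light, which together contain 39 object instances. The results show that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project website: this https URL.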
https://arxiv.org/abs/2312.00690
The ability of generative models to accurately fit data distributions has resulted in their widespread adoption and success in fields such as computer vision and natural language processing. In this chapter, we provide a brief overview of the application of generative models in the domain of infrared (IR) image super-resolution, including a discussion of the various challenges and adversarial training methods employed. We propose potential areas for further investigation and advancement in the application of generative models for IR image super-resolution.
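The ability of generative models to fit data distributions accurately has led to their widespread adoption and success in fields such as computer vision and natural language processing. In this chapter, we give a brief overview of the application of generative models to infrared (IR) image super-resolution, including a discussion of the various challenges and of the adversarial training methods used. We propose potential areas for further research and development in applying generative models to IR image super-resolution.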
https://arxiv.org/abs/2312.00689
Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and neared state-of-the-art SpanBERT. A mixed quantum-classical model further improved these results, increasing the F1 score by around 6%.
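Guided by grammatical structure, words combine to form sentences; guided by discourse structure, sentences combine to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them by translating grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure, using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. Simulations run on IBMQ software reached an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and came close to the state-of-the-art SpanBERT. A mixed quantum-classical model improved these results further, raising the F1 score by around 6%.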
https://arxiv.org/abs/2312.00688
The neural architectures of language models are becoming increasingly complex, especially that of Transformers, based on the attention mechanism. Although their application to numerous natural language processing tasks has proven to be very fruitful, they continue to be models with little or no interpretability and explainability. One of the tasks for which they are best suited is the encoding of the contextual sense of words using contextualized embeddings. In this paper we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures for a given semantic task, namely the similarity calculation of word senses in context. The results obtained show that it is possible to be competitive with linguistically motivated models instead of using the black boxes underlying complex neural architectures.
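The neural architectures of language models are becoming increasingly complex, especially Transformers, which are based on the attention mechanism. Although their application to many natural language processing tasks has proven very fruitful, they remain models with little or no interpretability and explainability. One of the tasks for which they are best suited is encoding the contextual sense of words using contextualized embeddings. In this paper, we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and to semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures on a given semantic task, namely computing the similarity of word senses in context. The results show that linguistically motivated models can be competitive, without resorting to the black boxes underlying complex neural architectures.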
https://arxiv.org/abs/2312.00680
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains, reshaping the artificial general intelligence landscape. However, the increasing computational and memory demands of these models present substantial challenges, hindering both academic research and practical applications. To address these issues, a wide array of methods, including both algorithmic and hardware solutions, has been developed to enhance the efficiency of LLMs. This survey delivers a comprehensive review of algorithmic advancements aimed at improving LLM efficiency. Unlike other surveys that typically focus on specific areas such as training or model compression, this paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs. Specifically, it covers various topics related to efficiency, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. This paper aims to serve as a valuable resource for researchers and practitioners, laying the groundwork for future innovations in this critical research area. Our repository of relevant references is maintained at this https URL.
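The rapid growth of Large Language Models (LLMs) has been a driving force in transforming many domains and has reshaped the artificial general intelligence landscape. However, the growing computational and memory demands of these models pose substantial challenges that hinder both academic research and practical applications. To address these issues, a wide range of methods, including algorithmic and hardware solutions, has been developed to improve the efficiency of LLMs. This survey provides a comprehensive review of algorithmic advances aimed at improving LLM efficiency. Unlike other surveys, which usually focus on specific areas such as training or model compression, this paper examines the multi-faceted dimensions of efficiency that are essential to the end-to-end algorithmic development of LLMs. Specifically, it covers a range of efficiency-related topics, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. The paper aims to be a valuable resource for researchers and practitioners and to lay the groundwork for future innovation in this critical research area. Our repository of relevant references is maintained at this https URL.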
https://arxiv.org/abs/2312.00678
In recent studies on MRI reconstruction, advances have shown significant promise for further accelerating the MRI acquisition. Most state-of-the-art methods require a large amount of fully-sampled data to optimise reconstruction models, which is impractical and expensive under certain clinical settings. On the other hand, for unsupervised scan-specific reconstruction methods, overfitting is likely to happen due to insufficient supervision, while restrictions on acceleration rates and under-sampling patterns further limit their applicability. To this end, we propose an unsupervised, adaptive coarse-to-fine framework that enhances reconstruction quality without being constrained by the sparsity levels or patterns in under-sampling. The framework employs an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to their corresponding signal intensities. Moreover, we integrate a novel learning strategy that progressively refines the use of acquired k-space signals for self-supervision. This approach effectively adjusts the proportion of supervising signals from unevenly distributed information across different frequency bands, thus mitigating the issue of overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, has shown that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques, for up to 8-fold under-sampling.
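Recent studies on MRI reconstruction have shown significant promise for further accelerating MRI acquisition. Most state-of-the-art methods require large amounts of fully-sampled data to optimise reconstruction models, which is impractical and expensive in certain clinical settings. On the other hand, unsupervised scan-specific reconstruction methods are prone to overfitting because of insufficient supervision, and restrictions on acceleration rates and under-sampling patterns further limit their applicability. We therefore propose an unsupervised, adaptive coarse-to-fine framework that improves reconstruction quality without being constrained by the sparsity levels or patterns of under-sampling. The framework uses an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to the corresponding signal intensities. In addition, we integrate a novel learning strategy that progressively refines how the acquired k-space signals are used for self-supervision. This approach effectively adjusts the proportion of supervising signals drawn from information that is unevenly distributed across frequency bands, mitigating overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, shows that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques for up to 8-fold under-sampling.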
https://arxiv.org/abs/2312.00677
Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning-based approaches have been investigated for solving inverse problems including image reconstruction. However, these trained models encounter unseen data distributions that are widely shifted from training data during deployment. Therefore, it is essential to assess whether a given input falls within the training data distribution for diagnostic purposes. Uncertainty estimation approaches exist but focus on providing an uncertainty map to radiologists, rather than assessing the training distribution fit. In this work, we propose a method based on the local Lipschitz-based metric to distinguish out-of-distribution images from in-distribution with an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and mean absolute error (MAE), supported by a high Spearman's rank correlation coefficient of 0.8475, which determines the uncertainty estimation threshold for optimal model performance. Through the identification of false positives, the local Lipschitz and MAE relationship was used to guide data augmentation and reduce model uncertainty. Our study was validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our proposed approach with baseline methods: Monte-Carlo dropout and deep ensembles, and further analysis included MRI denoising and Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in the realm of medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images remains paramount.
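Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning approaches have been studied for solving inverse problems, including image reconstruction. However, once deployed, these trained models encounter unseen data distributions that are widely shifted from the training data. It is therefore essential, for diagnostic purposes, to assess whether a given input falls within the training data distribution. Uncertainty estimation methods exist, but they focus on providing an uncertainty map to radiologists rather than on assessing fit to the training distribution. In this work, we propose a method based on a local Lipschitz metric that distinguishes out-of-distribution images from in-distribution images with an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and the mean absolute error (MAE). By identifying false positives, we use the relationship between the local Lipschitz value and MAE to guide data augmentation and reduce model uncertainty. Our study is validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our approach with the baseline methods Monte-Carlo dropout and deep ensembles, and further analyse Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images is paramount.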
https://arxiv.org/abs/2305.07618
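The entry above scores inputs with a local Lipschitz-based metric. The abstract does not define the metric precisely, so the sketch below is only one common way to estimate a local Lipschitz value empirically: sample small random perturbations around an input and record the largest ratio of output change to input change. The sampling scheme, the radius, the number of samples, and every function name here are assumptions, and the toy functions stand in for a trained reconstruction network.

```python
import math
import random

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def local_lipschitz(f, x, radius=1e-2, n_samples=64, seed=0):
    """Largest observed ||f(x + d) - f(x)|| / ||d|| over random d with ||d|| = radius.

    This is a sampled lower bound on the true local Lipschitz constant,
    not an exact value.
    """
    rng = random.Random(seed)
    fx = f(x)
    best = 0.0
    for _ in range(n_samples):
        # Random direction, rescaled so the perturbation has norm `radius`.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        scale = radius / l2_norm(d)
        d = [scale * a for a in d]
        fxd = f([a + b for a, b in zip(x, d)])
        diff = l2_norm([p - q for p, q in zip(fxd, fx)])
        best = max(best, diff / l2_norm(d))
    return best

# Toy stand-ins for a model (hypothetical, for illustration only).
def triple(v):
    return [3.0 * a for a in v]

def square(v):
    return [a * a for a in v]

print(local_lipschitz(triple, [1.0, -2.0, 0.5]))  # close to 3 for this linear map
print(local_lipschitz(square, [2.0]))             # close to 4, the slope of x^2 at x = 2
```

In a setting like the one in the abstract, values of this kind would be computed per input and compared against a threshold; how that threshold is chosen is part of the paper's method and is not reproduced here.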
Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a token-level alignment objective based on relaxed bipartite matching is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
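Vision-language pre-training such as CLIP performs well on a variety of downstream tasks, for example zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight versions are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. First, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Second, we introduce a token-level alignment objective based on relaxed bipartite matching for fine-grained alignment between image patches and textual words. In addition, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of the text encoder increase, we introduce an extra masked language modeling (MLM) objective to make the most of the shortened text encoder. In practice, we propose an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages to enhance the MLM. Extensive experiments show that the proposed method achieves higher performance on multiple downstream tasks without introducing additional computational cost at inference.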
https://arxiv.org/abs/2312.00674
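The first objective in the entry above softens the labels of negative samples progressively. The abstract does not give the exact rule, so the sketch below illustrates the general idea under loudly stated assumptions: the soft target puts `1 - alpha` on the matched pair and spreads `alpha` uniformly over the negatives, `alpha` follows a linear ramp, and only the image-to-text direction of the loss is shown. The names `soft_targets`, `soft_contrastive_loss`, and `alpha_schedule` are hypothetical.

```python
import math

def soft_targets(n, alpha):
    """Row i puts 1 - alpha on the matched pair and alpha / (n - 1) on each negative."""
    off = alpha / (n - 1)
    return [[1.0 - alpha if i == j else off for j in range(n)] for i in range(n)]

def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def soft_contrastive_loss(logits, alpha):
    """Cross-entropy between softened targets and softmax over each row of logits."""
    n = len(logits)
    targets = soft_targets(n, alpha)
    loss = 0.0
    for t_row, l_row in zip(targets, logits):
        logp = log_softmax(l_row)
        loss -= sum(t * lp for t, lp in zip(t_row, logp))
    return loss / n

def alpha_schedule(step, total_steps, alpha_max=0.2):
    """Linear ramp from 0 to alpha_max (an assumed schedule, not taken from the paper)."""
    return alpha_max * min(1.0, step / total_steps)

# Image-text similarity logits for a batch of two pairs (illustrative values).
logits = [[5.0, 0.0], [0.0, 5.0]]
for step in (0, 50, 100):
    alpha = alpha_schedule(step, total_steps=100)
    print(step, alpha, soft_contrastive_loss(logits, alpha))
```

With `alpha = 0` this reduces to the usual one-hot contrastive cross-entropy. As `alpha` grows, confident separation from negatives is penalized more, which is the intended effect when some "negatives" are in fact partial matches.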
In recent years, several unsupervised cell segmentation methods have been presented, trying to omit the requirement of laborious pixel-level annotations for the training of a cell segmentation model. Most if not all of these methods handle the instance segmentation task by focusing on the detection of different cell instances ignoring their type. While such models prove adequate for certain tasks, like cell counting, other applications require the identification of each cell's type. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method enables the training of a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer can achieve competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's scalability and potential for broader applications in medical imaging, cellular biology, and diagnostics.
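In recent years, several unsupervised cell segmentation methods have been presented that aim to remove the need for laborious pixel-level annotations when training a cell segmentation model. Most of these methods handle the instance segmentation task by focusing on detecting different cell instances while ignoring their type. While such models are adequate for certain tasks, such as cell counting, other applications require identifying the type of each cell. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method makes it possible to train a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer achieves competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's potential for broad application in medical imaging, cellular biology, and diagnostics.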
https://arxiv.org/abs/2312.00671
Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: this https URL.
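Deep neural network models have made remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, the major bottleneck of current 3D recognition approaches is that they cannot recognize unseen novel classes beyond the training categories in diverse real-world applications. Meanwhile, state-of-the-art 3D scene understanding approaches mainly require high-quality labels to train neural networks and perform well only in a fully supervised manner. This work presents a general and simple framework for 3D scene understanding when labeled scenes are limited. To extract knowledge about novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that extracts and distills meaningful information from large-scale vision-language models, which benefits open-vocabulary scene understanding tasks. To exploit boundary information, we propose a novel energy-based loss with boundary awareness that benefits from region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All code, models, and data are publicly available at: this https URL.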
https://arxiv.org/abs/2312.00663
The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has led to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result the models perform poorly when the domain changes. Also, due to the model's scale, the cost of fine-tuning the model to the new domain is large. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.
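The current paradigm of large-scale pre-training and fine-tuning of Transformer large language models has brought significant improvements in natural language processing. However, such large models are prone to overfitting their training data, so they perform poorly when the domain changes. In addition, because of the scale of the models, fine-tuning them to a new domain is costly. The Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers and may address the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers and show that, with a proposed identity initialisation, existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.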
https://arxiv.org/abs/2312.00662
Purpose: To develop an efficient dual-domain reconstruction framework for multi-contrast MRI, with the focus on minimising cross-contrast misalignment in both the image and the frequency domains to enhance optimisation. Theory and Methods: Our proposed framework, based on deep learning, facilitates the optimisation for under-sampled target contrast using fully-sampled reference contrast that is quicker to acquire. The method consists of three key steps: 1) Learning to synthesise data resembling the target contrast from the reference contrast; 2) Registering the multi-contrast data to reduce inter-scan motion; and 3) Utilising the registered data for reconstructing the target contrast. These steps involve learning in both domains with regularisation applied to ensure their consistency. We also compare the reconstruction performance with existing deep learning-based methods using a dataset of brain MRI scans. Results: Extensive experiments demonstrate the superiority of our proposed framework, for up to an 8-fold acceleration rate, compared to state-of-the-art algorithms. Comprehensive analysis and ablation studies further demonstrate the effectiveness of the proposed components. Conclusion: Our dual-domain framework offers a promising approach to multi-contrast MRI reconstruction. It can also be integrated with existing methods to further enhance the reconstruction.
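Purpose: To develop an efficient dual-domain reconstruction framework for multi-contrast MRI, focusing on minimising cross-contrast misalignment in both the image and the frequency domains to improve optimisation. Theory and Methods: The proposed deep learning-based framework optimises the under-sampled target contrast using a fully-sampled reference contrast. The method consists of three key steps: 1) learning to synthesise data resembling the target contrast from the reference contrast; 2) registering the multi-contrast data to reduce inter-scan motion; and 3) using the registered data to reconstruct the target contrast. These steps involve learning in both domains, with regularisation applied to ensure their consistency. We also compare against existing deep learning-based methods on a dataset of brain MRI scans. Results: Extensive experiments demonstrate the superiority of the proposed framework over state-of-the-art algorithms for acceleration rates of up to 8-fold. Comprehensive analysis and ablation studies further demonstrate the effectiveness of the proposed components. Conclusion: Our dual-domain framework offers a promising approach to multi-contrast MRI reconstruction. It can also be integrated with existing methods to further enhance the reconstruction.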
https://arxiv.org/abs/2312.00661
We consider a setting where a population of artificial learners is given, and the objective is to optimize aggregate measures of performance, under constraints on training resources. The problem is motivated by the study of peer learning in human educational systems. In this context, we study natural knowledge diffusion processes in networks of interacting artificial learners. By `natural', we mean processes that reflect human peer learning, where the students' internal state and learning process are mostly opaque, and the main degree of freedom lies in the formation of peer learning groups by a coordinator who can potentially evaluate the learners before assigning them to peer groups. Among other things, we empirically show that such processes indeed make effective use of the training resources, and enable the design of modular neural models that have the capacity to generalize without being prone to overfitting noisy labels.
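We consider a setting in which a population of artificial learners is given and the goal is to optimize aggregate measures of performance under constraints on training resources. The problem is motivated by the study of peer learning in human educational systems. In this context, we study natural knowledge diffusion processes in networks of interacting artificial learners. By "natural", we mean processes that reflect human peer learning, where the students' internal state and learning process are mostly opaque and the main degree of freedom lies in the formation of peer learning groups by a coordinator, who can evaluate the learners before assigning them to peer groups. Among other things, we show empirically that such processes do make effective use of the training resources and enable the design of modular neural models that can generalize without being prone to overfitting noisy labels.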
https://arxiv.org/abs/2312.00660
We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
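We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source task to a target task. We focus on regression tasks, which have received little attention so far, and propose two simple and computationally efficient approaches that estimate transferability from the negative regularized mean squared error of a linear regression model. We prove novel theoretical results that connect our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield results that are 12% to 36% better on average while being at least 27% faster than previous methods.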
https://arxiv.org/abs/2312.00656
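The entry above estimates transferability from the negative regularized mean squared error of a linear regression model. The abstract does not spell out the two estimators, so the sketch below shows only the generic idea under stated assumptions: fit ridge regression from features (as would be extracted by a source model) to target labels, then report the negated objective value as the score. The objective form, the value of `lam`, the function names, and the toy data are all assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def neg_regularized_mse(features, targets, lam=1e-3):
    """Score = -(MSE + lam * ||w||^2) at the ridge solution w (higher is better).

    w minimizes (1/n) * sum_i (y_i - w . f_i)^2 + lam * ||w||^2,
    i.e. it solves (F^T F / n + lam * I) w = F^T y / n.
    """
    n, d = len(features), len(features[0])
    A = [[sum(f[a] * f[b] for f in features) / n + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(f[a] * y for f, y in zip(features, targets)) / n for a in range(d)]
    w = solve(A, rhs)
    mse = sum((y - sum(wi * fi for wi, fi in zip(w, f))) ** 2
              for f, y in zip(features, targets)) / n
    return -(mse + lam * sum(wi * wi for wi in w))

# Toy target labels and two hypothetical feature sets (last column is a bias term).
targets = [1.0, 3.0, 5.0, 7.0]
informative = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
uninformative = [[1.0, 1.0], [-1.0, 1.0], [-1.0, 1.0], [1.0, 1.0]]

print(neg_regularized_mse(informative, targets))
print(neg_regularized_mse(uninformative, targets))
```

Features that linearly predict the targets give a score near zero, while features unrelated to the targets give a strongly negative score, so ranking candidate source models by this score is the intended use. Only the closed-form linear solve is needed, which is consistent with the efficiency emphasized in the abstract.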
Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, the potential in generating high-quality tracking sequences, a crucial aspect in the field of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from the tracklets. TrackDiffusion represents a significant departure from the traditional layout-to-image (L2I) generation and copy-paste synthesis focusing on static image elements like bounding boxes by empowering image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.
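Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, their potential for generating high-quality tracking sequences, a crucial aspect of video perception, has not been fully investigated. To fill this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from tracklets. TrackDiffusion departs significantly from traditional layout-to-image (L2I) generation and copy-paste synthesis by enabling image diffusion models to cover dynamic and continuous tracking trajectories, thereby capturing complex motion details and ensuring instance consistency across video frames. For the first time, we demonstrate that the generated video sequences can be used to train multi-object tracking (MOT) systems, leading to a significant improvement in tracker performance. Experimental results show that our model significantly improves instance consistency in generated video sequences, which leads to improved perceptual metrics. Our approach improves TrackAP by 8.7 and TrackAP$_{50}$ by 11.8 on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.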
https://arxiv.org/abs/2312.00651
Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at this https URL .
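Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, called slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, the crucial aspects include guiding the encoder to generate object-specific slots and ensuring that the decoder uses them during reconstruction. This work introduces two novel techniques: (i) an attention-based self-training approach that distills superior slot-based attention masks from the decoder to the encoder, improving object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is shown experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially on complex real-world images. We provide the implementation code at this https URL.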
https://arxiv.org/abs/2312.00648