The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs thus improving detection performance on classes with no human annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
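A minimal PyTorch sketch (not the authors' implementation) of what such a trainable gated shortcut around a newly added block could look like; the module name, layer sizes, and the tanh gate are assumptions. The point is that a zero-initialized gate makes the new block an identity path, so the pretrained vision-text alignment is guaranteed intact at the start of detection training.

```python
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    """Wraps a newly added (randomly initialized) block with a zero-initialized gate.

    With gate = 0 the module is an identity mapping, so pretrained vision-text-aligned
    features pass through unchanged at step 0; the gate then learns how much of the
    new block to mix in as detection training proceeds."""

    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.new_block = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.new_block(x)

# usage: features keep their pretrained values before any training step
x = torch.randn(2, 256)
block = GatedShortcutBlock(dim=256)
assert torch.allclose(block(x), x)  # identity at initialization
```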
https://arxiv.org/abs/2303.13518
DEtection TRansformer (DETR) started a trend that uses a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in the sparse point clouds are relatively small compared to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, which are rare in the image domain. Considering that instances in 3D are characterized more by their positional information, we emphasize their roles during the modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL .
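A sketch of what a mixed-parameterized positional embedding might look like, assuming it mixes Cartesian and polar views of each point and fuses them with a small MLP; the exact parameterization in P3Former may differ, and the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class MixedParameterizedPE(nn.Module):
    """Illustrative mixed-parameterized positional embedding: each LiDAR point is
    described both in Cartesian (x, y, z) and polar (range, azimuth, z) coordinates,
    and the two views are fused by an MLP into a feature that can be added to the
    backbone features and reused during mask prediction / query updates."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        x, y, z = xyz.unbind(-1)
        rho = torch.sqrt(x ** 2 + y ** 2)          # range in the ground plane
        phi = torch.atan2(y, x)                    # azimuth angle
        mixed = torch.stack([x, y, z, rho, phi, z], dim=-1)
        return self.mlp(mixed)

points = torch.randn(4096, 3)                      # toy LiDAR scan
pe = MixedParameterizedPE()(points)                # (4096, 128)
```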
https://arxiv.org/abs/2303.13509
Diffusion-based models for text-to-image generation have gained immense popularity due to recent advancements in efficiency, accessibility, and quality. Although it is becoming increasingly feasible to perform inference with these systems using consumer-grade GPUs, training them from scratch still requires access to large datasets and significant computational resources. In the case of medical image generation, the availability of large, publicly accessible datasets that include text reports is limited due to legal and ethical concerns. While training a diffusion model on a private dataset may address this issue, it is not always feasible for institutions lacking the necessary computational resources. This work demonstrates that pre-trained Stable Diffusion models, originally trained on natural images, can be adapted to various medical imaging modalities by training text embeddings with textual inversion. In this study, we conducted experiments using medical datasets comprising only 100 samples from three medical modalities. Embeddings were trained in a matter of hours, while still retaining diagnostic relevance in image generation. Experiments were designed to achieve several objectives. Firstly, we fine-tuned the training and inference processes of textual inversion, revealing that larger embeddings and more examples are required. Secondly, we validated our approach by demonstrating a 2\% increase in the diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we performed simulations by interpolating between healthy and diseased states, combining multiple pathologies, and inpainting to show embedding flexibility and control of disease appearance. Finally, the embeddings trained in this study are small (less than 1 MB), which facilitates easy sharing of medical data with reduced privacy concerns.
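A schematic of the textual-inversion idea the abstract relies on: only the embedding of a new placeholder token is optimized while the pretrained diffusion model stays frozen. The placeholder token name and the stubbed loss below are assumptions for illustration, not tied to any specific Stable Diffusion API.

```python
import torch

# Only the embedding of a new token (e.g. a hypothetical "<prostate-mri>" placeholder)
# is trained; the text encoder and diffusion U-Net are frozen. The denoising loss is
# stubbed out here -- in practice it is the diffusion reconstruction loss on the ~100
# medical images, conditioned on a prompt containing the new token.
embed_dim = 768
new_token_embedding = torch.randn(embed_dim, requires_grad=True)
optimizer = torch.optim.AdamW([new_token_embedding], lr=5e-3)

def frozen_denoising_loss(token_embedding: torch.Tensor) -> torch.Tensor:
    target = torch.zeros_like(token_embedding)     # placeholder objective
    return ((token_embedding - target) ** 2).mean()

for step in range(100):
    loss = frozen_denoising_loss(new_token_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned embedding is only embed_dim floats (well under 1 MB), which is why it can
# be shared instead of the underlying medical images.
```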
https://arxiv.org/abs/2303.13430
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: this https URL.
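A toy sketch of the last stage of this pipeline under stated assumptions: the DINO over-segmentation and the attention-masked CLIP encoding are assumed to have happened upstream, and the similarity threshold is arbitrary. It only illustrates how over-segments with similar CLIP-space embeddings could be merged greedily.

```python
import torch
import torch.nn.functional as F

def merge_similar_segments(seg_embeddings: torch.Tensor, threshold: float = 0.8):
    """Greedily merge over-segments whose (CLIP-space) embeddings are similar.
    seg_embeddings: (num_segments, dim) tensor of segment embeddings."""
    emb = F.normalize(seg_embeddings, dim=-1)
    labels = list(range(emb.size(0)))              # each segment starts as its own group
    for i in range(emb.size(0)):
        for j in range(i + 1, emb.size(0)):
            if emb[i] @ emb[j] > threshold:
                labels[j] = labels[i]              # merge j into i's group
    return labels

# usage with random stand-ins for segment embeddings
fake_embeddings = torch.randn(6, 512)
print(merge_similar_segments(fake_embeddings))
```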
https://arxiv.org/abs/2303.13396
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation which enables data augmentation for self-finetuning and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
https://arxiv.org/abs/2303.13386
In this paper, we introduce a Variational Autoencoder (VAE) based training approach that can compress and decompress cancer pathology slides at a compression ratio of 1:512, which is better than the previously reported state of the art (SOTA) in the literature, while still maintaining accuracy in clinical validation tasks. The compression approach was tested on more common computer vision datasets such as CIFAR10, and we explore which image characteristics enable this compression ratio on cancer imaging data but not generic images. We generate and visualize embeddings from the compressed latent space and demonstrate how they are useful for clinical interpretation of data, and how in the future such latent embeddings can be used to accelerate search of clinical imaging data.
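A back-of-the-envelope check of what a 1:512 compression ratio implies for a pathology tile; the tile size, latent shape, and dtypes below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

tile = np.zeros((512, 512, 3), dtype=np.uint8)     # raw RGB tile: 786,432 bytes
latent = np.zeros((16, 16, 6), dtype=np.uint8)     # hypothetical VAE latent: 1,536 bytes

ratio = tile.nbytes / latent.nbytes
print(f"compression ratio 1:{ratio:.0f}")           # -> compression ratio 1:512
```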
https://arxiv.org/abs/2303.13332
Unsupervised Domain Adaptation Regression (DAR) aims to bridge the domain gap between a labeled source dataset and an unlabelled target dataset for regression problems. Recent works mostly focus on learning a deep feature encoder by minimizing the discrepancy between source and target features. In this work, we present a different perspective on the DAR problem by analyzing the closed-form ordinary least squares (OLS) solution to the linear regressor in the deep domain adaptation context. Rather than aligning the original feature embedding space, we propose to align the inverse Gram matrix of the features, which is motivated by its presence in the OLS solution and the Gram matrix's ability to capture the feature correlations. Specifically, we propose a simple yet effective DAR method which leverages the pseudo-inverse low-rank property to align the scale and angle in a selected subspace generated by the pseudo-inverse Gram matrices of the two domains. We evaluate our method on three domain adaptation regression benchmarks. Experimental results demonstrate that our method achieves state-of-the-art performance. Our code is available at this https URL.
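A worked NumPy sketch of the quantity that motivates the method: the closed-form OLS regressor w = (XᵀX)⁺ Xᵀ y makes the (pseudo-)inverse Gram matrix appear explicitly. The plain Frobenius gap printed below is a simplified stand-in for the scale/angle alignment in a selected subspace described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(200, 16)), rng.normal(size=(150, 16))   # source / target features
ys = rng.normal(size=200)                                         # source labels only

gram_inv_s = np.linalg.pinv(Xs.T @ Xs)      # source pseudo-inverse Gram matrix
gram_inv_t = np.linalg.pinv(Xt.T @ Xt)      # target pseudo-inverse Gram matrix

w_ols = gram_inv_s @ Xs.T @ ys              # closed-form OLS regressor on the source
alignment_gap = np.linalg.norm(gram_inv_s - gram_inv_t, ord="fro")
print(w_ols.shape, alignment_gap)           # the method drives such a gap down during adaptation
```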
https://arxiv.org/abs/2303.13325
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
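An illustrative grounding step (not the authors' code): the T5 output gives an entity label plus a truncated KG embedding, and candidate entities sharing that label are ranked by similarity between the predicted truncated embedding and the equally truncated stored embeddings. The QIDs, labels, and the 4-dimensional truncation below are made up for the example.

```python
import numpy as np

# Toy KG: two entities share the surface label "Paris"; the second QID is hypothetical.
kg = {
    "Q90":       {"label": "Paris", "emb": np.array([0.9, 0.1, -0.3, 0.2])},
    "Q_PARIS_2": {"label": "Paris", "emb": np.array([-0.5, 0.8, 0.1, 0.0])},
}

def ground(predicted_label: str, predicted_trunc_emb: np.ndarray) -> str:
    """Return the entity ID whose truncated KG embedding best matches the prediction."""
    candidates = [(qid, e) for qid, e in kg.items() if e["label"] == predicted_label]
    scores = [(qid, float(predicted_trunc_emb @ e["emb"])) for qid, e in candidates]
    return max(scores, key=lambda s: s[1])[0]

print(ground("Paris", np.array([0.8, 0.2, -0.2, 0.1])))   # -> "Q90"
```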
https://arxiv.org/abs/2303.13284
Prompt tuning is an effective way to adapt a pre-trained visual-language model (VLM) to a downstream task using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, this specific textual knowledge generalizes worse to unseen classes because it forgets the essential general textual knowledge, which has a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. The key insight of KgCoOp is that forgetting of essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. In particular, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding KgCoOp upon the contrastive loss yields a discriminative prompt for both seen and unseen tasks. Extensive evaluation on several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, \emph{i.e.,} it achieves better performance with less training time.
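A sketch of the objective as the abstract describes it: the usual classification loss plus a term penalizing the gap between text embeddings from the learnable prompt and from the hand-crafted prompt. The trade-off weight and the cosine-distance form of the penalty are assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def kgcoop_loss(logits, labels, learned_text_emb, handcrafted_text_emb, lam=8.0):
    """Classification loss plus a discrepancy penalty between the per-class text
    embeddings of the learnable prompt and the hand-crafted prompt (e.g. "a photo
    of a <class>"). lam is an assumed trade-off weight."""
    ce = F.cross_entropy(logits, labels)
    learned = F.normalize(learned_text_emb, dim=-1)
    handcrafted = F.normalize(handcrafted_text_emb, dim=-1)
    kg = (1.0 - (learned * handcrafted).sum(-1)).mean()   # cosine distance per class
    return ce + lam * kg

# toy usage: 4 images, 10 classes, 512-d CLIP text embeddings
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = kgcoop_loss(logits, labels, torch.randn(10, 512), torch.randn(10, 512))
```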
https://arxiv.org/abs/2303.13283
Self-supervised learning is attracting large attention in point cloud understanding. However, exploring discriminative and transferable features still remains challenging due to the irregularity and sparsity of point clouds. We propose a geometrically and adaptively masked auto-encoder for self-supervised learning on point clouds, termed \textit{PointGame}. PointGame contains two core components: GATE and EAT. GATE stands for the geometrical and adaptive token embedding module; it not only absorbs the conventional wisdom of geometric descriptors, which capture surface shape effectively, but also exploits adaptive saliency to focus on the salient parts of a point cloud. EAT stands for the external attention-based Transformer encoder with linear computational complexity, which increases the efficiency of the whole pipeline. Unlike cutting-edge unsupervised learning models, PointGame leverages geometric descriptors to perceive surface shapes and adaptively mines discriminative features from training data. PointGame showcases clear advantages over its competitors on various downstream tasks under both global and local fine-tuning strategies. The code and pre-trained models will be publicly available.
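A simplified external-attention layer to illustrate why the EAT encoder has linear complexity: tokens attend to a small set of learnable external memory slots rather than to each other. The original external-attention formulation (Guo et al.) uses a double normalization; a plain softmax over the slots is used here for brevity, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Tokens attend to learnable external memory slots instead of to each other,
    so the cost grows linearly with the number of point tokens."""

    def __init__(self, dim: int, num_slots: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, num_slots, bias=False)   # memory "keys"
        self.mv = nn.Linear(num_slots, dim, bias=False)   # memory "values"

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, n_tokens, dim)
        attn = torch.softmax(self.mk(x), dim=-1)           # (batch, n_tokens, num_slots)
        return self.mv(attn)                               # (batch, n_tokens, dim)

tokens = torch.randn(2, 1024, 256)                         # point tokens
out = ExternalAttention(256)(tokens)                       # same shape, linear complexity
```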
https://arxiv.org/abs/2303.13100
In Task-Oriented Dialogue (TOD) systems, detecting and inducing new intents are two main challenges in applying the system in the real world. In this paper, we suggest the semantic multi-view model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantics. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning the multiple domain knowledge. We introduce a novel method, PGT, which employs a Siamese network to fine-tune the model with a clustering method directly. Our model can learn how to cluster dialogue utterances by using PGT. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves the Open Intent Induction performance compared to baseline systems.
https://arxiv.org/abs/2303.13099
The ever-increasing demand for intuitive interactions in Virtual Reality has triggered a boom in the realm of Facial Expression Recognition (FER). To address the limitations of existing approaches (e.g., narrow receptive fields and homogeneous supervisory signals) and further cement the capacity of FER tools, a novel multifarious supervision-steering Transformer for FER in the wild is proposed in this paper. Referred to as FER-former, our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision. Specifically, to dig deep into the merits of combining features provided by prevailing CNNs and Transformers, a hybrid stem is designed to cascade two types of learning paradigms simultaneously. Within it, a FER-specific transformer mechanism is devised to characterize conventional hard one-hot label-focusing and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domain-steering supervision module is proposed to make image features also have text-space semantic correlations by supervising the similarity between image features and text features. On top of the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over the existing state-of-the-art methods.
https://arxiv.org/abs/2303.12997
This work focuses on sign language retrieval, a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge is raised by data scarcity: sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by adapting a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning, or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: this https URL.
https://arxiv.org/abs/2303.12793
We introduce Uni-Fusion, a universal continuous mapping framework for surfaces, surface properties (color, infrared, etc.) and more (latent features in CLIP embedding space, etc.). We propose the first Universal Implicit Encoding model that supports encoding of both geometry and various types of properties (RGB, infrared, features, etc.) without the need for any training. Based on that, our framework divides the point cloud into regular grid voxels and produces a latent feature in each voxel to form a Latent Implicit Map (LIM) for geometries and arbitrary properties. Then, by fusing the Local LIM of a new frame into the Global LIM, incremental reconstruction is achieved. Encoded with the corresponding types of data, our Latent Implicit Map is capable of generating continuous surfaces, surface property fields, surface feature fields, and any other possible options. To demonstrate the capabilities of our model, we implement three applications: (1) incremental reconstruction for surfaces and color, (2) 2D-to-3D fabricated property transfer, and (3) open-vocabulary scene understanding by producing a text CLIP feature field on surfaces. We evaluate Uni-Fusion by comparison in the corresponding applications, where it shows high flexibility across applications while performing best or competitively. The project page of Uni-Fusion is available at this https URL
https://arxiv.org/abs/2303.12678
This paper describes our submission to ICASSP 2023 MUG Challenge Track 4, Keyphrase Extraction, which aims to extract keyphrases most relevant to the conference theme from conference materials. We model the challenge as a single-class Named Entity Recognition task and develop techniques for better performance on the challenge: For the data preprocessing, we encode the split keyphrases after word segmentation. In addition, we increase the amount of input information that the model can accept at one time by fusing multiple preprocessed sentences into one segment. We replace the loss function with the multi-class focal loss to address the sparseness of keyphrases. Besides, we score each appearance of keyphrases and add an extra output layer to fit the scores used to rank keyphrases. Exhaustive evaluations are performed to find the best combination of the word segmentation tool, the pre-trained embedding model, and the corresponding hyperparameters. With these proposals, we scored 45.04 on the final test set.
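A generic multi-class focal loss of the kind the abstract swaps in for cross-entropy to cope with sparse keyphrase tokens: well-classified (easy, mostly non-keyphrase) tokens are down-weighted by (1 - p_t)^gamma. The gamma/alpha values below are common defaults, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def multiclass_focal_loss(logits, targets, gamma=2.0, alpha=None):
    """logits: (n_tokens, n_classes); targets: (n_tokens,) class indices."""
    log_p = F.log_softmax(logits, dim=-1)
    p_t = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1).exp()   # prob of the true class
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t + 1e-12)           # down-weight easy tokens
    if alpha is not None:                                             # optional per-class weights
        loss = loss * alpha.gather(0, targets)
    return loss.mean()

# toy usage: token-level tags over {O, B-KEY, I-KEY}
logits = torch.randn(16, 3)
tags = torch.randint(0, 3, (16,))
print(multiclass_focal_loss(logits, tags))
```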
https://arxiv.org/abs/2303.13463
A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large-scale 3D scenes, especially for distant and/or occluded objects. Despite recent efforts by Transformers with their long-sequence modeling capability, they fail to properly balance accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid through conducting self-attention on the top level and then recursively propagates to the level below restricted by the octants, which captures rich global context in a coarse-to-fine manner while keeping the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of the semantic-aware positional embedding and attention mask, to fully exploit semantic and geometry clues. Extensive experiments are conducted on the Waymo Open Dataset and the KITTI Dataset, and OcTr reaches new state-of-the-art results.
https://arxiv.org/abs/2303.12621
Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.
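A sketch of the Similarity Distribution Matching idea as described in the abstract: softmax-normalized image-to-text similarity distributions are pushed, via KL divergence, toward a normalized label-matching distribution in which pairs sharing a person ID count as matches. The temperature is an assumed value and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_distribution_matching(img_emb, txt_emb, person_ids, tau=0.02):
    """img_emb, txt_emb: (B, d) embeddings of paired images and texts;
    person_ids: (B,) identity labels; tau: assumed softmax temperature."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau                                   # (B, B) similarity logits
    pred = F.log_softmax(sim, dim=1)                            # image-to-text similarity distribution

    match = (person_ids.unsqueeze(1) == person_ids.unsqueeze(0)).float()
    target = match / match.sum(dim=1, keepdim=True)             # normalized label-matching distribution

    return F.kl_div(pred, target, reduction="batchmean")

# toy batch: 4 image/text pairs, two of which share an identity
loss = similarity_distribution_matching(
    torch.randn(4, 512), torch.randn(4, 512), torch.tensor([0, 1, 1, 2])
)
```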
https://arxiv.org/abs/2303.12501
Renal transplantation emerges as the most effective solution for end-stage renal disease. Arising from complex causes, a substantial risk of chronic transplant dysfunction persists and may lead to graft loss. Medical imaging plays a substantial role in renal transplant monitoring in clinical practice. However, graft supervision is multi-disciplinary, notably joining nephrology, urology, and radiology, and identifying robust biomarkers from such high-dimensional and complex data for prognosis is challenging. In this work, taking inspiration from the recent success of Large Language Models (LLMs), we propose MEDIMP -- Medical Images and Prompts -- a model to learn meaningful multi-modal representations of renal transplant Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE MRI) by incorporating structural clinicobiological data after translating them into text prompts. MEDIMP is based on contrastive learning from joint text-image paired embeddings to perform this challenging task. Moreover, we propose a framework that generates medical prompts using automatic textual data augmentations from LLMs. Our goal is to learn meaningful manifolds of renal transplant DCE MRI, relevant for the prognosis of the transplant or patient status (2, 3, and 4 years after the transplant), fully exploiting the available multi-modal data in the most efficient way. Extensive experiments and comparisons with other renal transplant representation learning methods with limited data prove the effectiveness of MEDIMP in a relevant clinical setting, giving new directions toward medical prompts. Our code is available at this https URL.
https://arxiv.org/abs/2303.12445
We propose a content-based system for matching video and background music. The system aims to address the challenges of music recommendation for new users or new music, given short-form videos. To this end, we propose a cross-modal framework, VMCML, that finds a shared embedding space between video and music representations. To ensure the embedding space can be effectively shared by both representations, we leverage a CosFace loss based on margin-based cosine similarity. Furthermore, we establish a large-scale dataset called MSVD, in which we provide 390 individual music tracks and the corresponding 150,000 matched videos. We conduct extensive experiments on the Youtube-8M and our MSVD datasets. Our quantitative and qualitative results demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art video and music matching performance.
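A sketch of a margin-based cosine (CosFace-style) loss applied to the shared video-music space the abstract describes: the cosine similarity of each video to its matched track is penalized by a margin before a scaled softmax, pulling matched pairs together and pushing others apart. The scale s and margin m are common defaults, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def cosface_loss(video_emb, music_emb, labels, s=30.0, m=0.35):
    """video_emb: (n_videos, d); music_emb: (n_tracks, d);
    labels: (n_videos,) index of each video's matched track."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(music_emb, dim=-1)
    cos = v @ t.t()                                         # (n_videos, n_tracks)
    one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
    logits = s * (cos - m * one_hot)                        # subtract margin only for the true track
    return F.cross_entropy(logits, labels)

# toy usage: 8 videos matched against 5 candidate music tracks
loss = cosface_loss(torch.randn(8, 256), torch.randn(5, 256), torch.randint(0, 5, (8,)))
```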
https://arxiv.org/abs/2303.12379
Knowledge graph embedding (KGE), which maps entities and relations into vector representations, is essential for downstream tasks. Conventional KGE methods require relatively high-dimensional entity representations to preserve the structural information of the knowledge graph, but this leads to oversized model parameters. Recent methods reduce model parameters by adopting low-dimensional entity representations, while developing techniques (e.g., knowledge distillation) to compensate for the reduced dimension. However, such operations produce degraded model accuracy and only limited reduction of model parameters. Specifically, we view the concatenation of all entity representations as an embedding layer, and then conventional KGE methods that adopt high-dimensional entity representations are equivalent to enlarging the width of the embedding layer to gain expressiveness. To achieve parameter efficiency without sacrificing accuracy, we instead increase the depth and propose a deeper embedding network for entity representations, i.e., a narrow embedding layer and a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that the proposed method (implemented based on TransE and DistMult) with 4-dimensional entity representations achieves more accurate link prediction results than counterpart parameter-efficient KGE methods and strong KGE baselines, including TransE and DistMult with 512-dimensional entity representations.
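A sketch of the deeper-instead-of-wider idea, assuming a very narrow (4-d) entity embedding lifted through a small multi-layer network before a TransE-style score; the layer sizes are illustrative and, for simplicity, the relation embedding is kept at the lifted dimension rather than being lifted itself.

```python
import torch
import torch.nn as nn

class LiftNet(nn.Module):
    """Narrow entity embedding layer followed by a multi-layer dimension lifting network."""

    def __init__(self, num_entities: int, low_dim: int = 4, hidden: int = 64, out_dim: int = 256):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, low_dim)   # the narrow embedding layer
        self.lift = nn.Sequential(
            nn.Linear(low_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, entity_ids: torch.Tensor) -> torch.Tensor:
        return self.lift(self.entity_emb(entity_ids))

def transe_score(h, r, t):
    # TransE plausibility: smaller ||h + r - t|| means a more plausible triple
    return -(h + r - t).norm(p=1, dim=-1)

model = LiftNet(num_entities=1000)
h = model(torch.tensor([3]))
t = model(torch.tensor([42]))
r = torch.randn(1, 256)                                          # toy relation embedding
print(transe_score(h, r, t))
```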
https://arxiv.org/abs/2303.12816