Diffusion models have gained increasing popularity for their generative capabilities. Recently, there has been a surge of demand for generating customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties on the relation prompt: 1) the relation prompt should capture the interaction between objects, enforced by the preposition prior; 2) the relation prompt should be disentangled from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
https://arxiv.org/abs/2303.13495
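To make the relation-focal importance sampling concrete, here is a minimal sketch of one way to skew diffusion-timestep sampling toward noisier steps, where high-level interactions rather than texture and color are decided; the polynomial density and the `skew` parameter are illustrative assumptions, not the paper's exact schedule:

```python
import torch

def relation_focal_timesteps(batch_size, T=1000, skew=2.0, device="cpu"):
    """Sample diffusion timesteps with probability increasing in t, so that
    training emphasizes noisier steps where high-level layout/interaction
    (rather than texture/color) is determined. Inverse-CDF sampling of the
    illustrative density p(x) proportional to x^(skew-1) on [0, 1]."""
    u = torch.rand(batch_size, device=device)
    t = (u ** (1.0 / skew) * (T - 1)).long()  # skew > 1 favors large t
    return t

# Usage: draw timesteps this way when computing the denoising loss.
t = relation_focal_timesteps(batch_size=8)
```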
Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, to generalize to different data distributions and tasks with unseen semantic forms, and to ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. Its modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
https://arxiv.org/abs/2303.13483
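As a toy illustration of the kind of high-arity functional module an NS3D program might call, the sketch below scores each object for the ternary relation "between" two anchors using pure geometry; NS3D's actual modules are learned neural networks, so this geometric stand-in (and every name in it) is purely hypothetical:

```python
import torch

def execute_between(obj_scores, centers, a_idx, b_idx, sigma=0.5):
    """Ternary relation module: down-weight objects far from the segment
    joining anchor objects a and b. obj_scores: (N,) soft object mask,
    centers: (N, 3) object centroids."""
    a, b = centers[a_idx], centers[b_idx]
    ab = b - a
    # Project each centroid onto the a-b segment and measure its distance.
    t = ((centers - a) @ ab / (ab @ ab)).clamp(0.0, 1.0)
    proj = a + t.unsqueeze(-1) * ab
    dist = (centers - proj).norm(dim=-1)
    return obj_scores * torch.exp(-dist / sigma)
```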
Unsupervised Domain Adaptation Regression (DAR) aims to bridge the domain gap between a labeled source dataset and an unlabeled target dataset for regression problems. Recent works mostly focus on learning a deep feature encoder by minimizing the discrepancy between source and target features. In this work, we present a different perspective on the DAR problem by analyzing the closed-form ordinary least squares (OLS) solution to the linear regressor in the deep domain adaptation context. Rather than aligning the original feature embedding space, we propose to align the inverse Gram matrix of the features, motivated by its presence in the OLS solution and the Gram matrix's ability to capture feature correlations. Specifically, we propose a simple yet effective DAR method which leverages the pseudo-inverse low-rank property to align the scale and angle in a selected subspace generated by the pseudo-inverse Gram matrices of the two domains. We evaluate our method on three domain adaptation regression benchmarks. Experimental results demonstrate that our method achieves state-of-the-art performance. Our code is available at this https URL.
https://arxiv.org/abs/2303.13325
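A rough sketch of the alignment idea, assuming we summarize the pseudo-inverse Gram matrix of each domain by its top singular directions (angle) and singular values (scale); the rank k and the exact loss form are guesses for illustration, not the paper's formulation:

```python
import torch

def gram_subspace_stats(features, k=16):
    """Pseudo-inverse Gram matrix of a feature batch (n, d), summarized by
    a rank-k subspace: principal directions (angle) and singular values
    (scale). A hypothetical simplification of the alignment targets."""
    n = features.shape[0]
    G = features.T @ features / n      # (d, d) Gram matrix
    G_pinv = torch.linalg.pinv(G)      # pseudo-inverse (low rank in practice)
    U, S, _ = torch.linalg.svd(G_pinv)
    return U[:, :k], S[:k]

def alignment_loss(src_feat, tgt_feat, k=16):
    U_s, S_s = gram_subspace_stats(src_feat, k)
    U_t, S_t = gram_subspace_stats(tgt_feat, k)
    scale = (S_s - S_t).abs().mean()                      # align scale
    angle = 1 - (U_s.T @ U_t).diagonal().abs().mean()     # align angle
    return scale + angle
```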
Domain generalization (DG) aims to alleviate the poor generalization capability of deep neural networks by learning a model from multiple source domains. A classical solution to DG is domain augmentation, the common belief being that diversifying source domains is conducive to out-of-distribution generalization. However, these claims are understood intuitively rather than mathematically. Our explorations empirically reveal that the correlation between model generalization and the diversity of domains may not be strictly positive, which limits the effectiveness of domain augmentation. This work therefore aims to guarantee and further enhance the validity of this strand. To this end, we propose a new perspective on DG that recasts it as a convex game between domains. We first encourage each diversified domain to enhance model generalization by elaborately designing a regularization term based on supermodularity. Meanwhile, a sample filter is constructed to eliminate low-quality samples, thereby avoiding the impact of potentially harmful information. Our framework presents a new avenue for the formal analysis of DG; heuristic analysis and extensive experiments demonstrate its rationality and effectiveness.
https://arxiv.org/abs/2303.13297
Modern surgeries are performed in complex and dynamic settings, including ever-changing interactions between medical staff, patients, and equipment. The holistic modeling of the operating room (OR) is, therefore, a challenging but essential task, with the potential to optimize the performance of surgical teams and aid in developing new surgical technologies to improve patient outcomes. The holistic representation of surgical scenes as semantic scene graphs (SGG), where entities are represented as nodes and relations between them as edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results, achieving a +5% increase and a new SOTA of 0.88 in macro F1. This work opens the path for representing the entire surgery history with memory scene graphs and improves the holistic understanding in the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks.
https://arxiv.org/abs/2303.13293
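The memory component itself can be as simple as a rolling buffer of previous predictions; the sketch below keeps the last k scene graphs as the temporal context, while the learned fusion with point-cloud and image features (the paper's actual contribution) is left abstract:

```python
from collections import deque

class SceneGraphMemory:
    """Minimal memory of the last k predicted scene graphs. At each time
    step, the stored graphs serve as the temporal representation guiding
    the current prediction; the fusion network is not sketched here."""
    def __init__(self, k=5):
        self.buffer = deque(maxlen=k)

    def update(self, scene_graph):
        # scene_graph: list of (subject, relation, object) triplets
        self.buffer.append(scene_graph)

    def temporal_context(self):
        return list(self.buffer)  # oldest to newest
```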
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results on the LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets for end-to-end KGQA over Wikidata.
https://arxiv.org/abs/2303.13284
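A plausible sketch of the grounding step: look up candidate Wikidata IDs by the predicted label, then disambiguate with the predicted truncated embedding. The `kg_label_index` mapping, the candidate cutoff, and the truncation length are assumptions for illustration:

```python
import numpy as np

def ground_entity(pred_label, pred_trunc_emb, kg_label_index, kg_embeddings, k=10):
    """Ground a predicted entity label to a KG ID: fetch candidate IDs
    sharing the label, then rank them by the distance between the model's
    predicted truncated embedding and each candidate's truncated KG
    embedding. kg_label_index: label -> list of IDs (assumed available);
    kg_embeddings: ID -> full embedding vector."""
    candidates = kg_label_index.get(pred_label, [])[:k]
    if not candidates:
        return None
    d = len(pred_trunc_emb)
    dists = [np.linalg.norm(kg_embeddings[c][:d] - pred_trunc_emb)
             for c in candidates]
    return candidates[int(np.argmin(dists))]
```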
This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to the two tasks is measuring the underlying relationship between datasets. This requires a suitable dataset vectorization scheme, which should preserve as much discriminative dataset information as possible so that the distance between the resulting dataset vectors reflects dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to a dataset-level bag consisting of semantic prototypes. Specifically, we develop a codebook consisting of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a certain prototype in the codebook and obtain a K-dimensional histogram. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset semantic distribution. Furthermore, BoP representations cooperate well with Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Although very simple, BoP consistently shows its advantage over existing representations on a series of benchmarks for the two dataset-level tasks.
https://arxiv.org/abs/2303.13251
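The BoP encoding as described is straightforward to sketch; K, the clustering backend, and the upstream feature extractor are unspecified in the abstract and chosen here for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def build_codebook(reference_features, K=64, seed=0):
    """Cluster reference-set image features into K semantic prototypes."""
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(reference_features)

def bop_vector(codebook, dataset_features):
    """Quantize each image feature to its nearest prototype and return the
    normalized K-dimensional histogram (the BoP representation)."""
    assignments = codebook.predict(dataset_features)
    hist = np.bincount(assignments, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Dataset-to-dataset similarity via Jensen-Shannon; note scipy's
# jensenshannon returns the JS *distance* (square root of the divergence).
# js_div = jensenshannon(bop_vector(cb, feats_a), bop_vector(cb, feats_b)) ** 2
```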
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish because they have far less annotated data than frequent predicates. Existing re-balancing strategies try to handle this via prior rules but remain confined to pre-defined conditions, which do not scale across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.
https://arxiv.org/abs/2303.13233
Machine learning algorithms, especially Neural Networks (NNs), are a valuable tool for approximating non-linear relationships, such as the AC-Optimal Power Flow (AC-OPF), with considerable accuracy, achieving a speedup of several orders of magnitude when deployed. In the power systems literature, the NNs are often trained with a fixed dataset generated prior to the training process. In this paper, we show that adapting the NN training dataset during training can improve the NN performance and substantially reduce its worst-case violations. This paper proposes an algorithm that identifies and enriches the training dataset with critical datapoints that reduce the worst-case violations and delivers a neural network with improved worst-case performance guarantees. We demonstrate the performance of our algorithm on four test power systems, ranging from 39 to 162 buses.
https://arxiv.org/abs/2303.13228
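One hedged reading of the dataset-adaptation loop: score a pool of candidate operating points by the constraint violation the current NN incurs, label the worst offenders with the exact solver, and fold them back into the training set. `solve_opf` and `violation_fn` are placeholders for a ground-truth AC-OPF solver and a physics-based violation measure, not the paper's exact procedure:

```python
import numpy as np

def enrich_training_set(model, X_train, y_train, candidate_pool,
                        solve_opf, violation_fn, n_add=100):
    """One round of dataset enrichment: find candidate datapoints where the
    NN's prediction violates constraints the most, label them exactly, and
    append them to the training set. `model` is any regressor exposing
    predict(); the next step would be retraining on the enlarged set."""
    preds = model.predict(candidate_pool)
    scores = np.array([violation_fn(x, p) for x, p in zip(candidate_pool, preds)])
    worst = np.argsort(scores)[-n_add:]               # most-violating datapoints
    X_new = candidate_pool[worst]
    y_new = np.array([solve_opf(x) for x in X_new])   # exact labels
    return np.vstack([X_train, X_new]), np.vstack([y_train, y_new])
```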
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly on predicting predicates that are less represented, due to the inherently biased distribution of the training data. In this paper, we take a closer look at the predicates and identify that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern, to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
https://arxiv.org/abs/2303.13209
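A minimal sketch of the decoupling, assuming each predicate index maps to an (actional, spatial) pattern pair such as sit_above -> (sit, above); recombining pattern probabilities by their product is one natural choice, not necessarily the paper's exact mapping function:

```python
import torch
import torch.nn as nn

class DecoupledPredicateHead(nn.Module):
    """Decoupled label learning sketch: one classifier for actional
    patterns, one for spatial patterns; a predicate's score is the product
    of its two pattern probabilities. action_ids[i] / spatial_ids[i] give
    predicate i's pattern indices (assumed contiguous from 0)."""
    def __init__(self, feat_dim, action_ids, spatial_ids):
        super().__init__()
        self.action_cls = nn.Linear(feat_dim, len(set(action_ids)))
        self.spatial_cls = nn.Linear(feat_dim, len(set(spatial_ids)))
        self.register_buffer("a_idx", torch.tensor(action_ids))
        self.register_buffer("s_idx", torch.tensor(spatial_ids))

    def forward(self, feat):
        p_a = self.action_cls(feat).softmax(-1)   # actional pattern probs
        p_s = self.spatial_cls(feat).softmax(-1)  # spatial pattern probs
        # Map pattern probabilities back to per-predicate scores.
        return p_a[:, self.a_idx] * p_s[:, self.s_idx]
```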
Non-additive uncertainty theories, typically possibility theory, belief functions, and imprecise probabilities, share a common feature with modal logic: the duality properties between possibility and necessity measures, between belief and plausibility functions, as well as between upper and lower probabilities, extend the duality between possibility and necessity modalities to the graded environment. It has been shown that the all-or-nothing version of possibility theory can be exactly captured by a minimal epistemic logic (MEL) that uses a very small fragment of the KD modal logic, without resorting to relational semantics. The case of belief functions has also been studied independently, and a belief function logic has been obtained by extending the modal logic S5 to graded modalities using Łukasiewicz logic, albeit using relational semantics. This paper shows that a simpler belief function logic can be devised by adding Łukasiewicz logic on top of MEL. It allows for a more natural semantics in terms of Shafer's basic probability assignments.
https://arxiv.org/abs/2303.13168
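For reference, the dualities the abstract alludes to, written out explicitly (these are the standard definitions, not results specific to this paper):

```latex
\begin{align*}
  N(A)             &= 1 - \Pi(A^{c})          && \text{necessity / possibility} \\
  \mathrm{Bel}(A)  &= 1 - \mathrm{Pl}(A^{c})  && \text{belief / plausibility} \\
  \underline{P}(A) &= 1 - \overline{P}(A^{c}) && \text{lower / upper probability}
\end{align*}
```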
Out-of-distribution (OOD) medical images are frequently encountered, e.g., because of site or scanner differences, or image corruption. OOD images come with a risk of incorrect image segmentation, potentially negatively affecting downstream diagnoses or treatment. To ensure robustness to such incorrect segmentations, we propose Laplacian Segmentation Networks (LSN) that jointly model epistemic (model) and aleatoric (data) uncertainty in image segmentation. We capture data uncertainty with a spatially correlated logit distribution. For model uncertainty, we propose the first Laplace approximation of the weight posterior that scales to large neural networks with skip connections and high-dimensional outputs. Empirically, we demonstrate that modelling spatial pixel correlation allows the Laplacian Segmentation Network to successfully assign high epistemic uncertainty to out-of-distribution objects appearing within images.
https://arxiv.org/abs/2303.13123
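A sketch of one way to realize a spatially correlated logit distribution with a low-rank Gaussian; the shapes and the low-rank parameterization are assumptions used to make the idea concrete, not the paper's implementation:

```python
import torch

def sample_correlated_logits(mean_logits, cov_factor, n_samples=8):
    """Sample spatially correlated logit maps from a low-rank Gaussian:
    logits = mu + P @ z with z ~ N(0, I). mean_logits: (C, HW) per-class
    mean logits over flattened pixels; cov_factor: (C, HW, r) low-rank
    covariance factor coupling pixels within each class map."""
    C, HW, r = cov_factor.shape
    z = torch.randn(n_samples, C, r, 1)
    noise = (cov_factor.unsqueeze(0) @ z).squeeze(-1)   # (n, C, HW)
    samples = mean_logits.unsqueeze(0) + noise
    probs = samples.softmax(dim=1)
    # Mean predictive distribution; the spread across samples reflects
    # the (aleatoric) data uncertainty being modeled.
    return probs.mean(0)
```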
Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, there is a lack of efforts to reflect such behaviors by pathologists in the cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: the proposed OCELOT, the public TIGER, and the internal CARP datasets. On the OCELOT test set in particular, we show an improvement of up to 6.79 in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at this https URL, are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computational pathology.
https://arxiv.org/abs/2303.13110
Existing Optimal Transport (OT) methods mainly derive the optimal transport plan/matching under the criterion of transport cost/distance minimization, which may cause incorrect matching in some cases. In many applications, annotating a few matched keypoints across domains is reasonable and imposes little annotation burden. It is valuable to investigate how to leverage the annotated keypoints to guide the correct matching in OT. In this paper, we propose a novel KeyPoint-Guided model by ReLation preservation (KPG-RL) that searches for the optimal matching (i.e., transport plan) guided by the keypoints in OT. To impose the keypoint guidance in OT, we first propose a mask-based constraint on the transport plan that preserves the matching of keypoint pairs. Second, we propose to preserve the relation of each data point to the keypoints to guide the matching. The proposed KPG-RL model can be solved by Sinkhorn's algorithm and is applicable even when distributions are supported in different spaces. We further utilize the relation preservation constraint in the Kantorovich Problem and the Gromov-Wasserstein model to impose the guidance of keypoints in them. Meanwhile, the proposed KPG-RL model is extended to the partial OT setting. Moreover, we derive the dual formulation of the KPG-RL model, which is solved using deep learning techniques. Based on the learned transport plan from dual KPG-RL, we propose a novel manifold barycentric projection to transport source data to the target domain. As applications, we apply the proposed KPG-RL model to heterogeneous domain adaptation and image-to-image translation. Experiments verify the effectiveness of the proposed approach.
https://arxiv.org/abs/2303.13102
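A simplified sketch of the mask-based keypoint constraint on top of a standard entropic OT solver (POT's ot.sinkhorn); the relation-preservation term is omitted for brevity, and the large penalty value is an implementation convenience rather than the paper's formulation:

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def keypoint_guided_plan(a, b, M, keypairs, reg=0.05, big=1e3):
    """Entropic OT with a mask-style keypoint constraint: for every
    annotated keypoint pair (i, j), make matching i to anything but j (and
    vice versa) prohibitively expensive, so mass flows only through the
    annotated pair. a, b: marginals; M: cost matrix; keypairs: [(i, j)]."""
    M = M / M.max()            # normalize costs before masking
    for i, j in keypairs:
        M[i, :] += big         # forbid row i ...
        M[:, j] += big         # ... and column j,
        M[i, j] -= 2 * big     # except the annotated pair itself
    # exp(-big/reg) underflows to 0, which acts as the hard mask here.
    return ot.sinkhorn(a, b, M, reg)
```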
Channel pruning can effectively reduce both the computational cost and memory footprint of the original network while maintaining comparable accuracy. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend channel pruning methods to 3D point-based neural networks (PNNs). Directly applying 2D CNN channel pruning methods to PNNs undermines their performance because of the different representations of 2D images and 3D point clouds, as well as the disparity in network architectures. In this paper, we propose CP$^3$, a Channel Pruning Plug-in for Point-based networks. CP$^3$ is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs. Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the points discarded in PNNs' sampling process and reconsiders their potentially-exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP$^3$ consistently improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S on ScanObjectNN achieves an accuracy of 88.52% with a pruning rate of 57.8%, outperforming baseline pruning methods by 1.94% in accuracy.
https://arxiv.org/abs/2303.13097
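A hypothetical reading of the coordinate-enhanced channel importance metric: combine each channel's activation magnitude with how strongly it correlates with the points' xyz coordinates. This is a guess at the flavor of such a metric, not CP$^3$'s actual definition:

```python
import torch

def coord_enhanced_importance(channel_feats, coords, eps=1e-8):
    """Score channels by activation magnitude weighted by an approximate
    Pearson correlation with the point coordinates, so channels carrying
    geometric (dimensional) information rank higher.
    channel_feats: (N, C) per-point features; coords: (N, 3) xyz."""
    mag = channel_feats.abs().mean(dim=0)                 # (C,)
    f = channel_feats - channel_feats.mean(0, keepdim=True)
    c = coords - coords.mean(0, keepdim=True)
    denom = f.std(0).unsqueeze(1) * c.std(0) * f.shape[0] + eps
    corr = (f.T @ c) / denom                              # (C, 3)
    return mag * corr.abs().max(dim=1).values             # (C,)
```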
Recently, Visual Information Extraction (VIE) has become increasingly important in both academia and industry, due to the wide range of real-world applications. Numerous methods have previously been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in these benchmarks. As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common. All these factors may lead to failures in information extraction. Therefore, as the second contribution, we explore an alternative approach to precisely and robustly extract key information from document images under such tough conditions. Specifically, in contrast to previous methods, which usually either incorporate visual information into a multi-modal architecture or train text spotting and information extraction in an end-to-end fashion, we explicitly model entities as semantic points, i.e., the center points of entities are enriched with semantic information describing the attributes and relationships of different entities, which can largely benefit entity labeling and linking. Extensive experiments on standard benchmarks in this field as well as the proposed dataset demonstrate that the proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models. The dataset is available at this https URL.
https://arxiv.org/abs/2303.13095
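One way to picture "entities as semantic points" is a center point enriched with attributes and links to related entities; the data structure below is purely illustrative of the representation, not the paper's actual encoding:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticPoint:
    """An entity as a semantic point: its center location plus semantic
    attributes and relationships. Field names are hypothetical."""
    x: float                    # center point of the entity in the image
    y: float
    label: str                  # entity category, e.g. a key or value field
    text: str = ""              # transcribed content of the entity
    links: list = field(default_factory=list)  # indices of linked entities
```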
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.
https://arxiv.org/abs/2303.13040
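Since PCL is described as a straightforward pre-processing technique, it can be sketched as a thin loop over instance crops with any off-the-shelf captioner; the cropping interface and the absence of prompt handling here are assumptions:

```python
def pseudo_caption_labels(images, boxes, captioner):
    """Generate pseudo caption labels for knowledge distillation: crop each
    annotated object instance and let an image-captioning model describe
    it. `captioner` is any image -> caption callable (assumed); `images`
    are PIL images and `boxes` are (left, top, right, bottom) tuples."""
    labels = []
    for img, instance_boxes in zip(images, boxes):
        # One dense pseudo caption per object instance in the image.
        labels.append([captioner(img.crop(box)) for box in instance_boxes])
    return labels
```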
The ever-increasing demands for intuitive interactions in Virtual Reality have triggered a boom in the realm of Facial Expression Recognition (FER). To address the limitations in existing approaches (e.g., narrow receptive fields and homogeneous supervisory signals) and further cement the capacity of FER tools, a novel multifarious supervision-steering Transformer for FER in the wild is proposed in this paper. Referred to as FER-former, our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision. Specifically, to exploit the merits of combining the features provided by prevailing CNNs and Transformers, a hybrid stem is designed to cascade the two learning paradigms simultaneously. Within it, a FER-specific transformer mechanism is devised to process conventional hard one-hot label-focused tokens and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domain-steering supervision module is proposed to endow image features with text-space semantic correlations by supervising the similarity between image features and text features. On top of the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over the existing state-of-the-art methods.
https://arxiv.org/abs/2303.12997
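The text-space supervision can be read as an InfoNCE-style similarity loss between image features and class text embeddings (e.g., from CLIP); this is one plausible realization, with the temperature and normalization as assumptions:

```python
import torch
import torch.nn.functional as F

def text_space_supervision(img_feats, class_text_feats, labels, tau=0.07):
    """Pull each image feature toward its class's text embedding by
    supervising image-text cosine similarities with cross-entropy.
    img_feats: (B, d); class_text_feats: (num_classes, d); labels: (B,)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(class_text_feats, dim=-1)
    logits = img @ txt.T / tau   # (B, num_classes) similarity logits
    return F.cross_entropy(logits, labels)
```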
Compliant grippers, owing to their adaptivity and safety, have attracted considerable attention for unstructured grasping in real applications, such as industrial or logistics scenarios. However, accurately constructing a mathematical model of the bidirectional relationship between shape deformation and contact force for such grippers, e.g., the Fin-Ray grippers, has remained an open problem to date. To address this research gap, this article devises, presents, and experimentally validates a universal bidirectional force-displacement mathematical model for compliant grippers based on the co-rotational concept, which endows such grippers with an intrinsic force sensing capability and offers better insight into design optimization. In Part 1 of the article, we introduce the fundamental theory of the co-rotational approach, with which arbitrarily large deformations of beam elements can be modeled. Its intrinsic principle enables the theoretical modeling to consider various types of configurations and key design parameters with very few assumptions made. Further, a force control algorithm is proposed, providing accurate displacement estimations of the gripper under external forces with minor computational load. The performance of the proposed method is experimentally verified through comparison with Finite Element Analysis, where the influence of four key design parameters on the gripper's performance is investigated, facilitating systematic design optimization. Part 2 of this article, which demonstrates the force sensing capabilities and the effects of representative co-rotational modeling parameters on model accuracy, is released on Google Drive.
https://arxiv.org/abs/2303.12987
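The core co-rotational idea (split an element's motion into a rigid rotation of its frame plus small local deformations measured in that rotated frame) can be sketched in 2D as follows; this toy decomposition extracts only the rigid rotation and axial stretch, whereas the paper's model handles full beam kinematics:

```python
import numpy as np

def corotational_split_2d(x0_a, x0_b, x_a, x_b):
    """Toy 2D co-rotational decomposition for one beam element: the rigid
    rotation is the change in orientation of the element chord, and the
    axial elongation is the local deformation measured in the rotated
    frame. End-node bending rotations, handled in the full model, are
    omitted here. x0_*: undeformed node positions; x_*: deformed ones."""
    v0, v = x0_b - x0_a, x_b - x_a
    L0, L = np.linalg.norm(v0), np.linalg.norm(v)
    rigid_rotation = np.arctan2(v[1], v[0]) - np.arctan2(v0[1], v0[0])
    elongation = L - L0   # local axial deformation
    return rigid_rotation, elongation
```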
Electronic medical records (EMRs) are stored in relational databases. It can be challenging to access the required information if the user is unfamiliar with the database schema or general database fundamentals. Hence, researchers have explored text-to-SQL generation methods that provide healthcare professionals direct access to EMR data without needing a database expert. However, currently available datasets have been essentially "solved", with state-of-the-art models achieving accuracy near or above 90%. In this paper, we show that there is still a long way to go before solving text-to-SQL generation in the medical domain. To show this, we create new splits of the existing medical text-to-SQL dataset MIMICSQL that better measure the generalizability of the resulting models. We evaluate state-of-the-art language models on our new splits, showing substantial drops in performance, with accuracy falling from as high as 92% to 28%, thus showing substantial room for improvement. Moreover, we introduce a novel data augmentation approach to improve the generalizability of the language models. Overall, this paper is the first step towards developing more robust text-to-SQL models in the medical domain. (The dataset and code will be released upon acceptance.)
https://arxiv.org/abs/2303.12898
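A sketch of the kind of split that measures generalizability: mask literals to obtain each query's SQL template, then keep every template entirely in train or entirely in test so that test-time templates are unseen. The masking regex and the split ratio are assumptions, not the paper's exact protocol:

```python
import random
import re
from collections import defaultdict

def template_split(examples, test_frac=0.25, seed=13):
    """Group question/SQL pairs by their SQL template (string and numeric
    literals masked out) and assign whole templates to train or test, so
    the test set contains only unseen templates. Each example is a dict
    with an "sql" key (assumed format)."""
    groups = defaultdict(list)
    for ex in examples:
        tpl = re.sub(r"(\"[^\"]*\"|'[^']*'|\b\d+(\.\d+)?\b)", "<VAL>", ex["sql"])
        groups[tpl].append(ex)
    templates = sorted(groups)
    random.Random(seed).shuffle(templates)
    test_tpls = set(templates[:int(len(templates) * test_frac)])
    train = [e for t in templates if t not in test_tpls for e in groups[t]]
    test = [e for t in test_tpls for e in groups[t]]
    return train, test
```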