Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +4.2% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified framework for multi-modal VOS. Code is released at this https URL.
https://arxiv.org/abs/2305.16318
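To make the low-level temporal aggregation idea above concrete, here is a minimal PyTorch sketch in which text or audio reference tokens cross-attend to visual tokens gathered from several consecutive frames; the module name, shapes, and hyperparameters are illustrative assumptions, not MUTR's released code.

```python
# Illustrative sketch (not the authors' code): reference tokens attending to
# multi-frame visual features before a DETR-style decoder, as described above.
import torch
import torch.nn as nn

class ReferenceTemporalAggregation(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, frame_feats):
        # ref_tokens:  (B, L, D)    text or audio reference embeddings
        # frame_feats: (B, T, N, D) visual tokens from T consecutive frames
        B, T, N, D = frame_feats.shape
        visual = frame_feats.reshape(B, T * N, D)      # flatten time into one memory
        updated, _ = self.attn(ref_tokens, visual, visual)
        return self.norm(ref_tokens + updated)         # references now carry temporal cues

# Toy usage with hypothetical sizes
agg = ReferenceTemporalAggregation()
refs = torch.randn(2, 10, 256)          # e.g. 10 text tokens
frames = torch.randn(2, 5, 196, 256)    # 5 frames of 14x14 visual tokens
print(agg(refs, frames).shape)          # torch.Size([2, 10, 256])
```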
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure whose distribution will affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
https://arxiv.org/abs/2305.16315
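As a rough illustration of the diffusion-denoising part of the pipeline (not NAP's actual graph parameterization or graph-attention network), DDPM-style forward and reverse steps over a flattened articulation-graph tensor look roughly like this; the denoiser and noise schedule below are placeholders.

```python
# Schematic DDPM-style step over a flattened articulation-graph tensor.
# The graph parameterization and the denoiser below are placeholders, not NAP's.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Stand-in for the graph-attention denoising network (it would also take t).
denoiser = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))

def q_sample(x0, t, noise):
    # Forward process: corrupt clean graph parameters x0 at step t.
    ab = alpha_bar[t].view(-1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def p_step(xt, t):
    # One reverse (denoising) step using the predicted noise.
    eps = denoiser(xt)
    ab, a, b = alpha_bar[t], alphas[t], betas[t]
    mean = (xt - b / (1 - ab).sqrt() * eps) / a.sqrt()
    return mean if t == 0 else mean + b.sqrt() * torch.randn_like(xt)

# Toy usage: 8 "graphs", each flattened to 64 joint/geometry parameters.
x0 = torch.randn(8, 64)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
sample = p_step(torch.randn(8, 64), T - 1)   # start of generation from pure noise
print(sample.shape)                          # torch.Size([8, 64])
```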
Equivariance has gained strong interest as a desirable network property that inherently ensures robust generalization. However, when dealing with complex systems such as articulated objects or multi-object scenes, effectively capturing inter-part transformations poses a challenge, as it becomes entangled with the overall structure and local transformations. The interdependence of part assignment and per-part group action necessitates a novel equivariance formulation that allows for their co-evolution. In this paper, we present Banana, a Banach fixed-point network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem, where point-part assignment labels and per-part SE(3)-equivariance co-evolve simultaneously. We provide theoretical derivations of both per-step equivariance and global convergence, which induces an equivariant final convergent state. Our formulation naturally provides a strict definition of inter-part equivariance that generalizes to unseen inter-part configurations. Through experiments conducted on both articulated objects and multi-object scans, we demonstrate the efficacy of our approach in achieving strong generalization under inter-part transformations, even when confronted with substantial changes in pointcloud geometry and topology.
https://arxiv.org/abs/2305.16314
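The fixed-point structure can be illustrated with a heavily simplified sketch in which soft point-to-part assignments and per-part states (reduced here to centroids rather than full SE(3) frames) are updated alternately until they stop changing; this shows only the iteration skeleton, not Banana's equivariant construction.

```python
# Simplified fixed-point sketch: point-part assignments and per-part states
# co-evolve until convergence. Per-part "frames" are reduced to centroids.
import numpy as np

def fixed_point_segment(points, init_idx, iters=50, temp=1.0):
    centroids = points[init_idx]                      # (k, 3) toy per-part states
    assign = None
    for _ in range(iters):
        # Assignments from distances to the current per-part states
        d2 = ((points[:, None, :] - centroids[None]) ** 2).sum(-1)   # (N, k)
        new_assign = np.exp(-d2 / temp)
        new_assign /= new_assign.sum(1, keepdims=True)
        # Per-part states from the new assignments
        new_centroids = (new_assign.T @ points) / (new_assign.sum(0)[:, None] + 1e-9)
        if assign is not None and np.abs(new_assign - assign).max() < 1e-6:
            break                                     # reached the fixed point
        assign, centroids = new_assign, new_centroids
    return assign.argmax(1)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(-3, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
labels = fixed_point_segment(pts, init_idx=[0, -1])   # seed one "part" in each blob
print(np.bincount(labels))                            # two balanced parts, e.g. [100 100]
```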
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
https://arxiv.org/abs/2305.16304
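A minimal sketch of the two-stage pipeline, with a toy triplet scorer standing in for the vision-and-language pre-trained re-ranker, might look as follows.

```python
# Two-stage retrieval sketch: fast vector pruning, then joint re-ranking.
# The embeddings and scorer are stand-ins; in the paper both stages build on a
# vision-and-language pre-trained network.
import torch
import torch.nn.functional as F

def stage1_prune(query_emb, candidate_embs, keep=50):
    # query_emb: (D,) reference image embedding modified by the query text
    # candidate_embs: (M, D) pre-computed corpus embeddings
    sims = F.cosine_similarity(query_emb[None], candidate_embs)        # (M,)
    return sims.topk(min(keep, len(candidate_embs))).indices           # shortlist

def stage2_rerank(triplet_scorer, ref_img, text, candidates, shortlist):
    # Score each (reference image, text, candidate) triplet jointly.
    scores = torch.stack([triplet_scorer(ref_img, text, candidates[i]) for i in shortlist])
    return shortlist[scores.argsort(descending=True)]                  # re-ranked shortlist

# Toy usage with random features and a dot-product stand-in scorer.
D, M = 128, 1000
corpus = torch.randn(M, D)
query = torch.randn(D)
ref_img, text = torch.randn(D), torch.randn(D)
scorer = lambda r, t, c: ((r + t) * c).sum()
shortlist = stage1_prune(query, corpus, keep=50)
ranking = stage2_rerank(scorer, ref_img, text, corpus, shortlist)
print(ranking[:5])
```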
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: \url{https://me.kiui.moe/san/}.
https://arxiv.org/abs/2305.16233
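The core trick of rendering semantic features rather than colors can be sketched as alpha-compositing per-sample feature vectors with the usual volume-rendering weights and handing the rendered feature map to a frozen decoder; the shapes below are hypothetical and the decoder is only referenced, not implemented.

```python
# Sketch of rendering semantic features with standard volume-rendering weights,
# then applying only a (frozen) perception-model decoder. Shapes are illustrative.
import torch

def composite_features(sigmas, feats, deltas):
    # sigmas: (R, S) densities, feats: (R, S, C) per-sample semantic features,
    # deltas: (R, S) distances between consecutive samples along each ray.
    alphas = 1.0 - torch.exp(-sigmas * deltas)                           # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=1), dim=1
    )[:, :-1]                                                            # accumulated transmittance
    weights = alphas * trans                                             # (R, S)
    return (weights[..., None] * feats).sum(dim=1)                       # (R, C) rendered features

R, S, C = 4096, 64, 256
rendered = composite_features(torch.rand(R, S), torch.randn(R, S, C), torch.full((R, S), 0.01))
# The frozen decoder of an off-the-shelf model such as SAM would then consume the
# rendered feature map, replacing a per-view run of its expensive image backbone.
print(rendered.shape)   # torch.Size([4096, 256])
```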
Multi-agent routing problems have drawn significant attention nowadays due to their broad industrial applications in, e.g., warehouse robots, logistics automation, and traffic control. Conventionally, they are modelled as classical planning problems. In this paper, we argue that it is beneficial to formulate them as universal planning problems. We therefore propose universal plans, also known as policies, as the solution concepts, and implement a system called ASP-MAUPF (Answer Set Programming for Multi-Agent Universal Plan Finding) for computing them. Given an arbitrary two-dimensional map and a profile of goals for the agents, the system finds a feasible universal plan for each agent that ensures no collision with others. We use the system to conduct some experiments, and make some observations on the types of goal profiles and environments that will have feasible policies, and how they may depend on agents' sensors. We also demonstrate how users can customize action preferences to compute more efficient policies, even (near-)optimal ones.
https://arxiv.org/abs/2305.16203
Latent Graph Inference (LGI) relaxed the reliance of Graph Neural Networks (GNNs) on a given graph topology by learning it dynamically. However, most LGI methods assume access to a (noisy, incomplete, improvable, ...) input graph to rewire and can only learn regular graph topologies. In the wake of the success of Topological Deep Learning (TDL), we study Latent Topology Inference (LTI) for learning higher-order cell complexes (with sparse, non-regular topology) that describe multi-way interactions between data points. To this aim, we introduce the Differentiable Cell Complex Module (DCM), a novel learnable function that computes cell probabilities in the complex to improve the downstream task. We show how to integrate DCM with cell complex message-passing network layers and train it in an end-to-end fashion, thanks to a two-step inference procedure that avoids an exhaustive search across all possible cells in the input, thus maintaining scalability. Our model is tested on several homophilic and heterophilic graph datasets and is shown to outperform other state-of-the-art techniques, offering significant improvements especially in cases where an input graph is not provided.
https://arxiv.org/abs/2305.16174
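A rough sketch of the two-step inference, with placeholder scoring MLPs standing in for the learnable DCM: edges are scored and sparsified first, and higher-order cells (triangles in this toy version) are only considered among the retained edges, avoiding exhaustive enumeration over all possible cells.

```python
# Rough sketch of two-step latent-topology inference: score edges first, then
# consider higher-order cells (triangles here) only over the retained edges.
# The scoring MLPs are placeholders for the learnable module described above.
import itertools
import torch
import torch.nn as nn

edge_mlp = nn.Sequential(nn.Linear(2 * 16, 32), nn.ReLU(), nn.Linear(32, 1))
tri_mlp = nn.Sequential(nn.Linear(3 * 16, 32), nn.ReLU(), nn.Linear(32, 1))

def infer_cells(x, edge_keep=0.3, tri_thresh=0.5):
    n = x.shape[0]
    pairs = list(itertools.combinations(range(n), 2))
    e_feat = torch.stack([torch.cat([x[i], x[j]]) for i, j in pairs])
    e_prob = torch.sigmoid(edge_mlp(e_feat)).squeeze(-1)
    k = max(1, int(edge_keep * len(pairs)))
    kept = {pairs[i] for i in e_prob.topk(k).indices.tolist()}        # sparse edge set

    # Step 2: only triangles whose three edges survived step 1 are scored.
    triangles = [t for t in itertools.combinations(range(n), 3)
                 if all(e in kept for e in itertools.combinations(t, 2))]
    if not triangles:
        return kept, []
    t_feat = torch.stack([torch.cat([x[i], x[j], x[m]]) for i, j, m in triangles])
    t_prob = torch.sigmoid(tri_mlp(t_feat)).squeeze(-1)
    return kept, [t for t, p in zip(triangles, t_prob) if p > tri_thresh]

edges, tris = infer_cells(torch.randn(12, 16))
print(len(edges), len(tris))
```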
Multimodal relation extraction (MRE) is the task of identifying the semantic relationships between two entities based on the context of the sentence image pair. Existing retrieval-augmented approaches mainly focused on modeling the retrieved textual knowledge, but this may not be able to accurately identify complex relations. To improve the prediction, this research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image. We further develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities. Extensive experiments and analyses show that the proposed method is able to effectively select and compare evidence across modalities and significantly outperforms state-of-the-art models.
https://arxiv.org/abs/2305.16166
As the deployment of pre-trained language models (PLMs) expands, pressing security concerns have arisen regarding the potential for malicious extraction of training data, posing a threat to data privacy. This study is the first to provide a comprehensive survey of training data extraction from PLMs. Our review covers more than 100 key papers in fields such as natural language processing and security. First, preliminary knowledge is recapped and a taxonomy of various definitions of memorization is presented. The approaches for attack and defense are then systemized. Furthermore, the empirical findings of several quantitative studies are highlighted. Finally, future research directions based on this review are suggested.
https://arxiv.org/abs/2305.16157
This paper presents a novel design for a Variable Stiffness 3 DoF actuated wrist to improve task adaptability and safety during interactions with people and objects. The proposed design employs a hybrid serial-parallel configuration to achieve a 3 DoF wrist joint which can actively and continuously vary its overall stiffness thanks to the redundant elastic actuation system, using only four motors. Its stiffness control principle is similar to human muscular impedance regulation, with the shape of the stiffness ellipsoid mostly depending on posture, while the elastic cocontraction modulates its overall size. The employed mechanical configuration achieves a compact and lightweight device that, thanks to its anthropomorphous characteristics, could be suitable for prostheses and humanoid robots. After introducing the design concept of the device, this work provides methods to estimate the posture of the wrist by using joint angle measurements and to modulate its stiffness. Thereafter, this paper describes the first physical implementation of the presented design, detailing the mechanical prototype and electronic hardware, the control architecture, and the associated firmware. The reported experimental results show the potential of the proposed device while highlighting some limitations. To conclude, we show the motion and stiffness behavior of the device with some qualitative experiments.
https://arxiv.org/abs/2305.16154
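The stated stiffness principle can be illustrated with the textbook mapping from joint stiffness to Cartesian stiffness through the manipulator Jacobian, K_x = (J K_q^{-1} J^T)^{-1}: posture shapes the stiffness ellipsoid, while uniformly scaling the joint stiffness (akin to co-contraction) scales its overall size. The toy two-link planar example below is a generic illustration, not the model of this device.

```python
# Textbook illustration (not the device's model): posture sets the shape of the
# Cartesian stiffness ellipsoid, K_x = (J K_q^{-1} J^T)^{-1}, while scaling the
# joint stiffness (co-contraction-like) scales its overall size.
import numpy as np

def planar_2link_jacobian(q1, q2, l1=0.3, l2=0.25):
    return np.array([
        [-l1 * np.sin(q1) - l2 * np.sin(q1 + q2), -l2 * np.sin(q1 + q2)],
        [ l1 * np.cos(q1) + l2 * np.cos(q1 + q2),  l2 * np.cos(q1 + q2)],
    ])

def cartesian_stiffness(J, k_joint):
    Kq = np.diag(k_joint)                              # joint-space stiffness
    return np.linalg.inv(J @ np.linalg.inv(Kq) @ J.T)

J = planar_2link_jacobian(0.4, 1.2)
for scale in (1.0, 2.0):                               # co-contraction-like scaling
    Kx = cartesian_stiffness(J, scale * np.array([20.0, 15.0]))
    print(f"scale {scale}: principal stiffnesses (N/m) = {np.linalg.eigvalsh(Kx).round(1)}")
```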
Automated planning is concerned with developing efficient algorithms to generate plans or sequences of actions to achieve a specific goal in a given environment. Emerging Large Language Models (LLMs) can answer questions, write high-quality programming code, and predict protein folding, showcasing their versatility in solving various tasks beyond language-based problems. In this paper, we aim to explore how LLMs can also be used for automated planning. To do so, we seek to answer four key questions. Firstly, we want to understand the extent to which LLMs can be used for plan generation. Secondly, we aim to identify which pre-training data is most effective in facilitating plan generation. Thirdly, we investigate whether fine-tuning or prompting is a more effective approach for plan generation. Finally, we explore whether LLMs are capable of plan generalization. By answering these questions, the study seeks to shed light on the capabilities of LLMs in solving complex planning problems and provide insights into the most effective approaches for using LLMs in this context.
https://arxiv.org/abs/2305.16151
Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions by differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper, we challenge this interpretation and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.
https://arxiv.org/abs/2305.16150
Face swapping combines one face's identity with another face's non-appearance attributes (expression, head pose, lighting) to generate a synthetic face. This technology is rapidly improving, but falls flat when reconstructing some attributes, particularly gaze. Image-based loss metrics that consider the full face do not effectively capture the perceptually important, yet spatially small, eye regions. Improving gaze in face swaps can improve naturalness and realism, benefiting applications in entertainment, human computer interaction, and more. Improved gaze will also directly improve Deepfake detection efforts, serving as ideal training data for classifiers that rely on gaze for classification. We propose a novel loss function that leverages gaze prediction to inform the face swap model during training and compare against existing methods. We find all methods to significantly benefit gaze in resulting face swaps.
https://arxiv.org/abs/2305.16138
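A hedged sketch of what such a gaze-informed training loss can look like: an ordinary reconstruction term plus a penalty on the disagreement between a frozen gaze predictor's outputs on the swapped and target faces. The gaze network and weighting below are placeholders, not the paper's exact formulation.

```python
# Sketch of a gaze-informed training loss: standard reconstruction term plus a
# penalty on gaze disagreement measured by a frozen gaze predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeInformedLoss(nn.Module):
    def __init__(self, gaze_net, gaze_weight=0.5):
        super().__init__()
        self.gaze_net = gaze_net.eval()
        for p in self.gaze_net.parameters():
            p.requires_grad_(False)                     # gaze predictor stays frozen
        self.w = gaze_weight

    def forward(self, swapped, target):
        recon = F.l1_loss(swapped, target)              # usual image-space term
        gaze_pred = self.gaze_net(swapped)              # (B, 2) yaw/pitch, say
        with torch.no_grad():
            gaze_ref = self.gaze_net(target)
        return recon + self.w * F.mse_loss(gaze_pred, gaze_ref)

# Toy usage with a placeholder "gaze network".
gaze_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
criterion = GazeInformedLoss(gaze_net)
loss = criterion(torch.rand(4, 3, 64, 64, requires_grad=True), torch.rand(4, 3, 64, 64))
loss.backward()
```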
In this technical report, we describe our Guided-Attention-based solution for the short-term anticipation (STA) task in the EGO4D challenge. It combines object detections and spatiotemporal features extracted from video clips, enhances the motion and contextual information, and further decodes the object-centric and motion-centric information to address the problem of STA in egocentric videos. For the challenge, we build our model on top of StillFast, with Guided Attention applied to the fast network. Our model obtains better performance on the validation set and also achieves state-of-the-art (SOTA) results on the test set of the EGO4D Short-Term Object Interaction Anticipation Challenge.
https://arxiv.org/abs/2305.16066
Blurry images usually exhibit similar blur at various locations across the image domain, a property barely captured by today's blind deblurring neural networks. We show that when extracting patches with similar underlying blur is possible, jointly processing the stack of patches yields higher accuracy than handling them separately. Our collaborative scheme is implemented in a neural architecture with a pooling layer on the stack dimension. We present three practical patch extraction strategies for image sharpening, camera shake removal and optical aberration correction, and validate the proposed approach on both synthetic and real-world benchmarks. For each blur instance, the proposed collaborative strategy yields significant quantitative and qualitative improvements.
https://arxiv.org/abs/2305.16034
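A minimal sketch of the collaborative scheme: patches believed to share the same underlying blur are encoded individually, pooled over the stack dimension, and decoded jointly so each patch also sees the shared statistics; the layers are tiny placeholders rather than the paper's architecture.

```python
# Minimal sketch of joint processing of a stack of same-blur patches: per-patch
# encoding, pooling over the stack dimension, then per-patch decoding that also
# sees the pooled (shared) features.
import torch
import torch.nn as nn

class CollaborativeDeblur(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.decode = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, patches):
        # patches: (B, K, 3, H, W) -- K patches sharing (approximately) one blur
        B, K, C, H, W = patches.shape
        f = self.encode(patches.reshape(B * K, C, H, W)).reshape(B, K, -1, H, W)
        shared = f.max(dim=1, keepdim=True).values.expand_as(f)   # pool over the stack
        out = self.decode(torch.cat([f, shared], dim=2).reshape(B * K, -1, H, W))
        return patches + out.reshape(B, K, C, H, W)               # residual sharpening

model = CollaborativeDeblur()
print(model(torch.rand(2, 8, 3, 64, 64)).shape)   # torch.Size([2, 8, 3, 64, 64])
```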
Convolutional neural networks (CNN) and Transformer variants have emerged as the leading medical image segmentation backbones. Nonetheless, due to their limitations in either preserving global image context or efficiently processing irregular shapes in visual objects, these backbones struggle to effectively integrate information from diverse anatomical regions and reduce inter-individual variability, particularly for the vasculature. Motivated by the successful breakthroughs of graph neural networks (GNN) in capturing topological properties and non-Euclidean relationships across various fields, we propose NexToU, a novel hybrid architecture for medical image segmentation. NexToU comprises improved Pool GNN and Swin GNN modules from Vision GNN (ViG) for learning both global and local topological representations while minimizing computational costs. To address the containment and exclusion relationships among various anatomical structures, we reformulate the topological interaction (TI) module based on the nature of binary trees, rapidly encoding the topological constraints into NexToU. Extensive experiments conducted on three datasets (including distinct imaging dimensions, disease types, and imaging modalities) demonstrate that our method consistently outperforms other state-of-the-art (SOTA) architectures. All the code is publicly available at this https URL.
https://arxiv.org/abs/2305.15911
Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, most existing handcrafted feature extraction libraries are not open-source or not actively maintained. As a result, a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system for public access to a rich set of pre-implemented handcrafted features. Our system is coined LFTK and is the largest of its kind. Find it at this http URL.
https://arxiv.org/abs/2305.15878
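To give a flavor of such handcrafted features, here are a few classic ones computed in plain Python (sentence length, type-token ratio, function-word ratio); these are generic illustrations, not LFTK's implementation or feature naming.

```python
# A few classic handcrafted linguistic features in plain Python. These are
# generic illustrations, not LFTK's own implementation or feature names.
import re

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are"}

def handcrafted_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "n_tokens": len(tokens),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "avg_word_length": sum(map(len, tokens)) / max(len(tokens), 1),
        "function_word_ratio": sum(t in FUNCTION_WORDS for t in tokens) / max(len(tokens), 1),
    }

print(handcrafted_features("The cat sat on the mat. It was a very quiet afternoon."))
```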
Semi-supervised learning has been an important approach to address challenges in extracting entities and relations from limited data. However, current semi-supervised works handle the two tasks (i.e., Named Entity Recognition and Relation Extraction) separately and ignore the cross-correlation of entity and relation instances as well as the existence of similar instances across unlabeled data. To alleviate the issues, we propose Jointprop, a Heterogeneous Graph-based Propagation framework for joint semi-supervised entity and relation extraction, which captures the global structure information between individual tasks and exploits interactions within unlabeled data. Specifically, we construct a unified span-based heterogeneous graph from entity and relation candidates and propagate class labels based on confidence scores. We then employ a propagation learning scheme to leverage the affinities between labelled and unlabeled samples. Experiments on benchmark datasets show that our framework outperforms the state-of-the-art semi-supervised approaches on NER and RE tasks. We show that the joint semi-supervised learning of the two tasks benefits from their codependency and validates the importance of utilizing the shared information between unlabeled data.
https://arxiv.org/abs/2305.15872
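A small sketch of confidence-weighted label propagation over a similarity graph of labeled and unlabeled span embeddings; the heterogeneous span graph of the paper is reduced here to a plain cosine-similarity kNN graph, and the embeddings are synthetic.

```python
# Simplified label-propagation sketch over a similarity graph of labeled and
# unlabeled span embeddings (a stand-in for the heterogeneous span graph).
import numpy as np

def propagate(emb, labels, n_classes, iters=20, k=5, alpha=0.9):
    # emb: (N, D); labels[i] = class id for labeled spans, -1 for unlabeled.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T / (norms @ norms.T)
    np.fill_diagonal(sims, -np.inf)
    W = np.zeros_like(sims)
    for i in range(len(emb)):                                  # sparse kNN affinities
        nbrs = np.argsort(sims[i])[-k:]
        W[i, nbrs] = np.maximum(sims[i, nbrs], 0)
    W = W / W.sum(1, keepdims=True).clip(1e-9)

    Y = np.zeros((len(emb), n_classes))
    Y[labels >= 0, labels[labels >= 0]] = 1.0                  # one-hot seeds
    F_mat = Y.copy()
    for _ in range(iters):
        F_mat = alpha * W @ F_mat + (1 - alpha) * Y            # propagate, keep seed confidence
    return F_mat.argmax(1), F_mat.max(1)                       # pseudo-labels + confidence

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-4, 1, (20, 8)), rng.normal(4, 1, (20, 8))])
labels = -np.ones(40, dtype=int)
labels[0], labels[20] = 0, 1                                   # one seed per class
pseudo, conf = propagate(emb, labels, n_classes=2)
print(pseudo[:5], pseudo[20:25], conf.round(2)[:3])
```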
Generating and editing a 3D scene guided by natural language poses a challenge, primarily due to the complexity of specifying the positional relations and volumetric changes within the 3D space. Recent advancements in Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities across various domains. Surprisingly, these models also show great potential in realizing and interpreting the 3D space. In light of this, we propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter into the off-the-shelf layout-to-3D generative models, allowing users to flexibly and interactively generate visual content. Specifically, we design a versatile layout structure based on the bounding boxes and semantics to prompt the LLMs to model the spatial generation and reasoning from language. Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content. We validate the effectiveness of LI3D, primarily in 3D generation and editing through multi-round interactions, which can be flexibly extended to 2D generation and editing. Various experiments demonstrate the potential benefits of incorporating LLMs in generative AI for applications, e.g., metaverse. Moreover, we benchmark the layout reasoning performance of LLMs with neural visual artist tasks, revealing their emergent ability in the spatial layout domain.
https://arxiv.org/abs/2305.15808
Attention mechanisms have greatly improved the performance of deep-learning models on visual, NLP, and multimodal tasks while also providing tools to aid in the model's interpretability. In particular, attention scores over input regions or concrete image features can be used to measure how much the attended elements contribute to the model inference. The recently proposed Concept Transformer (CT) generalizes the Transformer attention mechanism from such low-level input features to more abstract, intermediate-level latent concepts that better allow human analysts to more directly assess an explanation for the reasoning of the model about any particular output classification. However, the concept learning employed by CT implicitly assumes that across every image in a class, each image patch makes the same contribution to concepts that characterize membership in that class. Instead of using the CT's image-patch-centric concepts, object-centric concepts could lead to better classification performance as well as better explainability. Thus, we propose Concept-Centric Transformers (CCT), a new family of concept transformers that provides more robust explanations and performance by integrating a novel concept-extraction module based on object-centric learning. We test our proposed CCT against the CT and several other existing approaches on classification problems for MNIST (odd/even), CIFAR100 (super-classes), and CUB-200-2011 (bird species). Our experiments demonstrate that CCT not only achieves significantly better classification accuracy than all selected benchmark classifiers across all three of our test problems, but it generates more consistent concept-based explanations of classification output when compared to CT.
https://arxiv.org/abs/2305.15775
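A rough sketch of classification mediated by a small set of learned concept embeddings: tokens cross-attend to the concepts, pooled attention weights act as concept scores, the class logits come from those scores, and the same scores double as the explanation. This simplifies CCT's object-centric concept extraction considerably; all names and sizes are illustrative.

```python
# Rough sketch of concept-mediated classification: tokens attend to learned
# concept embeddings, pooled concept scores produce the class logits, and the
# same scores serve as the explanation. Simplified relative to CCT.
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    def __init__(self, dim=64, n_concepts=10, n_classes=2):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(n_concepts, n_classes)   # classify from concept scores only

    def forward(self, tokens):
        # tokens: (B, N, D) patch or object-slot features
        B = tokens.shape[0]
        concepts = self.concepts.unsqueeze(0).expand(B, -1, -1)
        _, attn_w = self.attn(tokens, concepts, concepts, need_weights=True)  # (B, N, K)
        concept_scores = attn_w.mean(dim=1)            # (B, K): how much each concept is used
        return self.head(concept_scores), concept_scores

model = ConceptClassifier()
logits, explanation = model(torch.randn(8, 49, 64))
print(logits.shape, explanation.shape)   # torch.Size([8, 2]) torch.Size([8, 10])
```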