Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging, as it requires exploring both the semantic alignment between modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either a text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +4.2% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified framework for multi-modal VOS. Code is released at this https URL.
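To make the first strategy concrete, here is a minimal, hypothetical sketch (not the authors' released code) of low-level temporal aggregation: reference tokens from text or audio cross-attend to visual tokens pooled from several consecutive frames, so the reference carries temporal visual context before entering a DETR-style decoder. The module name, tensor shapes, and the single-scale simplification are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TemporalReferenceAggregation(nn.Module):
    """Hypothetical module: text/audio reference tokens attend to multi-frame visual tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, frame_feats):
        # ref_tokens:  (B, L, C)    reference embeddings from a text/audio encoder
        # frame_feats: (B, T, N, C) visual tokens from T consecutive frames
        B, T, N, C = frame_feats.shape
        visual = frame_feats.reshape(B, T * N, C)           # pool tokens across time
        attended, _ = self.cross_attn(ref_tokens, visual, visual)
        return self.norm(ref_tokens + attended)             # temporally enriched reference

refs = torch.randn(2, 20, 256)        # e.g. 20 text tokens
frames = torch.randn(2, 5, 196, 256)  # 5 frames x 196 visual tokens each
print(TemporalReferenceAggregation()(refs, frames).shape)  # torch.Size([2, 20, 256])
```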
https://arxiv.org/abs/2305.16318
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
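As a rough illustration of what "100% shift consistency" means operationally, the snippet below (an assumed evaluation helper, not the authors' protocol) compares a classifier's predictions on an image batch and a circularly shifted copy of it.

```python
import torch

def shift_consistency(model, images, max_shift=8):
    """Fraction of samples whose predicted class is unchanged under a random circular shift."""
    model.eval()
    with torch.no_grad():
        dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        shifted = torch.roll(images, shifts=(dy, dx), dims=(-2, -1))
        pred = model(images).argmax(dim=-1)
        pred_shifted = model(shifted).argmax(dim=-1)
    return (pred == pred_shifted).float().mean().item()

# Toy usage with a placeholder classifier (a real ViT would be plugged in here).
toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(shift_consistency(toy, torch.randn(8, 3, 32, 32)))
```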
https://arxiv.org/abs/2305.16316
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation, where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure, whose distributions affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We also propose a new distance that adapts widely used 3D generation metrics to this task for evaluating generation quality, and experiments demonstrate the high performance of our method in articulated object generation. We further demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
https://arxiv.org/abs/2305.16315
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the ability to combine multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against multiple baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: this https URL
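A hedged sketch of the masked diffusion loss idea mentioned above: the denoising objective is evaluated only inside the mask associated with a concept, so the corresponding handle is optimized to reconstruct its own region. Tensor shapes and the exact weighting are assumptions for illustration, not the paper's implementation.

```python
import torch

def masked_diffusion_loss(noise_pred, noise_target, concept_mask):
    # noise_pred / noise_target: (B, C, H, W); concept_mask: (B, 1, H, W) in {0, 1}
    se = (noise_pred - noise_target) ** 2 * concept_mask   # error counted only inside the mask
    return se.sum() / concept_mask.sum().clamp(min=1.0)    # normalize by masked area

loss = masked_diffusion_loss(torch.randn(2, 4, 64, 64),
                             torch.randn(2, 4, 64, 64),
                             torch.randint(0, 2, (2, 1, 64, 64)).float())
```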
https://arxiv.org/abs/2305.16311
We propose a learning-based method to recover normals, specularity, and roughness from a single diffuse image of a material, using microgeometry appearance as our primary cue. Previous methods that work on single images tend to produce over-smooth outputs with artifacts, operate at limited resolution, or train one model per class with little room for generalization. In contrast, in this work, we propose a novel capture approach that leverages a generative network with attention and a U-Net discriminator, which shows outstanding performance in integrating global information at reduced computational complexity. We showcase the performance of our method on a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the type of diffuse illumination required as input to our method. Additionally, because the problem might be ill-posed (more than a single diffuse image might be needed to disambiguate the specular reflection) or because the training dataset is not representative enough of the real distribution, we propose a novel framework to quantify the model's confidence about its prediction at test time. Our method is the first to deal with the problem of modeling uncertainty in material digitization, increasing the trustworthiness of the process and enabling more intelligent strategies for dataset creation, as we demonstrate with an active learning experiment.
https://arxiv.org/abs/2305.16312
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
https://arxiv.org/abs/2305.16301
While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
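The following toy sketch illustrates the retrieval idea with landmark-style block summaries: a query first scores one representative vector per block and then attends only over the tokens of the selected blocks. It uses a mean-pooled stand-in for the trained landmark tokens and a single query vector, so it illustrates the mechanism rather than reproducing the paper's implementation.

```python
import torch
import torch.nn.functional as F

def landmark_block_attention(q, keys, values, block_size=64, top_k=4):
    # q: (d,); keys/values: (n, d) with n a multiple of block_size
    n, d = keys.shape
    k_blocks = keys.view(n // block_size, block_size, d)
    v_blocks = values.view(n // block_size, block_size, d)
    landmarks = k_blocks.mean(dim=1)                       # stand-in for trained landmark tokens
    block_scores = landmarks @ q / d ** 0.5                # score each block via its landmark
    top = block_scores.topk(min(top_k, landmarks.shape[0])).indices
    k_sel = k_blocks[top].reshape(-1, d)                   # retrieve only the selected blocks
    v_sel = v_blocks[top].reshape(-1, d)
    attn = F.softmax(k_sel @ q / d ** 0.5, dim=0)          # ordinary attention over retrieved tokens
    return attn @ v_sel

out = landmark_block_attention(torch.randn(32), torch.randn(256, 32), torch.randn(256, 32))
```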
https://arxiv.org/abs/2305.16300
Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a similar training procedure to the widely-used masked speech prediction based SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, allowing potential applications to tasks such as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance compared to WavLM, the state-of-the-art SSL model with denoising capability.
https://arxiv.org/abs/2305.16286
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have caught significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional addition of noise. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, roughly two orders of magnitude fewer than standard DDPMs require. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online \url{this https URL}
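A rough sketch of the sampling idea: unlike a standard DDPM, each reverse step here also doubles the spatial resolution, which is why a handful of steps can reach a 256x256 output. The denoiser interface, the nearest-neighbour upsampling, and the small re-noising term are placeholders assumed for illustration only.

```python
import torch
import torch.nn.functional as F

def udpm_sample(denoiser, steps=7, base=4, channels=3):
    x = torch.randn(1, channels, base, base)                # start from a tiny latent
    for t in reversed(range(steps)):
        x = denoiser(x, t)                                  # denoise at the current scale
        if t > 0:                                           # then grow the latent before the next step
            x = F.interpolate(x, scale_factor=2, mode="nearest")
            x = x + 0.1 * torch.randn_like(x)               # reinject a little noise
    return x

# Identity "denoiser" just to exercise the loop: with base=4 and 7 steps,
# the output is 4 * 2**6 = 256 pixels per side, i.e. shape (1, 3, 256, 256).
out = udpm_sample(lambda x, t: x)
print(out.shape)
```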
https://arxiv.org/abs/2305.16269
Semi-supervised medical image segmentation offers a promising solution for large-scale medical image analysis by significantly reducing the annotation burden while achieving comparable performance. This approach holds considerable potential for streamlining the segmentation process and increasing its feasibility in clinical settings during translational investigations. Recently, cross-supervised training based on different co-training sub-networks has become a standard paradigm for this task. Still, the critical issues of sub-network disagreement and label-noise suppression require further attention and progress in cross-supervised training. This paper proposes a cross-supervised learning framework based on dual classifiers (DC-Net), including an evidential classifier and a vanilla classifier. The two classifiers exhibit complementary characteristics, enabling them to handle disagreement effectively and generate more robust and accurate pseudo-labels for the unlabeled data. We also incorporate the uncertainty estimate from the evidential classifier into cross-supervised training to alleviate the negative effect of erroneous supervision signals. Extensive experiments on the LA and Pancreas-CT datasets illustrate that DC-Net outperforms other state-of-the-art methods for semi-supervised segmentation. The code will be released soon.
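To illustrate the evidential side of the dual-classifier design, the sketch below uses a standard Dirichlet (subjective-logic) formulation: non-negative evidence yields class probabilities and a per-pixel uncertainty, which can then down-weight the pseudo-labels passed to the other branch. This is a generic formulation assumed for illustration, not DC-Net's exact losses.

```python
import torch
import torch.nn.functional as F

def evidential_pseudo_labels(logits, num_classes):
    evidence = F.softplus(logits)                  # (B, K, H, W) non-negative evidence
    alpha = evidence + 1.0                         # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)      # total evidence per pixel
    prob = alpha / strength                        # expected class probabilities
    uncertainty = num_classes / strength           # in (0, 1]; high means unreliable
    pseudo = prob.argmax(dim=1)                    # pseudo-labels for the other classifier
    weight = (1.0 - uncertainty).squeeze(1)        # confidence weight for the supervision loss
    return pseudo, weight

logits = torch.randn(2, 4, 64, 64)
pseudo, weight = evidential_pseudo_labels(logits, num_classes=4)
```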
https://arxiv.org/abs/2305.16216
Many real-world decision-making tasks, such as safety-critical scenarios, cannot be fully described in a single-objective setting using the Markov Decision Process (MDP) framework, as they include hard constraints. These can instead be modeled with additional cost functions within the Constrained Markov Decision Process (CMDP) framework. Even though CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches use Monte Carlo cost estimates to avoid constraint violations. However, these suffer from high variance, which results in conservative performance with respect to costs. We propose Constrained MCTS (C-MCTS), an algorithm that estimates cost using a safety critic. The safety critic is trained with Temporal Difference learning in an offline phase prior to agent deployment. This critic limits the exploration of the search tree and removes unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards compared to previous work. As a nice byproduct, the planner is more efficient, requiring fewer planning steps. Most importantly, we show that under model mismatch between the planner and the real world, our approach is less susceptible to cost violations than previous work.
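A toy illustration of the pruning step: before a child node is added during MCTS expansion, a pre-trained safety critic estimates the additional cost, and the expansion is skipped if the budget would be exceeded. The node structure, critic, and environment interfaces are assumptions of this sketch, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    state: object
    cost_so_far: float = 0.0
    children: List["Node"] = field(default_factory=list)

def expand(node: Node, actions, step_fn: Callable, safety_critic: Callable, cost_budget: float):
    """Add only children whose critic-estimated cumulative cost stays within the budget."""
    for a in actions:
        est_cost = node.cost_so_far + safety_critic(node.state, a)
        if est_cost <= cost_budget:                        # prune trajectories deemed unsafe
            next_state, step_cost = step_fn(node.state, a)
            node.children.append(Node(next_state, node.cost_so_far + step_cost))
    return node.children
```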
https://arxiv.org/abs/2305.16209
Multi-agent routing problems have drawn significant attention nowadays due to their broad industrial applications in, e.g., warehouse robots, logistics automation, and traffic control. Conventionally, they are modelled as classical planning problems. In this paper, we argue that it is beneficial to formulate them as universal planning problems. We therefore propose universal plans, also known as policies, as the solution concepts, and implement a system called ASP-MAUPF (Answer Set Programming for Multi-Agent Universal Plan Finding) for computing them. Given an arbitrary two-dimensional map and a profile of goals for the agents, the system finds a feasible universal plan for each agent that ensures no collision with others. We use the system to conduct some experiments, and make some observations on the types of goal profiles and environments that will have feasible policies, and how they may depend on agents' sensors. We also demonstrate how users can customize action preferences to compute more efficient policies, even (near-)optimal ones.
https://arxiv.org/abs/2305.16203
For automotive applications, the Graph Attention Network (GAT) is a prominently used architecture for incorporating relational information of a traffic scenario during feature embedding. As shown in this work, however, one of the most popular GAT realizations, namely GATv2, has potential pitfalls that hinder optimal parameter learning. Especially for small and sparse graph structures, proper optimization is problematic. To overcome these limitations, this work proposes architectural modifications to GATv2. In controlled experiments, we show that the proposed model adaptations improve prediction performance in a node-level regression task and make it more robust to parameter initialization. This work aims for a better understanding of the attention mechanism and analyzes its interpretability for identifying causal importance.
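For context, GATv2 computes the attention coefficient for an edge (j -> i) as e_ij = a^T LeakyReLU(W [h_i || h_j]); the snippet below scores one node's neighbours this way. It shows the baseline mechanism the paper modifies, not the proposed modifications themselves.

```python
import torch
import torch.nn.functional as F

def gatv2_scores(h_i, h_neighbors, W, a, negative_slope=0.2):
    # h_i: (d,), h_neighbors: (n, d), W: (d_out, 2*d), a: (d_out,)
    pairs = torch.cat([h_i.expand_as(h_neighbors), h_neighbors], dim=-1)   # (n, 2d) concatenated features
    e = F.leaky_relu(pairs @ W.t(), negative_slope) @ a                    # (n,) unnormalized scores
    return torch.softmax(e, dim=0)                                         # attention weights over neighbours

weights = gatv2_scores(torch.randn(8), torch.randn(5, 8), torch.randn(16, 16), torch.randn(16))
```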
https://arxiv.org/abs/2305.16196
Abstractive summary generation is a challenging task that requires the model to comprehend the source text and generate a concise and coherent summary that captures the essential information. In this paper, we explore the use of an encoder/decoder approach for abstractive summary generation in the Urdu language. We employ a transformer-based model that utilizes self-attention mechanisms to encode the input text and generate a summary. Our experiments show that our model can produce summaries that are grammatically correct and semantically meaningful. We evaluate our model on a publicly available dataset and achieve state-of-the-art results in terms of Rouge scores. We also conduct a qualitative analysis of our model's output to assess its effectiveness and limitations. Our findings suggest that the encoder/decoder approach is a promising method for abstractive summary generation in Urdu and can be extended to other languages with suitable modifications.
https://arxiv.org/abs/2305.16195
Explainability techniques are crucial for gaining insight into the reasons behind the predictions of deep learning models, but they have not yet been applied to chemical language models. We propose an explainable AI technique that attributes the importance of individual atoms towards the predictions made by these models. Our method backpropagates the relevance information towards the chemical input string and visualizes the importance of individual atoms. We focus on self-attention Transformers operating on molecular string representations and leverage a pretrained encoder for fine-tuning. We showcase the method by predicting and visualizing solubility in water and organic solvents. We achieve competitive model performance while obtaining interpretable predictions, which we use to inspect the pretrained model.
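As a hedged stand-in for the relevance backpropagation described above, the snippet below uses gradient-times-input attribution on the token embeddings of a chemical string; per-token scores can then be mapped to atoms for visualization. The paper's actual relevance-propagation rule may differ.

```python
import torch

def token_attributions(model, token_embeddings):
    # token_embeddings: (L, d); model is assumed to return a scalar property prediction
    token_embeddings = token_embeddings.clone().requires_grad_(True)
    prediction = model(token_embeddings)
    prediction.backward()                                   # propagate relevance to the input tokens
    return (token_embeddings.grad * token_embeddings).sum(dim=-1)  # (L,) per-token relevance

# Toy model returning a scalar, just to exercise the helper.
relevance = token_attributions(lambda e: e.sum(), torch.randn(12, 64))
```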
https://arxiv.org/abs/2305.16192
In this technical report, we describe our Guided-Attention-based solution for the short-term anticipation (STA) task of the EGO4D challenge. It combines object detections and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decodes the object-centric and motion-centric information to address the problem of STA in egocentric videos. For the challenge, we build our model on top of StillFast with Guided Attention applied to the fast network. Our model obtains better performance on the validation set and also achieves state-of-the-art (SOTA) results on the challenge test set of the EGO4D Short-Term Object Interaction Anticipation Challenge.
https://arxiv.org/abs/2305.16066
Audio-visual person recognition (AVPR) has received extensive attention. However, most datasets used for AVPR research so far were collected in constrained environments and thus cannot reflect the true performance of AVPR systems in real-world scenarios. To meet the demand for research on AVPR in unconstrained conditions, this paper presents a multi-genre AVPR dataset collected `in the wild', named CN-Celeb-AV. This dataset contains more than 420k video segments of 1,136 persons from public media. In particular, we put more emphasis on two real-world complexities: (1) data in multiple genres; (2) segments with partial information. A comprehensive study was conducted to compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the results demonstrated that CN-Celeb-AV is more in line with real-world scenarios and can be regarded as a new benchmark dataset for AVPR research. The dataset also includes a development set that can be used to boost the performance of AVPR systems in real-life situations. The dataset is free for researchers and can be downloaded from this http URL.
https://arxiv.org/abs/2305.16049
We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e., fewer shots. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than any existing approach.
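A rough sketch of a word-to-image attention score of the kind described above: each speech frame of the spoken query attends over image patch embeddings, and the pooled best-match responses give the word-image similarity. The max-then-mean pooling is an assumption for illustration; the paper's exact scoring may differ.

```python
import torch

def word_image_similarity(word_frames, image_patches):
    # word_frames: (T, d) speech embedding frames; image_patches: (N, d) image patch embeddings
    scores = word_frames @ image_patches.t()        # (T, N) frame-to-patch affinities
    return scores.max(dim=1).values.mean().item()   # best-matching patch per frame, averaged

sim = word_image_similarity(torch.randn(40, 512), torch.randn(49, 512))
```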
https://arxiv.org/abs/2305.15937
Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to overcome this issue, we propose a fully transformer tracking framework, coined \emph{MixFormerV2}, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. Then, we apply a unified transformer backbone on this mixed token sequence. These prediction tokens are able to capture the complex correlation between the target template and search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, including dense-to-sparse distillation and deep-to-shallow distillation. The former aims to transfer knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter is used to prune some layers of the backbone. We instantiate two types of MixFormerV2, where MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 57.4\% on TNL2k with a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT with a real-time CPU speed.
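A schematic sketch of the mixed-token design described above: learnable prediction tokens are concatenated with template and search-region tokens, processed by a plain transformer backbone, and simple MLP heads read the box and confidence score from the prediction tokens. Dimensions, depth, and head design are illustrative guesses rather than the MixFormerV2 configuration.

```python
import torch
import torch.nn as nn

class MixedTokenTracker(nn.Module):
    """Hypothetical tracker skeleton: prediction tokens mixed with template and search tokens."""
    def __init__(self, dim=256, num_pred_tokens=4):
        super().__init__()
        self.pred_tokens = nn.Parameter(torch.randn(1, num_pred_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.score_head = nn.Linear(dim, 1)

    def forward(self, template_tokens, search_tokens):
        B = template_tokens.shape[0]
        pred = self.pred_tokens.expand(B, -1, -1)
        x = torch.cat([pred, template_tokens, search_tokens], dim=1)   # one mixed token sequence
        x = self.backbone(x)
        pred_out = x[:, : pred.shape[1]].mean(dim=1)                   # pool the prediction tokens
        return self.box_head(pred_out), self.score_head(pred_out).sigmoid()

box, conf = MixedTokenTracker()(torch.randn(2, 49, 256), torch.randn(2, 196, 256))
```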
https://arxiv.org/abs/2305.15896
Knowledge graph completion (KGC), the task of predicting missing information based on the existing relational data inside a knowledge graph (KG), has drawn significant attention in recent years. However, the predictive power of KGC methods is often limited by the completeness of the existing knowledge graphs from different sources and languages. In monolingual and multilingual settings, KGs are potentially complementary to each other. In this paper, we study the problem of multi-KG completion, where we focus on maximizing the collective knowledge from different KGs to alleviate the incompleteness of individual KGs. Specifically, we propose a novel method called CKGC-CKD that uses relation-aware graph convolutional network encoder models on both the individual KGs and a large fused KG in which seed alignments between KGs are regarded as edges for message propagation. An additional mutual knowledge distillation mechanism is also employed to maximize the knowledge transfer between the model of the "global" fused KG and the models of the "local" individual KGs. Experimental results on multilingual datasets show that our method outperforms all state-of-the-art models in the KGC task.
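A minimal sketch of the mutual distillation term: the "global" model trained on the fused KG and a "local" model trained on an individual KG exchange softened score distributions over candidate entities as soft targets for each other. The symmetric KL form and the temperature are assumptions of this sketch, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(local_scores, global_scores, temperature=2.0):
    # scores: (B, num_candidate_entities) produced by each model for the same queries
    p_local = F.log_softmax(local_scores / temperature, dim=-1)
    p_global = F.log_softmax(global_scores / temperature, dim=-1)
    kl_local_to_global = F.kl_div(p_local, p_global.exp(), reduction="batchmean")
    kl_global_to_local = F.kl_div(p_global, p_local.exp(), reduction="batchmean")
    return (kl_local_to_global + kl_global_to_local) * temperature ** 2

loss = mutual_distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000))
```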
https://arxiv.org/abs/2305.15895