Event Relation Extraction (ERE) aims to extract multiple kinds of relations among events in texts. However, existing methods simply categorize event relations into different classes, which inadequately captures the intrinsic semantics of these relations. To comprehensively understand their intrinsic semantics, in this paper, we obtain prototype representations for each type of event relation and propose a Prototype-Enhanced Matching (ProtoEM) framework for the joint extraction of multiple kinds of event relations. Specifically, ProtoEM extracts event relations in a two-step manner, i.e., prototype representing and prototype matching. In the first step, to capture the connotations of different event relations, ProtoEM utilizes examples to represent the prototypes corresponding to these relations. Subsequently, to capture the interdependence among event relations, it constructs a dependency graph for the prototypes corresponding to these relations and utilizes a Graph Neural Network (GNN)-based module for modeling. In the second step, it obtains the representations of new event pairs and calculates their similarity with the prototypes obtained in the first step to evaluate which types of event relations they belong to. Experimental results on the MAVEN-ERE dataset demonstrate that the proposed ProtoEM framework can effectively represent the prototypes of event relations and achieves a significant improvement over baseline models.
https://arxiv.org/abs/2309.12892
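The two-step "prototype representing, then prototype matching" idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes prototypes are simple averages of example event-pair embeddings, omits the GNN-based dependency-graph module, and all function names and the similarity threshold are hypothetical.

```python
import numpy as np

def build_prototypes(example_embeddings):
    """Step 1 (simplified): represent each relation type by the mean embedding
    of its example event pairs. example_embeddings maps relation name ->
    (n_examples, dim) array of encoder outputs."""
    return {rel: embs.mean(axis=0) for rel, embs in example_embeddings.items()}

def match_event_pair(pair_embedding, prototypes, threshold=0.5):
    """Step 2 (simplified): score a new event-pair embedding against every
    prototype by cosine similarity; relations above the threshold are kept,
    which lets one event pair carry several relation types at once."""
    scores = {}
    for rel, proto in prototypes.items():
        cos = float(np.dot(pair_embedding, proto) /
                    (np.linalg.norm(pair_embedding) * np.linalg.norm(proto) + 1e-8))
        scores[rel] = cos
    return {rel: s for rel, s in scores.items() if s >= threshold}

# toy usage with random vectors standing in for encoder outputs
rng = np.random.default_rng(0)
examples = {"CAUSE": rng.normal(size=(8, 128)), "TEMPORAL": rng.normal(size=(8, 128))}
prototypes = build_prototypes(examples)
print(match_event_pair(rng.normal(size=128), prototypes, threshold=0.0))
```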
Many mathematical models have been leveraged to design embeddings for representing Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically-inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns, which can be validated through both formal proofs and empirical results. In this paper, we provide a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We discover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques, including distance-based and semantic-based methods. We also discuss an emerging approach for KG completion that leverages pre-trained language models (PLMs) and textual descriptions of entities and relations, and offer insights into the integration of KGE methods with PLMs for KG completion.
https://arxiv.org/abs/2309.12501
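To make the two branches concrete, the sketch below contrasts one representative scoring function from each: TransE for distance-based models and DistMult for semantic matching. These are the standard textbook forms, not CompoundE or CompoundE3D, and the random embeddings are placeholders for trained ones.

```python
import numpy as np

def transe_score(h, r, t):
    """Distance-based scoring (TransE): a triple (h, r, t) is plausible when
    t is close to h + r, so the score is the negative L2 distance."""
    return -float(np.linalg.norm(h + r - t))

def distmult_score(h, r, t):
    """Semantic-matching scoring (DistMult): a trilinear product that rewards
    dimension-wise agreement between head, relation, and tail embeddings."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, 64))      # placeholder entity/relation embeddings
print(transe_score(h, r, t), distmult_score(h, r, t))
```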
Panoramic videos contain richer spatial information and have attracted tremendous amounts of attention due to the exceptional experience they provide in fields such as autonomous driving and virtual reality. However, existing datasets for video segmentation focus only on conventional planar images. To address this challenge, in this paper, we present a panoramic video dataset, PanoVOS. The dataset provides 150 videos with high resolutions and diverse motions. To quantify the domain gap between 2D planar videos and panoramic videos, we evaluate 15 off-the-shelf video object segmentation (VOS) models on PanoVOS. Through error analysis, we found that all of them fail to tackle the pixel-level content discontinuities of panoramic videos. Thus, we present the Panoramic Space Consistency Transformer (PSCFormer), which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame. Extensive experiments demonstrate that, compared with previous SOTA models, our PSCFormer network exhibits a great advantage in terms of segmentation results under the panoramic setting. Our dataset poses new challenges in panoramic VOS, and we hope that PanoVOS can advance the development of panoramic segmentation/tracking.
https://arxiv.org/abs/2309.12303
Equipping multi-fingered robots with tactile sensing is crucial for achieving the precise, contact-rich, and dexterous manipulation that humans excel at. However, relying solely on tactile sensing fails to provide adequate cues for reasoning about objects' spatial configurations, limiting the ability to correct errors and adapt to changing situations. In this paper, we present Tactile Adaptation from Visual Incentives (TAVI), a new framework that enhances tactile-based dexterity by optimizing dexterous policies using vision-based rewards. First, we use a contrastive objective to learn visual representations. Next, we construct a reward function from these visual representations through optimal-transport-based matching against one human demonstration. Finally, we use online reinforcement learning on our robot to optimize tactile-based policies that maximize the visual reward. On six challenging tasks, such as peg pick-and-place, unstacking bowls, and flipping slender objects, TAVI achieves a success rate of 73% using our four-fingered Allegro robot hand. This represents a 108% improvement over policies using tactile and vision-based rewards and a 135% improvement over policies without tactile observational input. Robot videos are best viewed on our project website: this https URL.
https://arxiv.org/abs/2309.12300
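The optimal-transport matching step can be approximated in a few lines. The sketch below is an assumption-laden simplification: it uses a hard Hungarian assignment (via scipy) on cosine costs between rollout and demonstration features as a stand-in for the paper's optimal-transport formulation, and the feature arrays are random placeholders for learned visual representations.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matching_reward(rollout_feats, demo_feats):
    """Reward a rollout by how cheaply its visual features can be matched to the
    single human demonstration: solve a minimum-cost assignment on cosine
    distances and return the negative mean matched cost (higher is better)."""
    cost = cdist(rollout_feats, demo_feats, metric="cosine")
    rows, cols = linear_sum_assignment(cost)
    return -float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 256))                        # features of one demonstration
good_rollout = demo + 0.1 * rng.normal(size=(50, 256))   # close to the demo
bad_rollout = rng.normal(size=(50, 256))                 # unrelated behaviour
print(matching_reward(good_rollout, demo), matching_reward(bad_rollout, demo))
```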
As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a demand to recognize diverse objects. State-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel object pose estimation methods solve this issue using task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures. MegaPose, for example, is trained on a dataset consisting of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming, we introduce ZS6D for zero-shot novel object 6D pose estimation. Visual descriptors, extracted using pre-trained Vision Transformers (ViT), are used for matching rendered templates against query images of objects and for establishing local correspondences. These local correspondences enable deriving geometric correspondences and are used for estimating the object's 6D pose with RANSAC-based PnP. This approach showcases that the image descriptors extracted by pre-trained ViTs are well-suited to achieve a notable improvement over two state-of-the-art novel object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS. In comparison to one of the two methods, we improve the Average Recall on all three datasets, and compared to the second method, we improve on two datasets.
https://arxiv.org/abs/2309.11986
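The final stage of the pipeline, estimating the pose from local correspondences with RANSAC-based PnP, maps directly onto OpenCV. The sketch below assumes the ViT descriptor matching has already produced 2D-3D correspondences; the RANSAC parameters are placeholders rather than the authors' settings.

```python
import numpy as np
import cv2

def pose_from_correspondences(points_3d, points_2d, camera_matrix):
    """Estimate a 6D pose from N >= 4 matched 2D-3D correspondences.
    points_3d: (N, 3) object-frame points implied by the matched template,
    points_2d: (N, 2) query-image locations of the matched descriptors."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        camera_matrix.astype(np.float32),
        None,                         # no lens distortion assumed
        reprojectionError=3.0,        # placeholder inlier threshold in pixels
        iterationsCount=1000,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```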
Referring Video Object Segmentation (RVOS) requires segmenting the object referred to by a natural language query in a video. Existing methods mainly rely on sophisticated pipelines to tackle such a cross-modal task and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed \textit{Fully Transformer-Equipped Architecture} (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in the video as candidate objects. Given a video clip with a text query, the visual-textual features are yielded by an encoder, while the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object and directly decodes its feature map into the binary mask sequence in order. Finally, the model finds the best matching between the mask sequence and the text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model for capturing a more accurate mask of the referred object. Empirical studies have shown the superiority of the proposed method on three benchmarks, e.g., FTEA achieves 45.1% and 38.7% in terms of mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively; it achieves 56.6% in terms of $\mathcal{J\&F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best candidate method, it has gains of 2.1% and 3.2% in terms of P$@$0.5 on the former two, respectively, and a gain of 2.9% in terms of $\mathcal{J}$ on the latter.
https://arxiv.org/abs/2309.11933
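One concrete way to read the diversity loss is as a penalty on pairwise overlap between the candidate object masks. The snippet below is only a plausible instantiation of that idea (mean pairwise cosine similarity of the sigmoid mask maps), not the exact loss defined in the paper.

```python
import torch

def diversity_loss(mask_logits):
    """Encourage candidate masks to cover different objects by penalizing the
    mean pairwise cosine similarity between their flattened sigmoid maps.

    mask_logits: (num_candidates, H, W) raw mask scores for one frame."""
    n = mask_logits.shape[0]
    flat = torch.sigmoid(mask_logits).flatten(1)               # (n, H*W)
    flat = torch.nn.functional.normalize(flat, dim=1)
    sim = flat @ flat.t()                                      # (n, n) similarities
    off_diag = sim.sum() - sim.diagonal().sum()                # drop self-similarity
    return off_diag / (n * (n - 1))

print(diversity_loss(torch.randn(10, 64, 64)))                 # toy candidate masks
```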
Over the last decades, ample achievements have been made in Structure from Motion (SfM). However, the vast majority of them basically work in an offline manner, i.e., images are first captured and then fed together into an SfM pipeline to obtain poses and a sparse point cloud. In this work, on the contrary, we present an on-the-fly SfM: SfM runs online while images are being captured, and each newly taken image is estimated online with its corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree, trained in an unsupervised manner on learning-based global features, for fast image retrieval of each newly fly-in image. Then, a robust feature matching mechanism with least squares (LSM) is presented to improve image registration performance. Finally, by investigating the influence of each newly fly-in image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM can meet the goal of robustly registering images while capturing them in an online way.
https://arxiv.org/abs/2309.11883
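The retrieval step that decides which registered images a newly captured frame should be matched against can be sketched as a nearest-neighbour search over global descriptors. The paper uses an unsupervised vocabulary tree for speed; the brute-force cosine search below is only meant to illustrate what that step returns, with random vectors standing in for learned global features.

```python
import numpy as np

def retrieve_neighbors(new_global_feat, registered_feats, k=5):
    """Return the indices of the k registered images most similar to the newly
    fly-in image, by cosine similarity of global descriptors."""
    db = registered_feats / np.linalg.norm(registered_feats, axis=1, keepdims=True)
    q = new_global_feat / np.linalg.norm(new_global_feat)
    sims = db @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
registered = rng.normal(size=(200, 512))     # global features of registered images
new_image = rng.normal(size=512)             # feature of the newly captured image
print(retrieve_neighbors(new_image, registered))
```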
In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at this https URL.
https://arxiv.org/abs/2309.11857
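The global instance assignment can be pictured as a single Hungarian matching over clip-level scores instead of one matching per frame. The sketch below assumes such a clip-aggregated score matrix is already available; the real cost in the paper combines classification and mask terms that are not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_instance_assignment(clip_scores):
    """Match predicted instance queries to ground-truth instances once per clip.
    clip_scores[i, j]: how well query i matches ground-truth instance j,
    aggregated over every frame of the clip (higher is better)."""
    query_idx, gt_idx = linear_sum_assignment(-clip_scores)   # maximize total score
    return list(zip(query_idx.tolist(), gt_idx.tolist()))

rng = np.random.default_rng(0)
scores = rng.random((100, 6))          # 100 queries vs 6 ground-truth instances
print(global_instance_assignment(scores))
```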
Neural radiance fields (NeRFs) are a powerful tool for implicit scene representations, allowing for differentiable rendering and the ability to make predictions about previously unseen viewpoints. From a robotics perspective, there has been growing interest in object and scene-based localisation using NeRFs, with a number of recent works relying on sampling-based or Monte-Carlo localisation schemes. Unfortunately, these can be extremely computationally expensive, requiring multiple network forward passes to infer camera or object pose. To alleviate this, a variety of sampling strategies have been applied, many relying on keypoint recognition techniques from classical computer vision. This work conducts a systematic empirical comparison of these approaches and shows that in contrast to conventional feature matching approaches for geometry-based localisation, sampling-based localisation using NeRFs benefits significantly from stable features. Results show that rendering stable features can result in a tenfold reduction in the number of forward passes required, a significant speed improvement.
https://arxiv.org/abs/2309.11698
Federated Learning (FL) addresses the need to create models based on proprietary data in such a way that multiple clients retain exclusive control over their data, while all benefit from improved model accuracy due to pooled resources. Recently proposed Neural Graphical Models (NGMs) are probabilistic graphical models that utilize the expressive power of neural networks to learn complex non-linear dependencies between the input features. They learn to capture the underlying data distribution and have efficient algorithms for inference and sampling. We develop an FL framework which maintains a global NGM model that learns the averaged information from the local NGM models while keeping the training data within each client's environment. Our design, FedNGMs, avoids the pitfalls and shortcomings of neuron matching frameworks like Federated Matched Averaging, which suffer from model parameter explosion. Our global model size remains constant throughout the process. In cases where clients have local variables that are not part of the combined global distribution, we propose a `Stitching' algorithm, which personalizes the global NGM model by merging the additional variables using the client's data. FedNGMs is robust to data heterogeneity, a large number of participants, and limited communication bandwidth.
https://arxiv.org/abs/2309.11680
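The fixed-size global model is what distinguishes this design from neuron-matching aggregation. The sketch below shows a FedAvg-style, sample-weighted parameter average that is consistent with that description; the paper's exact aggregation rule and the Stitching algorithm for client-specific variables are not reproduced, and all names are illustrative.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate locally trained parameters into a global model of the same,
    fixed size via a sample-weighted average (no neuron matching).

    client_weights: list of dicts {param_name: np.ndarray}, identical shapes.
    client_sizes: number of local training samples per client."""
    total = float(sum(client_sizes))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }

rng = np.random.default_rng(0)
clients = [{"W": rng.normal(size=(4, 4))} for _ in range(3)]   # toy local models
global_model = federated_average(clients, client_sizes=[100, 50, 25])
print(global_model["W"].shape)          # stays (4, 4) regardless of client count
```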
The prevalence of ubiquitous location-aware devices and mobile Internet enables us to collect massive individual-level trajectory dataset from users. Such trajectory big data bring new opportunities to human mobility research but also raise public concerns with regard to location privacy. In this work, we present the Conditional Adversarial Trajectory Synthesis (CATS), a deep-learning-based GeoAI methodological framework for privacy-preserving trajectory data generation and publication. CATS applies K-anonymity to the underlying spatiotemporal distributions of human movements, which provides a distributional-level strong privacy guarantee. By leveraging conditional adversarial training on K-anonymized human mobility matrices, trajectory global context learning using the attention-based mechanism, and recurrent bipartite graph matching of adjacent trajectory points, CATS is able to reconstruct trajectory topology from conditionally sampled locations and generate high-quality individual-level synthetic trajectory data, which can serve as supplements or alternatives to raw data for privacy-preserving trajectory data publication. The experiment results on over 90k GPS trajectories show that our method has a better performance in privacy preservation, spatiotemporal characteristic preservation, and downstream utility compared with baseline methods, which brings new insights into privacy-preserving human mobility research using generative AI techniques and explores data ethics issues in GIScience.
https://arxiv.org/abs/2309.11587
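The "bipartite graph matching of adjacent trajectory points" can be illustrated with a minimum-cost assignment between candidate locations at consecutive time steps. This is only the per-step matching idea with a plain Euclidean cost; the recurrent formulation and learned costs in the paper are not shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_adjacent_points(points_t, points_t1):
    """Link candidate locations sampled at time t to those at time t+1 by
    solving a minimum-cost bipartite matching on pairwise distances; the
    resulting pairs define the reconstructed trajectory topology."""
    cost = np.linalg.norm(points_t[:, None, :] - points_t1[None, :, :], axis=-1)
    i, j = linear_sum_assignment(cost)
    return list(zip(i.tolist(), j.tolist()))

rng = np.random.default_rng(0)
step_t = rng.uniform(size=(5, 2))       # candidate (x, y) locations at time t
step_t1 = rng.uniform(size=(5, 2))      # candidate locations at time t + 1
print(match_adjacent_points(step_t, step_t1))
```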
Previous attempts to incorporate a mention detection step into end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention span data as well as other entity information. This paper presents a coreference model that learns singletons as well as features such as entity type and information status via a multi-task learning-based approach. This approach achieves new state-of-the-art scores on the OntoGUM benchmark (+2.7 points) and increases robustness on multiple out-of-domain datasets (+2.3 points on average), likely due to greater generalizability for mention detection and utilization of more data from singletons when compared to only coreferent mention pair matching.
https://arxiv.org/abs/2309.11582
Neuromorphic computing is one of the few current approaches that have the potential to significantly reduce power consumption in Machine Learning and Artificial Intelligence. Imam & Cleland presented an odour-learning algorithm that runs on a neuromorphic architecture and is inspired by circuits described in the mammalian olfactory bulb. They assess the algorithm's performance in "rapid online learning and identification" of gaseous odorants and odorless gases (short "gases") using a set of gas sensor recordings of different odour presentations and corrupting them by impulse noise. We replicated parts of the study and discovered limitations that affect some of the conclusions drawn. First, the dataset used suffers from sensor drift and a non-randomised measurement protocol, rendering it of limited use for odour identification benchmarks. Second, we found that the model is restricted in its ability to generalise over repeated presentations of the same gas. We demonstrate that the task the study refers to can be solved with a simple hash table approach, matching or exceeding the reported results in accuracy and runtime. Therefore, a validation of the model that goes beyond restoring a learned data sample remains to be shown, in particular its suitability to odour identification tasks.
https://arxiv.org/abs/2309.11555
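The hash-table baseline referred to above can be prototyped in a few lines: store a quantized version of each clean sensor reading as a key, and at query time fall back to the nearest stored key when impulse noise breaks the exact match. The class below is a toy reconstruction under those assumptions, not the authors' code; the bin count, sensor dimensionality, and gas labels are placeholders.

```python
import numpy as np

class HashTableGasIdentifier:
    """Quantize each sensor reading into a tuple key and remember its label;
    look the key up at test time, falling back to the nearest stored key."""

    def __init__(self, n_bins=16):
        self.n_bins = n_bins
        self.table = {}              # quantized reading -> gas label

    def _key(self, reading):
        q = np.clip((reading * self.n_bins).astype(int), 0, self.n_bins - 1)
        return tuple(q.tolist())

    def store(self, reading, label):
        self.table[self._key(reading)] = label

    def identify(self, reading):
        key = self._key(reading)
        if key in self.table:                        # exact match
            return self.table[key]
        keys = np.array(list(self.table.keys()))     # nearest-key fallback
        dists = np.abs(keys - np.array(key)).sum(axis=1)
        return self.table[tuple(keys[dists.argmin()].tolist())]

rng = np.random.default_rng(0)
ident = HashTableGasIdentifier()
clean = {"gas_A": rng.uniform(size=72), "gas_B": rng.uniform(size=72)}
for name, reading in clean.items():
    ident.store(reading, name)
noisy = clean["gas_A"].copy()
noisy[rng.choice(72, size=7, replace=False)] = rng.uniform(size=7)  # impulse noise
print(ident.identify(noisy))        # recovers "gas_A" despite the corruption
```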
In today's globalized world, effective communication with people from diverse linguistic backgrounds has become increasingly crucial. While traditional methods of language translation, such as written text or voice-only translations, can accomplish the task, they often fail to capture the complete context and nuanced information conveyed through nonverbal cues like facial expressions and lip movements. In this paper, we present an end-to-end video translation system that not only translates spoken language but also synchronizes the translated speech with the lip movements of the speaker. Our system focuses on translating educational lectures in various Indian languages, and it is designed to be effective even in low-resource system settings. By incorporating lip movements that align with the target language and matching them with the speaker's voice using voice cloning techniques, our application offers an enhanced experience for students and users. This additional feature creates a more immersive and realistic learning environment, ultimately making the learning process more effective and engaging.
https://arxiv.org/abs/2309.11338
Camera localization in 3D LiDAR maps has gained increasing attention due to its promising ability to handle complex scenarios, surpassing the limitations of visual-only localization methods. However, existing methods mostly focus on addressing the cross-modal gaps, estimating camera poses frame by frame without considering the relationship between adjacent frames, which makes the pose tracking unstable. To alleviate this, we propose to couple the 2D-3D correspondences between adjacent frames using 2D-2D feature matching, establishing multi-view geometrical constraints for simultaneously estimating multiple camera poses. Specifically, we propose a new 2D-3D pose tracking framework, which consists of a front-end hybrid flow estimation network for consecutive frames and a back-end pose optimization module. We further design a cross-modal consistency-based loss to incorporate the multi-view constraints during the training and inference process. We evaluate our proposed framework on the KITTI and Argoverse datasets. Experimental results demonstrate its superior performance compared to existing frame-by-frame 2D-3D pose tracking methods and state-of-the-art vision-only pose tracking algorithms. More online pose tracking videos are available at \url{this https URL}.
https://arxiv.org/abs/2309.11335
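The coupling of adjacent frames can be pictured as correspondence chaining: if a pixel in frame t is already tied to a 3D map point (2D-3D) and is matched to a pixel in frame t+1 (2D-2D), the new pixel inherits that 3D point, which is what yields multi-view constraints for joint pose estimation. The dict-based sketch below is an illustration of that bookkeeping only; the pixel and map-point ids are placeholders for dense flow output.

```python
def propagate_correspondences(corr_2d3d_prev, matches_prev_to_curr):
    """Transfer 2D-3D correspondences from frame t to frame t+1 through the
    2D-2D matches between the two frames.

    corr_2d3d_prev: dict pixel_id_in_frame_t -> map_point_id
    matches_prev_to_curr: dict pixel_id_in_frame_t -> pixel_id_in_frame_t_plus_1"""
    corr_2d3d_curr = {}
    for pix_t, point_id in corr_2d3d_prev.items():
        pix_t1 = matches_prev_to_curr.get(pix_t)
        if pix_t1 is not None:
            corr_2d3d_curr[pix_t1] = point_id
    return corr_2d3d_curr

# toy example: three pixels carry a 3D correspondence, two are matched forward
prev_2d3d = {0: 101, 1: 205, 2: 309}
flow_matches = {0: 7, 2: 9}            # pixel 1 was not matched in frame t+1
print(propagate_correspondences(prev_2d3d, flow_matches))   # {7: 101, 9: 309}
```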
Echocardiogram video segmentation plays an important role in cardiac disease diagnosis. This paper studies unsupervised domain adaptation (UDA) for echocardiogram video segmentation, where the goal is to generalize a model trained on the source domain to other unlabelled target domains. Existing UDA segmentation methods are not suitable for this task because they do not model local information or the cyclical consistency of the heartbeat. In this paper, we introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for cardiac structure segmentation. Our GraphEcho comprises two innovative modules, the Spatial-wise Cross-domain Graph Matching (SCGM) module and the Temporal Cycle Consistency (TCC) module, which utilize prior knowledge of echocardiogram videos, i.e., the consistent cardiac structure across patients and centers and the cyclical consistency of the heartbeat, respectively. These two modules can better align global and local features from source and target domains, improving UDA segmentation results. Experimental results show that our GraphEcho outperforms existing state-of-the-art UDA segmentation methods. Our collected dataset and code will be publicly released upon acceptance. This work lays a new and solid cornerstone for cardiac structure segmentation from echocardiogram videos. Code and dataset are available at: this https URL
https://arxiv.org/abs/2309.11145
Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
https://arxiv.org/abs/2309.11081
Across various domains, data from different sources such as Baidu Baike and Wikipedia often manifest in distinct forms. Current entity matching methodologies predominantly focus on homogeneous data, characterized by attributes that share the same structure and concise attribute values. However, this orientation poses challenges in handling data with diverse formats. Moreover, prevailing approaches aggregate the similarity of attribute values between corresponding attributes to ascertain entity similarity. Yet, they often overlook the intricate interrelationships between attributes, where one attribute may have multiple associations. The simplistic approach of pairwise attribute comparison fails to harness the wealth of information encapsulated within the data. To address these challenges, we introduce a novel entity matching model, dubbed Entity Matching Model for Capturing Complex Attribute Relationships (EMM-CCAR), built upon pre-trained models. Specifically, this model transforms the matching task into a sequence matching problem to mitigate the impact of varying data formats. Moreover, by introducing attention mechanisms, it identifies complex relationships between attributes, emphasizing the degree of matching among multiple attributes rather than one-to-one correspondences. Through the integration of the EMM-CCAR model, we adeptly surmount the challenges posed by data heterogeneity and intricate attribute interdependencies. In comparison with the prevalent DER-SSM and Ditto approaches, our model achieves improvements of approximately 4% and 1% in F1 scores, respectively. This furnishes a robust solution for addressing the intricacies of attribute complexity in entity matching.
https://arxiv.org/abs/2309.11046
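A useful way to picture "transforming the matching task into a sequence matching problem" is the serialization step below: each heterogeneous record is flattened into one token sequence so a pre-trained language model can compare two entities as a sequence pair and let attention relate any attribute on one side to any attribute on the other. The [ATTR]/[VAL]/[SEP] markers and the sample records are illustrative placeholders, not the paper's exact scheme.

```python
def serialize_entity(record):
    """Flatten a record with arbitrary attributes into a single string so that
    entities with different schemas become comparable as plain sequences."""
    return " ".join(f"[ATTR] {attr} [VAL] {value}" for attr, value in record.items())

left = {"title": "iPhone 14 Pro 128GB", "brand": "Apple", "colour": "black"}
right = {"name": "Apple iPhone14Pro", "storage": "128 GB", "color": "Black"}

pair_input = serialize_entity(left) + " [SEP] " + serialize_entity(right)
print(pair_input)
# pair_input would be fed to a pre-trained encoder; attention over the whole
# sequence can then weigh many-to-many attribute relationships, not just 1-to-1 pairs.
```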
We propose a new score-based model with one-step sampling. Previously, score-based models were burdened with heavy computations due to iterative sampling. To substitute the iterative process, we train a standalone generator to compress all the time steps with the gradient backpropagated from the score network. In order to produce meaningful gradients for the generator, the score network is trained to simultaneously match the real data distribution and mismatch the fake data distribution. This model has the following advantages: 1) For sampling, it generates a fake image with only one forward step. 2) For training, it only needs 10 diffusion steps. 3) Compared with the consistency model, it is free of the ill-posed problem caused by the consistency loss. On the popular CIFAR-10 dataset, our model outperforms the Consistency Model and Denoising Score Matching, which demonstrates the potential of the framework. We further provide more examples on the MNIST and LSUN datasets. The code is available on GitHub.
https://arxiv.org/abs/2309.11043
In recent years, there has been increasing interest in applying stylization on 3D scenes from a reference style image, in particular onto neural radiance fields (NeRF). While performing stylization directly on NeRF guarantees appearance consistency over arbitrary novel views, it is a challenging problem to guide the transfer of patterns from the style image onto different parts of the NeRF scene. In this work, we propose a stylization framework for NeRF based on local style transfer. In particular, we use a hash-grid encoding to learn the embedding of the appearance and geometry components, and show that the mapping defined by the hash table allows us to control the stylization to a certain extent. Stylization is then achieved by optimizing the appearance branch while keeping the geometry branch fixed. To support local style transfer, we propose a new loss function that utilizes a segmentation network and bipartite matching to establish region correspondences between the style image and the content images obtained from volume rendering. Our experiments show that our method yields plausible stylization results with novel view synthesis while having flexible controllability via manipulating and customizing the region correspondences.
https://arxiv.org/abs/2309.10684
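The region-correspondence step (a segmentation network plus bipartite matching) can be sketched as a Hungarian assignment over average region features. The code below is a simplified stand-in: cosine similarity of per-region feature vectors replaces whatever cost the paper actually optimizes, and the random features are placeholders for segmentation-network outputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(content_region_feats, style_region_feats):
    """Establish one-to-one correspondences between segmented regions of a
    rendered content view and regions of the style image by maximizing the
    total cosine similarity of their average features."""
    c = content_region_feats / np.linalg.norm(content_region_feats, axis=1, keepdims=True)
    s = style_region_feats / np.linalg.norm(style_region_feats, axis=1, keepdims=True)
    ci, si = linear_sum_assignment(-(c @ s.T))       # negate: maximize similarity
    return list(zip(ci.tolist(), si.tolist()))

rng = np.random.default_rng(0)
content_regions = rng.normal(size=(6, 64))    # e.g. 6 regions in a rendered view
style_regions = rng.normal(size=(6, 64))      # regions segmented in the style image
print(match_regions(content_regions, style_regions))
```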