In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pre-trained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', tailored with dedicated components built upon a fixed pre-trained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using the standard Gaussian noise, we propose to predict the video-specific noise with an extra noise prediction module, which can help preserve feature fidelity and elevate segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pre-trained with discriminative image/video pretext tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at \url{this https URL}
https://arxiv.org/abs/2403.12042
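As a rough illustration of the noise-prediction idea above, the toy sketch below (not the authors' code) replaces the standard Gaussian noise of the diffusion forward process with noise predicted from the video latents themselves before the frozen T2V UNet would be queried for features; the module, tensor shapes, and noise-schedule value are all assumptions.

```python
# Toy sketch: video-specific noise instead of torch.randn_like() before feature extraction.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Tiny stand-in for the extra noise-prediction module."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latents):            # latents: (B, C, T, H, W)
        return self.net(latents)

def noised_latents(latents, noise_predictor, alpha_bar_t=0.7):
    # Standard DDPM forward step, but with predicted (video-specific) noise
    # in place of standard Gaussian noise.
    eps = noise_predictor(latents)
    return alpha_bar_t ** 0.5 * latents + (1 - alpha_bar_t) ** 0.5 * eps

latents = torch.randn(1, 4, 8, 32, 32)      # toy video latents
z_t = noised_latents(latents, NoisePredictor())
print(z_t.shape)                             # features would then be read from the frozen UNet
```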
Integrating LiDAR and camera information into a Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to inaccurate calibration between the LiDAR and camera sensors. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called Graph BEV. To address errors caused by inaccurate point cloud projection, we introduce a Local Align module that employs neighbor-aware depth features via graph matching. Additionally, we propose a Global Align module to rectify the misalignment between LiDAR and camera BEV features. Our Graph BEV framework achieves state-of-the-art performance, with an mAP of 70.1\%, surpassing BEV Fusion by 1.6\% on the nuScenes validation set. Importantly, our Graph BEV outperforms BEV Fusion by 8.3\% under conditions with misalignment noise.
https://arxiv.org/abs/2403.11848
Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. Therefore, we propose, for the first time, a relational representation learning idea that simultaneously focuses on sufficiently mining the intrinsic features of individual image patches and the relations between image patch features. Based on this, we construct a lightweight Relational Representation Learning Network (RRL-Net). Specifically, we innovatively construct an autoencoder to fully characterize individual intrinsic features, and introduce a Feature Interaction Learning (FIL) module to extract deep-level feature relations. To further mine individual intrinsic features, a lightweight Multi-dimensional Global-to-Local Attention (MGLA) module is constructed to enhance global feature extraction of individual image patches and capture local dependencies within global features. Combining the MGLA module, we further explore the feature extraction network and construct an Attention-based Lightweight Feature Extraction (ALFE) network. In addition, we propose a Multi-Loss Post-Pruning (MLPP) optimization strategy, which greatly promotes network optimization while avoiding increases in parameters and inference time. Extensive experiments demonstrate that our RRL-Net achieves state-of-the-art (SOTA) performance on multiple public datasets. Our code will be made publicly available.
https://arxiv.org/abs/2403.11751
We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training. Previous object navigation approaches apply the image-goal navigation (ImageNav) task (go to the location of an image) for pretraining, and transfer the agent to achieve object goals using a vision-language model. However, these approaches lead to issues of semantic neglect, where the model fails to learn meaningful semantic alignments. In this paper, we propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents. Specifically, a semantic-enhanced PSL agent is proposed and a prioritized semantic training strategy is introduced to select goal images that exhibit clear semantic supervision and relax the reward function from strict exact view matching. At inference time, a semantic expansion inference scheme is designed to preserve the same granularity level of the goal-semantic as training. Furthermore, for the popular HM3D environment, we present an Instance Navigation (InstanceNav) task that requires going to a specific object instance with detailed descriptions, as opposed to the Object Navigation (ObjectNav) task where the goal is defined merely by the object category. Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task. Code will be released at https://anonymous.4open.science/r/PSL/.
https://arxiv.org/abs/2403.11650
Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results, while balancing cost and performance. Our solution outperforms single modality matching systems and large pretrained models, such as CLIP. Furthermore we show how a human-in-the-loop process can be combined with model-based predictions to achieve near perfect precision in a production system.
https://arxiv.org/abs/2403.11593
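A minimal sketch of the kind of design the abstract describes: small projection heads on top of frozen pretrained image and text embeddings, trained with a symmetric InfoNCE contrastive loss. Embedding dimensions, head architecture, and the temperature are assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)   # unit-norm embeddings for cosine similarity

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch of frozen-encoder outputs (stand-ins for pretrained image/text features).
img_feat, txt_feat = torch.randn(16, 768), torch.randn(16, 768)
head_i, head_t = ProjectionHead(), ProjectionHead()
loss = contrastive_loss(head_i(img_feat), head_t(txt_feat))
loss.backward()
```

Only the small projection heads are trained here, which is consistent with the cost/performance balance the abstract emphasizes.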
Continual learning aims to refine model parameters for new tasks while retaining knowledge from previous tasks. Recently, prompt-based learning has emerged, which prompts pre-trained models to learn subsequent tasks without relying on a rehearsal buffer. Although this approach has demonstrated outstanding results, existing methods depend on a preceding task-selection process to choose appropriate prompts. However, imperfect task selection may negatively impact performance, particularly in scenarios where the number of tasks is large or task distributions are imbalanced. To address this issue, we introduce I-Prompt, a task-agnostic approach that focuses on the visual semantic information of image tokens to eliminate task prediction. Our method consists of semantic prompt matching, which determines prompts based on similarities between tokens, and image token-level prompting, which applies prompts directly to image tokens in the intermediate layers. Consequently, our method achieves competitive performance on four benchmarks while significantly reducing training time compared to state-of-the-art methods. Moreover, we demonstrate the superiority of our method across various scenarios through extensive experiments.
https://arxiv.org/abs/2403.11537
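The following hedged sketch illustrates the two components named above, semantic prompt matching and image token-level prompting, under assumed shapes: prompts are selected from a pool by similarity to the image tokens (no task id needed) and then injected directly into the tokens of an intermediate layer.

```python
import torch
import torch.nn.functional as F

def select_prompts(image_tokens, prompt_keys, prompt_values, top_k=2):
    # image_tokens: (B, N, D); prompt_keys/values: (P, D)
    token_summary = F.normalize(image_tokens.mean(dim=1), dim=-1)      # (B, D)
    sims = token_summary @ F.normalize(prompt_keys, dim=-1).t()        # (B, P) token-prompt similarity
    idx = sims.topk(top_k, dim=-1).indices                             # (B, top_k)
    return prompt_values[idx]                                          # (B, top_k, D)

def apply_token_level_prompts(image_tokens, prompts):
    # Inject the selected prompts directly into the image tokens (here: add their mean),
    # rather than prepending them to the token sequence.
    return image_tokens + prompts.mean(dim=1, keepdim=True)

tokens = torch.randn(4, 196, 768)             # ViT-like image tokens
keys, values = torch.randn(10, 768), torch.randn(10, 768)
prompted = apply_token_level_prompts(tokens, select_prompts(tokens, keys, values))
print(prompted.shape)                          # torch.Size([4, 196, 768])
```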
Storing intermediate frame segmentations as memory for long-range context modeling, spatial-temporal memory-based methods have recently showcased impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) relying on non-local pixel-level matching to read memory, resulting in noisy retrieved features for segmentation; 2) segmenting each object independently without interaction. These shortcomings make the memory-based methods struggle in similar object and multi-object segmentation. To address these issues, we propose a query modulation method, termed QMVOS. This method summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model. Efficient and effective multi-object interactions are realized through inter-query attention. Extensive experiments demonstrate that our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks. The code is available at this https URL.
https://arxiv.org/abs/2403.11529
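A compact sketch of query modulation as described above, with illustrative shapes: per-object features are treated as queries, refined by inter-query attention, and then used as dynamic 1x1 filters over the feature map to produce one mask per object.

```python
import torch
import torch.nn as nn

class QueryMaskHead(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.inter_query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, object_queries, feature_map):
        # object_queries: (B, num_objects, C); feature_map: (B, C, H, W)
        q, _ = self.inter_query_attn(object_queries, object_queries, object_queries)
        # Each refined query acts as a dynamic filter: dot product with every pixel feature.
        mask_logits = torch.einsum('bqc,bchw->bqhw', q, feature_map)
        return mask_logits

head = QueryMaskHead()
logits = head(torch.randn(2, 3, 256), torch.randn(2, 256, 64, 64))
print(logits.shape)   # torch.Size([2, 3, 64, 64]) -- one mask per object query
```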
This paper focuses on the sim-to-real issue of RGB-D grasp detection and formulates it as a domain adaptation problem. In this case, we present a global-to-local method to address hybrid domain gaps in RGB and depth data and insufficient multi-modal feature alignment. First, a self-supervised rotation pre-training strategy is adopted to deliver robust initialization for RGB and depth networks. We then propose a global-to-local alignment pipeline with individual global domain classifiers for scene features of RGB and depth images as well as a local one specifically working for grasp features in the two modalities. In particular, we propose a grasp prototype adaptation module, which aims to facilitate fine-grained local feature alignment by dynamically updating and matching the grasp prototypes from the simulation and real-world scenarios throughout the training process. Due to such designs, the proposed method substantially reduces the domain shift and thus leads to consistent performance improvements. Extensive experiments are conducted on the GraspNet-Planar benchmark and physical environment, and superior results are achieved which demonstrate the effectiveness of our method.
https://arxiv.org/abs/2403.11511
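A toy sketch of one plausible reading of the grasp prototype adaptation module: prototypes of grasp features from the simulation and real domains are maintained with exponential moving averages and pulled together by an alignment loss. The momentum, feature dimension, and loss choice are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def update_prototype(prototype, batch_features, momentum=0.9):
    # EMA update of a domain prototype from the current batch of grasp features.
    return momentum * prototype + (1 - momentum) * batch_features.mean(dim=0)

sim_proto = torch.zeros(128)
real_proto = torch.zeros(128)
for _ in range(10):                                   # stand-in for training iterations
    sim_feats, real_feats = torch.randn(32, 128), torch.randn(32, 128)
    sim_proto = update_prototype(sim_proto, sim_feats)
    real_proto = update_prototype(real_proto, real_feats)
    align_loss = F.mse_loss(sim_proto, real_proto)    # pulls grasp features of the two domains together
print(float(align_loss))
```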
The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature-matching networks, we employ attentional aggregation over graph edges to enhance keypoint representations, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme that extracts pseudo labels for image pairs in dynamic environments exclusively from unprocessed visual-inertial data. A series of experiments shows that our network excludes keypoints on moving objects more reliably than state-of-the-art feature matching networks while still achieving comparable results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
https://arxiv.org/abs/2403.11370
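To make the epipolar augmentation concrete, the small numpy sketch below shows one standard way to prune candidate match edges with a known fundamental matrix: keep only pairs whose point-to-epipolar-line distance is below a threshold. The matrix, threshold, and synthetic points are placeholders, not the paper's pipeline.

```python
import numpy as np

def epipolar_distance(pts1, pts2, F):
    # pts1, pts2: (N, 2) candidate correspondences in the two images.
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])            # homogeneous coordinates
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                          # epipolar lines of pts1 in image 2
    num = np.abs(np.sum(x2 * Fx1, axis=1))
    denom = np.sqrt(Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2) + 1e-8
    return num / denom                      # point-to-line distance in image 2

pts1 = np.random.rand(100, 2) * 640
pts2 = pts1 + np.random.randn(100, 2)       # nearly consistent candidate matches
F = np.array([[0, -1e-4, 0.02], [1e-4, 0, -0.03], [-0.02, 0.03, 1.0]])
keep = epipolar_distance(pts1, pts2, F) < 3.0
print(int(keep.sum()), "candidate edges kept")
```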
This paper presents a novel system designed for 3D mapping and visual relocalization using 3D Gaussian Splatting. Our proposed method uses LiDAR and camera data to create accurate and visually plausible representations of the environment. By leveraging LiDAR data to initiate the training of the 3D Gaussian Splatting map, our system constructs maps that are both detailed and geometrically accurate. To mitigate excessive GPU memory usage and facilitate rapid spatial queries, we employ a combination of a 2D voxel map and a KD-tree. This preparation makes our method well-suited for visual localization tasks, enabling efficient identification of correspondences between the query image and the rendered image from the Gaussian Splatting map via normalized cross-correlation (NCC). Additionally, we refine the camera pose of the query image using feature-based matching and the Perspective-n-Point (PnP) technique. The effectiveness, adaptability, and precision of our system are demonstrated through extensive evaluation on the KITTI360 dataset.
https://arxiv.org/abs/2403.11367
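A sketch of the two localization steps mentioned above, on toy data: a candidate pose is scored by normalized cross-correlation (NCC) between the query image and the image rendered from the map, and the pose is then refined from 2D-3D correspondences with OpenCV's PnP + RANSAC. The intrinsics, correspondences, and "rendered" image are stand-ins, not outputs of a Gaussian Splatting map.

```python
import cv2
import numpy as np

def ncc(a, b):
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

query = np.random.rand(120, 160).astype(np.float32)
rendered = query + 0.05 * np.random.rand(120, 160).astype(np.float32)   # stand-in for a map render
print("NCC score:", ncc(query, rendered))

# Pose refinement from (hypothetical) matches between query pixels and 3D map points.
object_points = np.random.rand(50, 3).astype(np.float32) * 5.0
K = np.array([[400, 0, 80], [0, 400, 60], [0, 0, 1]], dtype=np.float32)
rvec_gt = np.array([[0.1], [0.0], [0.0]], np.float32)
tvec_gt = np.array([[0.0], [0.0], [4.0]], np.float32)
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None)
print("PnP converged:", ok)
```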
Data visualization serves as a critical means for presenting data and mining its valuable insights. The task of chart summarization, through natural language processing techniques, facilitates in-depth data analysis of charts. However, existing approaches still exhibit notable deficiencies in visual-language matching and reasoning ability. To address these limitations, this study constructs a large-scale dataset of comprehensive chart-caption pairs and fine-tuning instructions for each chart. Thanks to the broad coverage of topics and visual styles within this dataset, a better matching degree can be achieved from the perspective of training data. Moreover, we propose an innovative chart summarization method, ChartThinker, which synthesizes deep analysis based on chains of thought and strategies of context retrieval, aiming to improve the logical coherence and accuracy of the generated summaries. Built upon the curated datasets, our trained model consistently exhibits superior performance in chart summarization tasks, surpassing 8 state-of-the-art models over 7 evaluation metrics. Our dataset and codes are publicly accessible.
https://arxiv.org/abs/2403.11236
Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model, where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory, to prevent convergence issues and retain the stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks $1^{st}$ on both the KITTI 2012 and 2015 leaderboards among all published methods while running in under 100 ms. It significantly outperforms prior global methods, e.g., lowering the D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The code is available at this https URL.
https://arxiv.org/abs/2403.11193
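A minimal sketch of the disparity-proposal idea: rather than reasoning over every disparity, keep only the top-k most likely disparities per pixel from a coarse cost volume. The random cost volume and the value of k here are purely illustrative.

```python
import torch

def disparity_proposals(cost_volume, k=4):
    # cost_volume: (B, D, H, W), lower cost = better match.
    scores = -cost_volume                                 # convert cost to a score
    topk = scores.topk(k, dim=1)                          # per-pixel top-k disparities
    return topk.indices                                   # (B, k, H, W) pruned search space

cost = torch.rand(1, 192, 64, 128)                        # 192 candidate disparities
proposals = disparity_proposals(cost)
print(proposals.shape)                                    # torch.Size([1, 4, 64, 128])
```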
The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifested as severe deformations, fast motion, and occlusions. Most methods that solely depend on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues. Correspondingly, a fine-grained sampler and matching method have been incorporated. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning. Project page: this https URL.
https://arxiv.org/abs/2403.11186
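In the spirit of the point-level cues described above, the toy sketch below scores track/detection pairs by the average cosine similarity of point features sampled inside each box and solves the association with the Hungarian algorithm; how the point features are obtained is left abstract.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_point_similarity(track_pts, det_pts):
    # track_pts: (T, P, D), det_pts: (N, Q, D) -- sampled point features per object.
    t = track_pts / (np.linalg.norm(track_pts, axis=-1, keepdims=True) + 1e-8)
    d = det_pts / (np.linalg.norm(det_pts, axis=-1, keepdims=True) + 1e-8)
    # Mean cosine similarity over all point pairs for each (track, detection).
    return np.einsum('tpd,nqd->tn', t, d) / (t.shape[1] * d.shape[1])

tracks, dets = np.random.randn(3, 16, 64), np.random.randn(4, 16, 64)
sim = pairwise_point_similarity(tracks, dets)
row, col = linear_sum_assignment(-sim)          # maximize total similarity
print(list(zip(row.tolist(), col.tolist())))
```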
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. In the context of dense matching, many works benefit from one of two forms of aggregation: feature aggregation, which pertains to the alignment of similar features, or cost aggregation, a procedure aimed at instilling coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. We then introduce a simple yet effective architecture that harnesses self- and cross-attention mechanisms to show that our approach unifies feature aggregation and cost aggregation and effectively harnesses the strengths of both techniques. Within the proposed attention layers, the features and cost volume both complement each other, and the attention layers are interleaved through a coarse-to-fine design to further promote accurate correspondence estimation. Finally at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow for final prediction. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
https://arxiv.org/abs/2403.11120
Topological accuracy in medical image segmentation is a highly important property for downstream applications such as network analysis and flow modeling in vessels or cell counting. Recently, significant methodological advancements have brought well-founded concepts from algebraic topology to binary segmentation. However, these approaches have been underexplored in multi-class segmentation scenarios, where topological errors are common. We propose a general loss function for topologically faithful multi-class segmentation that extends the recent Betti matching concept, which is based on induced matchings of persistence barcodes. We project the N-class segmentation problem onto N single-class segmentation tasks, which allows us to use 1-parameter persistent homology, making the training of neural networks computationally feasible. We validate our method on a comprehensive set of four medical datasets with highly variant topological characteristics. Our loss formulation significantly enhances topological correctness in cardiac, cell, artery-vein, and Circle of Willis segmentation.
https://arxiv.org/abs/2403.11001
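A hedged sketch of the one-vs-rest decomposition described above: the N-class prediction is reduced to N binary problems and a single-class topological loss is summed over them. The `single_class_topo_loss` below is only a differentiable placeholder for a Betti-matching-style barcode loss, not an actual persistent-homology implementation.

```python
import torch
import torch.nn.functional as F

def single_class_topo_loss(prob, target):
    # Placeholder: a real version would compare persistence barcodes of `prob`
    # and `target` via induced matchings; here we just return a differentiable dummy.
    return F.mse_loss(prob, target)

def multiclass_topo_loss(logits, labels, num_classes):
    # logits: (B, N, H, W); labels: (B, H, W) with integer class ids.
    probs = logits.softmax(dim=1)
    loss = logits.new_zeros(())
    for c in range(num_classes):                       # one 1-parameter filtration per class
        loss = loss + single_class_topo_loss(probs[:, c], (labels == c).float())
    return loss

logits, labels = torch.randn(2, 4, 32, 32), torch.randint(0, 4, (2, 32, 32))
print(float(multiclass_topo_loss(logits, labels, num_classes=4)))
```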
In this paper, we address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios, where irregular flight trajectories, such as hovering, turning left/right, and moving up/down, lead to significantly greater complexity compared to fixed-camera MOT. Specifically, changes in the scene background not only render traditional frame-to-frame object IOU association methods ineffective but also introduce significant view shifts in the objects, which complicate tracking. To overcome these issues, we propose a novel universal HomView-MOT framework, which, for the first time, harnesses the view homography inherent in changing scenes to solve MOT challenges in moving environments, incorporating Homographic Matching and View-Centric concepts. We introduce a Fast Homography Estimation (FHE) algorithm for rapid computation of homography matrices between video frames, enabling object View-Centric ID Learning (VCIL) and leveraging multi-view homography to learn cross-view ID features. Concurrently, our Homographic Matching Filter (HMF) maps object bounding boxes from different frames onto a common view plane for a more realistic physical IOU association. Extensive experiments have proven that these innovations allow HomView-MOT to achieve state-of-the-art performance on the prominent UAV MOT datasets VisDrone and UAVDT.
https://arxiv.org/abs/2403.10830
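To illustrate the "common view plane" association, the sketch below estimates a homography between consecutive frames from (synthetic) point matches, warps a previous-frame box into the current view, and computes IOU against a current detection; the data and motion are toy stand-ins for the paper's FHE/HMF modules.

```python
import cv2
import numpy as np

def warp_box(box, H):
    x1, y1, x2, y2 = box
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    x, y = warped[:, 0], warped[:, 1]
    return [x.min(), y.min(), x.max(), y.max()]          # axis-aligned bound of the warped box

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

pts_prev = np.random.rand(30, 1, 2).astype(np.float32) * 640
pts_curr = pts_prev + np.float32([5.0, -3.0])             # toy camera motion between frames
H, _ = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC)
prev_box, curr_det = [100, 100, 200, 180], [104, 96, 205, 178]
print("IOU on common plane:", iou(warp_box(prev_box, H), curr_det))
```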
Trajectory prediction and generation are vital for autonomous robots navigating dynamic environments. While prior research has typically focused on either prediction or generation, our approach unifies these tasks to provide a versatile framework and achieve state-of-the-art performance. Diffusion models, which are currently state-of-the-art for learned trajectory generation in long-horizon planning and offline reinforcement learning tasks, rely on a computationally intensive iterative sampling process. This slow process impedes the dynamic capabilities of robotic systems. In contrast, we introduce Trajectory Conditional Flow Matching (T-CFM), a novel data-driven approach that utilizes flow matching techniques to learn a time-varying vector field, enabling efficient and fast trajectory generation. We demonstrate the effectiveness of T-CFM on three separate tasks: adversarial tracking, real-world aircraft trajectory forecasting, and long-horizon planning. Our model outperforms state-of-the-art baselines with a 35% increase in predictive accuracy and a 142% increase in planning performance. Notably, T-CFM achieves up to 100$\times$ speed-up compared to diffusion-based models without sacrificing accuracy, which is crucial for real-time decision making in robotics.
https://arxiv.org/abs/2403.10809
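A minimal sketch of a conditional flow matching objective of the kind T-CFM builds on: interpolate between a noise sample and a target trajectory, regress the network's vector field onto the constant velocity $x_1 - x_0$, and integrate the learned field with a few Euler steps at inference. Network size, trajectory dimensionality, and step count are assumptions.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):                      # x: (B, dim), t: (B, 1)
        return self.net(torch.cat([x, t], dim=-1))

def cfm_loss(model, x1):
    x0 = torch.randn_like(x1)                     # source (noise) sample
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolant
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

def sample(model, dim, steps=10):
    x = torch.randn(1, dim)
    for i in range(steps):                        # few-step Euler integration -> fast generation
        t = torch.full((1, 1), i / steps)
        x = x + model(x, t) / steps
    return x

model = VectorField(dim=20)                       # e.g., a flattened 10-waypoint 2D trajectory
loss = cfm_loss(model, torch.randn(64, 20))
loss.backward()
print(sample(model, dim=20).shape)
```

The few-step integration at sampling time is where the claimed speed-up over iterative diffusion sampling would come from.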
The task of searching for visual objects in a large image dataset is difficult because it requires efficient matching and accurate localization of objects that can vary in size. Although the segment anything model (SAM) offers a potential solution for extracting object spatial context, learning embeddings for local objects remains a challenging problem. This paper presents a novel unsupervised deep metric learning approach, termed unsupervised collaborative metric learning with mixed-scale groups (MS-UGCML), devised to learn embeddings for objects of varying scales. Following this, a benchmark of challenges is assembled by utilizing COCO 2017 and VOC 2007 datasets to facilitate the training and evaluation of general object retrieval models. Finally, we conduct comprehensive ablation studies and discuss the complexities faced within the domain of general object retrieval. Our object retrieval evaluations span a range of datasets, including BelgaLogos, Visual Genome, LVIS, in addition to a challenging evaluation set that we have individually assembled for open-vocabulary evaluation. These comprehensive evaluations effectively highlight the robustness of our unsupervised MS-UGCML approach, with an object level and image level mAPs improvement of up to 6.69% and 10.03%, respectively. The code is publicly available at this https URL.
https://arxiv.org/abs/2403.10798
We present DPPE, a dense pose estimation algorithm that functions over a Plenoxels environment. Recent advances in neural radiance field techniques have shown that it is a powerful tool for environment representation. More recent neural rendering algorithms have significantly improved both training duration and rendering speed. Plenoxels introduced a fully-differentiable radiance field technique that uses Plenoptic volume elements contained in voxels for rendering, offering reduced training times and better rendering accuracy, while also eliminating the neural net component. In this work, we introduce a 6-DoF monocular RGB-only pose estimation procedure for Plenoxels, which seeks to recover the ground truth camera pose after a perturbation. We employ a variation on classical template matching techniques, using stochastic gradient descent to optimize the pose by minimizing errors in re-rendering. In particular, we examine an approach that takes advantage of the rapid rendering speed of Plenoxels to numerically approximate part of the pose gradient, using a central differencing technique. We show that such methods are effective in pose estimation. Finally, we perform ablations over key components of the problem space, with a particular focus on image subsampling and Plenoxel grid resolution. Project website: this https URL
https://arxiv.org/abs/2403.10773
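The central-differencing idea above can be sketched with a toy renderer: the gradient of the photometric re-rendering error with respect to each pose parameter is approximated by two extra renders at pose +/- epsilon, and the pose is updated by gradient descent. The renderer below is a smooth stand-in, not Plenoxels.

```python
import numpy as np

def render(pose):
    # Placeholder for a fast renderer: the image depends smoothly on the 6-DoF pose.
    xs = np.linspace(0, 1, 64)
    return np.sin(10 * xs[None, :] + pose[:3].sum()) + np.cos(10 * xs[:, None] + pose[3:].sum())

def rerender_error(pose, target):
    return float(np.mean((render(pose) - target) ** 2))

def central_diff_grad(pose, target, eps=1e-3):
    grad = np.zeros_like(pose)
    for i in range(pose.size):                     # one +/- render pair per pose parameter
        d = np.zeros_like(pose)
        d[i] = eps
        grad[i] = (rerender_error(pose + d, target) - rerender_error(pose - d, target)) / (2 * eps)
    return grad

true_pose = np.array([0.2, -0.1, 0.05, 0.1, 0.0, -0.2])
target = render(true_pose)
pose = np.zeros(6)                                 # perturbed initial guess
for _ in range(200):                               # SGD-style updates against re-rendering error
    pose -= 0.05 * central_diff_grad(pose, target)
print("final error:", rerender_error(pose, target))
```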
The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches in audio-image learning assign a single image randomly selected from the corresponding video stream to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image because a single image can only represent a snapshot of a scene, though the target audio changes from moment to moment. To address this problem, we propose two methods for audio and image matching that effectively capture the temporal information: (i) Nearest Match wherein an image is selected from multiple time frames based on similarity with audio, and (ii) Multiframe Match wherein audio and image pairs of multiple time frames are used. Experimental results show that method (i) improves the audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves the performance of audio-image retrieval while not showing significant improvements in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.
https://arxiv.org/abs/2403.10756
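A small sketch of the Nearest Match idea: given embeddings for several frames of the video stream and one audio clip, select the frame whose embedding is most similar to the audio embedding and use that pair for training. The embeddings here are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def nearest_match(frame_embs, audio_emb):
    # frame_embs: (T, D), one embedding per time frame; audio_emb: (D,)
    sims = F.cosine_similarity(frame_embs, audio_emb.unsqueeze(0), dim=-1)   # (T,)
    best = sims.argmax()
    return best.item(), frame_embs[best]

frames = torch.randn(8, 512)            # 8 candidate frames from the video stream
audio = torch.randn(512)
idx, matched = nearest_match(frames, audio)
print("selected frame:", idx)
```

Multiframe Match would instead keep all (frame, audio-segment) pairs across time rather than only the single most similar frame.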