Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques such as non-maximum suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, which eliminate the need for NMS and instead rely on object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing the number of object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a method for precise alignment between object queries and target features. Our approach demonstrates remarkable reductions in false positives and substantial improvements in table detection performance, particularly in complex documents with diverse table structures. This work enables more efficient and accurate table detection in semi-supervised settings.
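A minimal sketch of the confidence-based pseudo-labeling that semi-supervised detection pipelines of this kind typically rely on; the `Detection` class, the 0.7 threshold, and the teacher/student split below are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch: confidence-filtered pseudo-labeling for semi-supervised detection.
# The threshold and the Detection structure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixels
    score: float        # teacher confidence in [0, 1]
    label: str          # e.g. "table"

def filter_pseudo_labels(teacher_dets, threshold=0.7):
    """Keep only high-confidence teacher detections as pseudo ground truth."""
    return [d for d in teacher_dets if d.score >= threshold]

# The student would then train on labeled data plus these pseudo-labels.
teacher_dets = [Detection((10, 20, 300, 180), 0.92, "table"),
                Detection((50, 400, 120, 450), 0.31, "table")]
print(filter_pseudo_labels(teacher_dets))  # only the 0.92 detection survives
```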
https://arxiv.org/abs/2405.00187
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
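A minimal sketch of the test-time merging of multiple slot sets via Hungarian matching; the slot dimensions, the cosine-similarity cost, and merging matched slots by averaging are illustrative assumptions:

```python
# Sketch: merge two independently learned slot sets with Hungarian matching.
# Averaging matched slots is an assumed merge rule for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(slots_a, slots_b):
    """slots_a, slots_b: (num_slots, dim) arrays; returns merged slots."""
    a = slots_a / np.linalg.norm(slots_a, axis=1, keepdims=True)
    b = slots_b / np.linalg.norm(slots_b, axis=1, keepdims=True)
    cost = -a @ b.T                          # negative cosine similarity
    row, col = linear_sum_assignment(cost)   # optimal one-to-one matching
    return (slots_a[row] + slots_b[col]) / 2.0

rng = np.random.default_rng(0)
slots_a = rng.normal(size=(6, 64))
slots_b = slots_a[rng.permutation(6)] + 0.01 * rng.normal(size=(6, 64))
merged = merge_slot_sets(slots_a, slots_b)
print(merged.shape)  # (6, 64)
```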
https://arxiv.org/abs/2404.19654
Matching visible and near-infrared (NIR) images remains a significant challenge in remote sensing image fusion. The nonlinear radiometric differences between heterogeneous remote sensing images make the image matching task even more difficult. Deep learning has gained substantial attention in computer vision tasks in recent years; however, many methods rely on supervised learning and require large amounts of annotated data, which is frequently scarce in the field of remote sensing image matching. To address this challenge, this paper proposes a novel keypoint descriptor approach that obtains robust feature descriptors via a self-supervised matching network. A lightweight transformer network, termed LTFormer, is designed to generate deep-level feature descriptors. Furthermore, we implement an innovative triplet loss function, LT Loss, to further enhance matching performance. Our approach outperforms conventional hand-crafted local feature descriptors and proves competitive with state-of-the-art deep learning-based methods, even amidst the shortage of annotated data.
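For reference, the conventional triplet margin loss that descriptor networks of this kind build on; the abstract does not spell out LT Loss itself, so the formulation and margin below are illustrative assumptions:

```python
# Sketch: standard triplet margin loss over descriptor batches.
# The margin value is an illustrative assumption.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """All inputs: (batch, dim) descriptor arrays."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(8, 128)) for _ in range(3))
print(triplet_loss(a, p, n))
```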
https://arxiv.org/abs/2404.19311
We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method, dubbed XFeat (Accelerated Features), revisits fundamental design choices in convolutional neural networks for detecting, extracting, and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices. In particular, accurate image matching requires sufficiently large image resolutions - for this reason, we keep the resolution as large as possible while limiting the number of channels in the network. Besides, our model is designed to offer the choice of matching at the sparse or semi-dense levels, each of which may be more suitable for different downstream applications, such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, proven in pose estimation and visual localization. We showcase it running in real-time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at this http URL.
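A minimal sketch of mutual-nearest-neighbor matching between two sets of local descriptors, the kind of sparse matching step such an extractor feeds into; the random descriptors and their sizes are illustrative stand-ins:

```python
# Sketch: mutual nearest-neighbor matching of L2-normalized descriptors.
# The descriptor source and set sizes are illustrative assumptions.
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """desc_a: (N, D), desc_b: (M, D); returns index pairs (i, j)."""
    sim = desc_a @ desc_b.T                    # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)                 # best b for each a
    nn_ba = sim.argmax(axis=0)                 # best a for each b
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(0)
desc_a = rng.normal(size=(100, 64))
desc_a /= np.linalg.norm(desc_a, axis=1, keepdims=True)
desc_b = np.vstack([desc_a[:50], rng.normal(size=(50, 64))])
desc_b /= np.linalg.norm(desc_b, axis=1, keepdims=True)
print(len(mutual_nn_matches(desc_a, desc_b)))  # roughly the 50 shared points
```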
https://arxiv.org/abs/2404.19174
Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching a prompt to a set of relevant adapters, building on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then assembles adapters according to how well they fit the prompt's keywords. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and, with humans and multimodal models as evaluators, is preferred twice as often as the base model. See this http URL for more.
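A minimal sketch of the retrieval stage: score pre-computed adapter embeddings against a prompt embedding and keep the top-k. The toy embeddings, the database size, and k are illustrative assumptions, not the Stylus implementation (the real database holds 75K adapters):

```python
# Sketch: cosine-similarity retrieval over pre-computed adapter embeddings.
# Embedding values, dimensionality, and k are illustrative assumptions.
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, k=3):
    """prompt_emb: (D,), adapter_embs: (N, D); returns indices of top-k adapters."""
    sims = adapter_embs @ prompt_emb / (
        np.linalg.norm(adapter_embs, axis=1) * np.linalg.norm(prompt_emb))
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
adapter_embs = rng.normal(size=(10_000, 256))   # toy stand-in database
prompt_emb = adapter_embs[42] + 0.1 * rng.normal(size=256)
print(retrieve_adapters(prompt_emb, adapter_embs))  # index 42 ranks first
```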
https://arxiv.org/abs/2404.18928
[Abridged Abstract] Recent technological advances underscore labor market dynamics, yielding significant consequences for employment prospects and increasing job vacancy data across platforms and languages. Aggregating such data holds potential for valuable insights into labor market demands, the emergence of new skills, and job matching for various stakeholders. However, despite prevalent insights in the private sector, transparent language technology systems and data for this domain are lacking. This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions, identifying challenges that include the scarcity of training data, the lack of standardized annotation guidelines, and the shortage of effective extraction methods for job ads. We frame the problem, obtain annotated data, and introduce extraction methodologies. Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training. We propose skill extraction using weak supervision, a taxonomy-aware pre-training methodology adapting multilingual language models to the job market domain, and a retrieval-augmented model leveraging multiple skill extraction datasets to enhance overall performance. Finally, we ground the extracted information within a designated taxonomy.
https://arxiv.org/abs/2404.18977
Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of behavior distribution, the high-dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state-matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces the open-loop model-based imitation learning regularization to stabilize training, and model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate the interference from the regularizations while ensuring their effectiveness. Experimental results using the large-scale Waymo open motion dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
https://arxiv.org/abs/2404.18464
Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due to the deficiency of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at \url{this https URL} to ensure transparency, reproducibility, and application to new domains.
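A minimal sketch of dense retrieval scoring, where a query and candidate passages live in one embedding space and are ranked by inner product; the random embeddings below are an illustrative stand-in for the trained retriever:

```python
# Sketch: inner-product ranking over dense passage embeddings.
# The toy embeddings stand in for a trained biomedical retriever.
import numpy as np

def rank_passages(query_emb, passage_embs):
    """query_emb: (D,), passage_embs: (N, D); returns indices best-first."""
    scores = passage_embs @ query_emb
    return np.argsort(-scores)

rng = np.random.default_rng(0)
passage_embs = rng.normal(size=(1000, 384))
query_emb = passage_embs[7] + 0.05 * rng.normal(size=384)  # near passage 7
print(rank_passages(query_emb, passage_embs)[:3])  # passage 7 comes first
```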
https://arxiv.org/abs/2404.18443
The proliferation of social media has led to information overload and increased interest in opinion mining. We propose "Question-Answering Network Analysis" (QANA), a novel opinion mining framework that utilizes Large Language Models (LLMs) to generate questions from users' comments, constructs a bipartite graph based on the comments' answerability to the questions, and applies centrality measures to examine the importance of opinions. We investigate the impact of question generation styles, LLM selections, and the choice of embedding model on the quality of the constructed QA networks by comparing them with annotated Key Point Analysis datasets. QANA achieves performance comparable to previous state-of-the-art supervised models in a zero-shot manner on the Key Point Matching task, while also reducing the computational cost from quadratic to linear. For Key Point Generation, questions with high PageRank or degree centrality align well with manually annotated key points. Notably, QANA enables analysts to assess the importance of key points from various aspects according to their selection of centrality measure. QANA's primary contribution lies in its flexibility to extract key points from a wide range of perspectives, which enhances the quality and impartiality of opinion mining.
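A minimal sketch of the QA-network construction: comments and generated questions form a bipartite graph with an edge wherever a comment can answer a question, and a centrality measure then scores importance. The toy nodes and edges are illustrative assumptions:

```python
# Sketch: bipartite comment-question graph scored with PageRank centrality.
# The nodes and answerability edges are illustrative toy data.
import networkx as nx

comments = ["c1", "c2", "c3"]
questions = ["q1", "q2"]
answerable = [("c1", "q1"), ("c2", "q1"), ("c2", "q2"), ("c3", "q2")]

G = nx.Graph()
G.add_nodes_from(comments, bipartite=0)
G.add_nodes_from(questions, bipartite=1)
G.add_edges_from(answerable)

pagerank = nx.pagerank(G)
# Questions answerable by many comments receive higher centrality.
print(sorted(((n, round(s, 3)) for n, s in pagerank.items() if n in questions),
             key=lambda x: -x[1]))
```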
https://arxiv.org/abs/2404.18371
Skeleton-based action recognition is vital for comprehending human-centric videos and has applications in diverse domains. One of the challenges of skeleton-based action recognition is dealing with low-quality data, such as skeletons that have missing or inaccurate joints. This paper addresses the issue of enhancing action recognition using low-quality skeletons through a general knowledge distillation framework. The proposed framework employs a teacher-student model setup, where a teacher model trained on high-quality skeletons guides the learning of a student model that handles low-quality skeletons. To bridge the gap between heterogeneous high-quality and low-quality skeletons, we present a novel part-based skeleton matching strategy, which exploits shared body parts to facilitate local action pattern learning. An action-specific part matrix is developed to emphasize critical parts for different actions, enabling the student model to distill discriminative part-level knowledge. A novel part-level multi-sample contrastive loss achieves knowledge transfer from multiple high-quality skeletons to low-quality ones, which enables the proposed knowledge distillation framework to train on low-quality skeletons that lack corresponding high-quality matches. Comprehensive experiments conducted on the NTU-RGB+D, Penn Action, and SYSU 3D HOI datasets demonstrate the effectiveness of the proposed knowledge distillation framework.
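A minimal sketch of a contrastive loss with multiple positives per sample, in the spirit of transferring knowledge from several high-quality skeletons to one low-quality skeleton; the shapes, temperature, and InfoNCE-style form are illustrative assumptions:

```python
# Sketch: multi-positive contrastive loss over L2-normalized features.
# Shapes and the temperature tau are illustrative assumptions.
import numpy as np

def multi_positive_contrastive(student, positives, negatives, tau=0.1):
    """student: (D,), positives: (P, D), negatives: (N, D), all L2-normalized."""
    pos = np.exp(positives @ student / tau)
    neg = np.exp(negatives @ student / tau).sum()
    # Average the InfoNCE-style term over all positives.
    return float(-np.mean(np.log(pos / (pos + neg))))

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
s = unit(rng.normal(size=64))                    # low-quality skeleton feature
pos = unit(s + 0.1 * rng.normal(size=(4, 64)))   # multiple high-quality matches
neg = unit(rng.normal(size=(32, 64)))            # other samples
print(multi_positive_contrastive(s, pos, neg))
```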
https://arxiv.org/abs/2404.18206
In this work, we propose a novel discriminative framework for dexterous grasp generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues, we propose progressive strategies for both the training and testing phases. First, the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second, we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably, while keeping high quality, the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Codes are available at this https URL.
https://arxiv.org/abs/2404.18135
Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Different from previous approaches that focus on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, where an anchor branch is first trained to provide insights into the data properties, with a target branch gaining more advanced knowledge to develop optimal features and distance metrics. The anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, the target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that our DBL can achieve impressive and consistent improvements on top of various recent state-of-the-art models in the image-text matching field, and outperform related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Beyond the above, we confirm that DBL can be seamlessly integrated into their training scenarios and achieve superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method. Our code is publicly available at: this https URL.
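A minimal sketch of the two-branch margin idea: the anchor branch trains with a fixed margin while the target branch is held to a stricter one on the same triplets. The hinge form and margin values are illustrative assumptions; the paper's adaptive constraints are more elaborate:

```python
# Sketch: anchor branch with a base margin, target branch with a larger one.
# Margin values and the hinge loss form are illustrative assumptions.
import numpy as np

def hinge_triplet(d_pos, d_neg, margin):
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
d_pos = np.abs(rng.normal(0.5, 0.1, size=64))   # matched image-text distances
d_neg = np.abs(rng.normal(1.0, 0.2, size=64))   # unmatched distances

anchor_loss = hinge_triplet(d_pos, d_neg, margin=0.2)   # foundational branch
target_loss = hinge_triplet(d_pos, d_neg, margin=0.4)   # stricter separation
print(anchor_loss, target_loss)  # the target branch penalizes more pairs
```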
https://arxiv.org/abs/2404.18114
LiDAR-camera extrinsic calibration (LCEC) is crucial for data fusion in intelligent vehicles. Offline, target-based approaches have long been the preferred choice in this field. However, they often demonstrate poor adaptability to real-world environments. This is largely because extrinsic parameters may change significantly due to moderate shocks or during extended operations in environments with vibrations. In contrast, online, target-free approaches provide greater adaptability yet typically lack robustness, primarily due to the challenges in cross-modal feature matching. Therefore, in this article, we unleash the full potential of large vision models (LVMs), which are emerging as a significant trend in the fields of computer vision and robotics, especially for embodied artificial intelligence, to achieve robust and accurate online, target-free LCEC across a variety of challenging scenarios. Our main contributions are threefold: we introduce a novel framework known as MIAS-LCEC, provide an open-source versatile calibration toolbox with an interactive visualization interface, and publish three real-world datasets captured from various indoor and outdoor environments. The cornerstone of our framework and toolbox is the cross-modal mask matching (C3M) algorithm, developed based on a state-of-the-art (SoTA) LVM and capable of generating sufficient and reliable matches. Extensive experiments conducted on these real-world datasets demonstrate the robustness of our approach and its superior performance compared to SoTA methods, particularly for solid-state LiDARs with super-wide fields of view.
https://arxiv.org/abs/2404.18083
Moving objects are frequently seen in daily life and usually appear blurred in images due to their motion. While general object retrieval is a widely explored area in computer vision, it primarily focuses on sharp and static objects, and retrieval of motion-blurred objects in large image collections remains unexplored. We propose a method for object retrieval in images that are affected by motion blur. The proposed method learns a robust representation capable of matching blurred objects to their deblurred versions and vice versa. To evaluate our approach, we present the first large-scale datasets for blurred object retrieval, featuring images with objects exhibiting varying degrees of blur in various poses and scales. We conducted extensive experiments, showing that our method outperforms state-of-the-art retrieval methods on the new blur-retrieval datasets, which validates the effectiveness of the proposed approach.
https://arxiv.org/abs/2404.18025
In this article, a novel approach for merging 3D point cloud maps in the context of egocentric multi-robot exploration is presented. Unlike traditional methods, the proposed approach leverages state-of-the-art place recognition and learned descriptors to efficiently detect overlap between maps, eliminating the need for the time-consuming global feature extraction and feature matching process. The estimated overlapping regions are used to calculate a homogeneous rigid transform, which serves as an initial condition for the GICP point cloud registration algorithm to refine the alignment between the maps. The advantages of this approach include faster processing time, improved accuracy, and increased robustness in challenging environments. Furthermore, the effectiveness of the proposed framework is successfully demonstrated through multiple field missions of robot exploration in a variety of different underground environments.
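A minimal sketch of estimating the initial rigid transform from matched points in the overlapping regions (the classic SVD/Kabsch alignment), whose output would then seed the GICP refinement; the synthetic correspondences are illustrative:

```python
# Sketch: Kabsch/SVD estimation of a rigid transform from correspondences.
# The correspondences here are synthetic and noise-free for illustration.
import numpy as np

def rigid_transform(src, dst):
    """src, dst: (N, 3) corresponding points; returns R (3x3), t (3,)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_transform(src, dst)
print(np.allclose(R, R_true, atol=1e-6))  # True
```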
https://arxiv.org/abs/2404.18006
We present an approach to backpropagating through minimal problem solvers in end-to-end neural network training. Traditional methods relying on manually constructed formulas, finite differences, and autograd are laborious, approximate, and unstable for complex minimal problem solvers. We show that using the Implicit function theorem to calculate derivatives to backpropagate through the solution of a minimal problem solver is simple, fast, and stable. We compare our approach to (i) using the standard autograd on minimal problem solvers and relate it to existing backpropagation formulas through SVD-based and Eig-based solvers and (ii) implementing the backprop with an existing PyTorch Deep Declarative Networks (DDN) framework. We demonstrate our technique on a toy example of training outlier-rejection weights for 3D point registration and on a real application of training an outlier-rejection and RANSAC sampling network in image matching. Our method provides 100% stability and is 10 times faster compared to autograd, which is unstable and slow, and compared to DDN, which is stable but also slow.
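A minimal sketch of the implicit-function-theorem derivative on a scalar toy problem: a solver finds y(x) satisfying g(x, y) = 0, and dy/dx = -(dg/dy)^{-1} dg/dx is computed from the solution alone, without differentiating through the solver's iterations. The toy g below is an illustrative choice, not one of the paper's minimal problems:

```python
# Sketch: IFT gradient through a black-box root solver, checked against
# finite differences. g(x, y) = y^3 + y - x is an illustrative toy problem.
import numpy as np

def g(x, y):       return y**3 + y - x
def dg_dy(x, y):   return 3 * y**2 + 1
def dg_dx(x, y):   return -1.0

def solve(x, y0=0.0, iters=50):
    y = y0
    for _ in range(iters):               # Newton's method as the "solver"
        y -= g(x, y) / dg_dy(x, y)
    return y

x = 2.0
y = solve(x)
ift_grad = -dg_dx(x, y) / dg_dy(x, y)    # implicit function theorem

eps = 1e-6                               # finite-difference check
fd_grad = (solve(x + eps) - solve(x - eps)) / (2 * eps)
print(ift_grad, fd_grad)                 # the two agree closely
```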
https://arxiv.org/abs/2404.17993
Document layout analysis involves understanding the arrangement of elements within a document. This paper navigates the complexities of understanding various elements within document images, such as text, images, tables, and headings. The approach employs an advanced Transformer-based object detection network as an innovative graphical page object detector for identifying tables, figures, and displayed elements. We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning, enhancing efficiency in the decoder phase. We also present a hybrid matching scheme that integrates the decoder's original one-to-one matching strategy with the one-to-many matching strategy during the training phase. This approach aims to improve the model's accuracy and versatility in detecting various graphical elements on a page. Our experiments on PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods. It achieves an average precision of 97.3% on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables, demonstrating its superior performance in layout analysis. These advancements not only enhance the conversion of document images into editable and accessible formats but also streamline information retrieval and data extraction processes.
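A minimal sketch of hybrid matching during training: the usual one-to-one Hungarian assignment is complemented by a one-to-many branch that also supervises each ground-truth object with its k cheapest queries. The random cost matrix and the value of k are illustrative assumptions:

```python
# Sketch: combining one-to-one Hungarian matching with a one-to-many branch.
# The cost matrix and k are illustrative toy values.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.uniform(size=(20, 4))          # 20 queries x 4 ground-truth objects

# One-to-one branch: each object gets exactly one query.
q_one, gt_one = linear_sum_assignment(cost)

# One-to-many branch: each object additionally supervises its k best queries.
k = 3
one_to_many = {gt: np.argsort(cost[:, gt])[:k].tolist()
               for gt in range(cost.shape[1])}

print(dict(zip(gt_one, q_one)))   # one-to-one assignment
print(one_to_many)                # auxiliary one-to-many assignment
```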
https://arxiv.org/abs/2404.17888
In open-set recognition, existing methods generally learn statically fixed decision boundaries using known classes to reject unknown classes. Though they have achieved promising results, such decision boundaries are evidently insufficient for universal unknown classes in dynamic and open scenarios, as unknown samples can potentially appear at any position in the feature space. Moreover, these methods simply reject unknown class samples during testing without any effective utilization of them. In fact, such samples can constitute the true instantiated representation of the unknown classes and further enhance the model's performance. To address these issues, this paper proposes a novel dynamic-against-dynamic idea, i.e., a dynamic method against the dynamically changing open-set world, for which an open-set self-learning (OSSL) framework is correspondingly developed. OSSL starts with a good closed-set classifier trained on known classes and utilizes available test samples for model adaptation during testing, thus gaining adaptability to changing data distributions. In particular, a novel self-matching module is designed for OSSL, which automatically identifies known class samples while rejecting unknown class samples; the rejected samples are further utilized, as the instantiated representation of the unknown classes, to enhance the discriminability of the model. Our method establishes new performance milestones in almost all standard and cross-data benchmarks.
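A minimal sketch of the self-matching idea at test time: confident predictions are kept as known-class samples while the rest are routed to the unknown class and retained as its instantiated examples. The softmax scores and the threshold are illustrative assumptions:

```python
# Sketch: confidence-based routing of test samples to known vs. unknown.
# The 0.8 threshold and toy softmax outputs are illustrative assumptions.
import numpy as np

def self_match(probs, threshold=0.8):
    """probs: (N, C) softmax outputs; returns per-sample label or -1 (unknown)."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < threshold] = -1        # -1 marks rejected / unknown samples
    return labels

probs = np.array([[0.95, 0.03, 0.02],    # confident known sample
                  [0.40, 0.35, 0.25]])   # ambiguous -> unknown
print(self_match(probs))                 # [0, -1]
```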
https://arxiv.org/abs/2404.17830
Due to escalating privacy concerns, federated learning has been recognized as a vital approach for training deep neural networks with decentralized medical data. In practice, it is challenging to ensure consistent imaging quality across various institutions, often attributed to equipment malfunctions affecting a minority of clients. This imbalance in image quality can cause the federated model to develop an inherent bias towards higher-quality images, thus posing a severe fairness issue. In this study, we pioneer the identification and formulation of this new fairness challenge within the context of the imaging quality shift. Traditional methods for promoting fairness in federated learning predominantly focus on balancing empirical risks across diverse client distributions. This strategy primarily facilitates fair optimization across different training data distributions, yet neglects the crucial aspect of generalization. To address this, we introduce a solution termed Federated learning with Inter-client Sharpness Matching (FedISM). FedISM enhances both local training and global aggregation by incorporating sharpness-awareness, aiming to harmonize the sharpness levels across clients for fair generalization. Our empirical evaluations, conducted using the widely-used ICH and ISIC 2019 datasets, establish FedISM's superiority over current state-of-the-art federated learning methods in promoting fairness. Code is available at this https URL.
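A minimal sketch of a sharpness-aware update, the kind of mechanism that sharpness matching builds on: perturb the weights along the gradient direction, then descend using the gradient at the perturbed point. The toy loss, rho, and learning rate are illustrative assumptions, not the FedISM aggregation scheme:

```python
# Sketch: one sharpness-aware minimization (SAM) step on a toy loss.
# The loss function, rho, and lr are illustrative assumptions.
import numpy as np

def loss(w):      return float(np.sum(w**2) + np.sin(5 * w).sum())
def grad(w):      return 2 * w + 5 * np.cos(5 * w)

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the sharp point
    g_sharp = grad(w + eps)                      # gradient at perturbed weights
    return w - lr * g_sharp                      # descend with that gradient

w = np.array([1.5, -0.7])
for _ in range(100):
    w = sam_step(w)
print(w, loss(w))
```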
https://arxiv.org/abs/2404.17805
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experiment results across various datasets and models confirm CODER's effectiveness. Code is available at: this https URL.
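A minimal sketch of a cross-modal neighbor representation: an image embedding is re-expressed as its vector of similarities to a bank of text embeddings, and downstream classification then operates in that neighbor space. The random embeddings stand in for CLIP's encoders, and the bank size is an illustrative assumption:

```python
# Sketch: represent an image by its similarities to a bank of text embeddings.
# Random vectors stand in for CLIP's image/text encoders (an assumption).
import numpy as np

def coder_feature(image_emb, text_bank):
    """image_emb: (D,), text_bank: (T, D), all L2-normalized -> (T,) feature."""
    return text_bank @ image_emb          # cosine similarity to each text

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
text_bank = unit(rng.normal(size=(200, 512)))   # e.g. generated neighbor texts
image_emb = unit(rng.normal(size=512))
feat = coder_feature(image_emb, text_bank)
print(feat.shape)  # (200,) neighbor-based representation
```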
https://arxiv.org/abs/2404.17753