Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing generation quality across various tasks through in-context learning (ICL) with given examples, without the need for any fine-tuning. However, the diversity aspect of LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies LLM generations while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.
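Generation diversity of the kind discussed here is commonly quantified with surface-level metrics such as distinct-n; this abstract does not name its exact metrics, so the following is an illustrative sketch of one standard measure, not the paper's evaluation protocol.

```python
from collections import Counter  # (Counter shown for context; set suffices here)

def distinct_n(sentences, n=2):
    """Proportion of unique n-grams across a set of generated sentences.

    A common diversity proxy: higher values mean the generator reuses
    fewer n-grams across its outputs.
    """
    ngrams = []
    for s in sentences:
        tokens = s.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Two generations that overlap heavily score lower than two distinct ones.
similar = ["the dog runs in the park", "the dog runs in the garden"]
diverse = ["the dog runs in the park", "a child flies a bright kite"]
print(distinct_n(similar))  # 0.6
print(distinct_n(diverse))  # 1.0
```

Quality-oriented metrics (e.g. BLEU against references) can then be reported alongside such a diversity score to examine the quality-diversity trade-off the abstract describes.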
https://arxiv.org/abs/2404.16807
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first present an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Next, we provide an overview of evaluation protocols for continual learning with LLMs, along with the currently available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at this https URL.
https://arxiv.org/abs/2404.16789
This paper presents a novel learning approach for the Dubins Traveling Salesman Problem (DTSP) with Neighborhoods (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the Lin-Kernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL-with-demonstration schemes, most of which fail to sense all the task points.
https://arxiv.org/abs/2404.16721
Autonomous navigation in dynamic environments is a complex but essential task for autonomous robots, with recent deep reinforcement learning approaches showing promising results. However, the complexity of the real world makes it infeasible to train agents in every possible scenario configuration. Moreover, existing methods typically overlook factors such as robot kinodynamic constraints, or assume perfect knowledge of the environment. In this work, we present RUMOR, a novel planner for differential-drive robots that uses deep reinforcement learning to navigate in highly dynamic environments. Unlike other end-to-end DRL planners, it uses a descriptive robocentric velocity space model to extract the dynamic environment information, enhancing training effectiveness and scenario interpretation. Additionally, we propose an action space that inherently considers robot kinodynamics, and train the planner in a simulator that reproduces problematic aspects of the real world, reducing the gap between reality and simulation. We extensively compare RUMOR with other state-of-the-art approaches, demonstrating better performance, and provide a detailed analysis of the results. Finally, we validate RUMOR's performance in real-world settings by deploying it on a ground robot. Our experiments, conducted in crowded scenarios and unseen environments, confirm the algorithm's robustness and transferability.
https://arxiv.org/abs/2404.16672
Recent successes in Generative Artificial Intelligence (GenAI) have led to new technologies capable of generating high-quality code, natural language, and images. The next step is to integrate GenAI technology into products, a task typically conducted by software developers. Such product development always comes with a certain risk of liability. Within this article, we want to shed light on the current state of two such risks: data protection and copyright. Both aspects are crucial for GenAI. This technology deals with data for both model training and generated output. We summarize key aspects regarding our current knowledge that every software developer involved in product development using GenAI should be aware of to avoid critical mistakes that may expose them to liability claims.
https://arxiv.org/abs/2404.16630
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited in that they incorporate only syntactic or only lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntax-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all zero-shot cross-lingual transfer baselines, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
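The code-switching technique mentioned above can be sketched minimally: source words are swapped for dictionary translations so that the model sees lexical alignments in context. The toy lexicon, substitution ratio, and function name below are illustrative assumptions, not the paper's actual setup.

```python
import random

def code_switch(sentence, lexicon, ratio=0.5, seed=0):
    """Randomly replace a fraction of words with their translations.

    `lexicon` maps source-language words to target-language words; the
    substituted words implicitly expose lexical alignments to the model.
    """
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for word in sentence.split():
        if word in lexicon and rng.random() < ratio:
            out.append(lexicon[word])
        else:
            out.append(word)
    return " ".join(out)

en_de = {"house": "Haus", "red": "rot", "the": "das"}  # toy EN->DE lexicon
print(code_switch("the red house is old", en_de, ratio=1.0))
# -> "das rot Haus is old"
```

In the framework described above, such code-switched sequences would be fed both to the syntactic module and to the mBERT base model.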
https://arxiv.org/abs/2404.16627
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. Also, we introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.
https://arxiv.org/abs/2404.16621
Pre-trained large text-to-image (T2I) models with an appropriate text prompt have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works as a Museum. When facing a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generation of images. It can minimize the learning biases caused by the content of images and address the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting amongst past learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of the model update, which regularizes the diffusion model from both the weight and feature aspects, respectively. Meanwhile, a unique token embedding corresponding to the new style is learned by a task-wise token learning module, which preserves historical knowledge from past styles within the limitation of the LoRA parameter quantity. As new user-provided styles arrive, our MuseumMaker can capture the nuances of the new styles while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.
https://arxiv.org/abs/2404.16612
Spatiotemporal action localization in chaotic scenes is a challenging task toward advanced video understanding. Paving the way with high-quality video feature extraction and enhancing the precision of detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of our SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Secondly, we introduce a confidence maximum heap to prune the anchors detected in each frame and filter out the effective anchors. These designs enable our SFMViT to achieve a mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at this https URL.
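The "confidence maximum heap" pruning step plausibly amounts to retaining only the top-scoring anchors in each frame via a heap rather than a full sort; a minimal sketch under that assumption (the anchor representation and function name here are hypothetical, not from the paper):

```python
import heapq

def prune_anchors(anchors, k=3):
    """Keep the k highest-confidence anchors of a frame.

    `anchors` is a list of (confidence, box) pairs; heapq.nlargest
    maintains a small heap of size k instead of sorting the full list.
    """
    return heapq.nlargest(k, anchors, key=lambda a: a[0])

frame = [(0.91, "box_a"), (0.12, "box_b"), (0.55, "box_c"), (0.78, "box_d")]
print(prune_anchors(frame, k=2))
# -> [(0.91, 'box_a'), (0.78, 'box_d')]
```

For n anchors this runs in O(n log k) per frame, which matters when detectors emit many low-confidence candidates.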
https://arxiv.org/abs/2404.16609
Large language models (LLMs) show early signs of artificial general intelligence but struggle with hallucinations. One promising solution to mitigate these hallucinations is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation. However, such a solution risks compromising privacy, as recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models. The significant advantage of LLMs over traditional pre-trained models may exacerbate these concerns. To this end, we investigate the effectiveness of reconstructing original knowledge and predicting entity attributes from these embeddings when LLMs are employed. Empirical findings indicate that LLMs significantly improve the accuracy of two evaluated tasks over those from pre-trained models, regardless of whether the texts are in-distribution or out-of-distribution. This underscores a heightened potential for LLMs to jeopardize user privacy, highlighting the negative consequences of their widespread use. We further discuss preliminary strategies to mitigate this risk.
https://arxiv.org/abs/2404.16587
We revisit certain problems of pose estimation based on 3D--2D correspondences between features which may be points or lines. Specifically, we address the two previously-studied minimal problems of estimating camera extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ line--line correspondences. To the best of our knowledge, all of the previously-known practical solutions to these problems required computing the roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ polynomials when $p=1.$ We describe and implement two elementary solutions which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ to $4$, respectively. We show experimentally that the resulting solvers are numerically stable and fast: when compared to the previous state-of-the art, we may obtain nearly an order of magnitude speedup. The code is available at \url{this https URL\_absolute}
https://arxiv.org/abs/2404.16552
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating zero-shot 3D capability on unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D-to-3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving 2D VLM knowledge adaptation for efficient 3D learning. Experimental results show that OpenDlign significantly outperforms existing benchmarks in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
https://arxiv.org/abs/2404.16538
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++, which effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP), as it has less distortion, and meanwhile split the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic pinhole images. Both projections are shown to be effective in extracting knowledge from the source model. However, as the distinct projections make it less possible to directly transfer knowledge between domains, we then propose the Reliable Panoramic Prototype Adaptation Module (RP$^2$AM) to transfer knowledge at both the prediction and prototype levels. RP$^2$AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce the Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both the knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.
https://arxiv.org/abs/2404.16501
Automated driving systems require monitoring mechanisms to ensure safe operation, especially if system components degrade or fail. Their runtime self-representation plays a key role as it provides a priori knowledge about the system's capabilities and limitations. In this paper, we propose a data-driven approach for deriving such a self-representation model for the motion controller of an automated vehicle. A conformalized prediction model is learned that allows estimating how operational conditions, as well as potential degradations and failures of the vehicle's actuators, impact motion control performance. During runtime behavior generation, our predictor can provide a heuristic for determining the admissible action space.
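A conformalized prediction model of the kind described can be sketched with split conformal prediction: the appropriate empirical quantile of absolute residuals on a held-out calibration set yields a distribution-free error bound for new predictions. The numbers below are illustrative only, not from the paper.

```python
import math

def conformal_interval(calib_errors, alpha=0.1):
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest
    absolute calibration residual (clipped to the sample size)."""
    n = len(calib_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(calib_errors)[min(rank, n) - 1]

# |y - y_hat| on a held-out calibration set (illustrative numbers).
residuals = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.15, 0.25, 0.35]
q = conformal_interval(residuals, alpha=0.2)
# A new prediction y_hat then gets the interval [y_hat - q, y_hat + q],
# which covers the true value with probability >= 1 - alpha.
print(q)
```

Such a calibrated bound is what lets the predictor serve as a heuristic for the admissible action space: actions whose worst-case predicted tracking error (within the interval) violates a safety threshold can be excluded.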
https://arxiv.org/abs/2404.16500
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs Commonsense Prototype (CProto) characterized by high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at this https URL.
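One plausible reading of "refines the low-quality pseudo-labels by leveraging the size prior from CProto" is to interpolate each pseudo-box's dimensions toward the class-prototype size, weighted by detection confidence. The linear weighting scheme and the numbers below are assumptions for illustration, not the paper's exact rule.

```python
def refine_box_size(pseudo_size, proto_size, confidence):
    """Interpolate each box dimension toward the prototype size.

    Low-confidence pseudo-labels (typical for sparsely scanned objects)
    lean more heavily on the commonsense prototype prior.
    """
    return tuple(
        confidence * p + (1.0 - confidence) * q
        for p, q in zip(pseudo_size, proto_size)
    )

proto_car = (4.6, 1.9, 1.6)    # hypothetical (l, w, h) prototype for "car"
noisy_label = (3.0, 1.5, 1.2)  # undersized box estimated from sparse points
print(refine_box_size(noisy_label, proto_car, confidence=0.25))
```

With low confidence (0.25) the refined size lands close to the prototype, which is the intended correction for boxes shrunk by LiDAR sparsity.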
https://arxiv.org/abs/2404.16493
In the context of imitation learning applied to dexterous robotic hands, the high complexity of the systems makes learning complex manipulation tasks challenging. However, the numerous datasets depicting human hands performing various tasks could provide us with better knowledge regarding human hand motion. We propose a method that leverages multiple large-scale task-agnostic datasets to obtain latent representations that effectively encode motion subtrajectories, which we incorporate into a transformer-based behavior cloning method. Our results demonstrate that employing latent representations yields enhanced performance compared to conventional behavior cloning methods, particularly regarding resilience to errors and noise in perception and proprioception. Furthermore, the proposed approach relies solely on human demonstrations, eliminating the need for teleoperation and therefore accelerating the data acquisition process. Accurate inverse kinematics for fingertip retargeting ensures precise transfer from human hand data to the robot, facilitating effective learning and deployment of manipulation policies. Finally, the trained policies have been successfully transferred to a real-world 23-DoF robotic system.
https://arxiv.org/abs/2404.16483
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing for variations in their responses. For measuring reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERT, BLEU, and F-1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
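The two evaluation steps described (repeated querying for consistency, overlap scoring of explanations for reasoning) can be sketched as follows. The modal-answer consistency measure and token-level F1 below are illustrative stand-ins for the paper's exact protocol, which also uses BERT- and BLEU-based scores.

```python
def consistency(responses):
    """Fraction of repeated answers that match the modal answer."""
    top = max(set(responses), key=responses.count)
    return responses.count(top) / len(responses)

def token_f1(prediction, reference):
    """Token-overlap F1 between a generated and a reference explanation."""
    pred, ref = prediction.split(), reference.split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

answers = ["yes", "yes", "no", "yes"]  # same prompt asked 4 times
print(consistency(answers))            # 0.75
print(token_f1("water boils at 100 C", "water boils at 100 degrees C"))
```

Averaging these two scores over a question set gives the per-model consistency and reasoning numbers that the comparison in the abstract is based on.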
https://arxiv.org/abs/2404.16478
Multimodal sentiment analysis (MSA) aims to understand human sentiment through multimodal data. Most MSA efforts are based on the assumption of modality completeness. However, in real-world applications, some practical factors cause uncertain modality missingness, which drastically degrades the model's performance. To this end, we propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the MSA task under uncertain missing modalities. Specifically, we present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics. Moreover, a category-guided prototype distillation mechanism is introduced to capture cross-category correlations using category prototypes to align feature distributions and generate favorable joint representations. Eventually, we design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network through response disentanglement and mutual information maximization. Comprehensive experiments on three datasets indicate that our framework can achieve favorable improvements compared with several baselines.
https://arxiv.org/abs/2404.16456
Decision diagrams (DDs) are powerful tools to effectively represent propositional formulas, and they are widely used in many domains, in particular in formal verification and in knowledge compilation. Some forms of DDs (e.g., OBDDs, SDDs) are canonical, that is, (under given conditions on the atom list) they univocally represent equivalence classes of formulas. Given the limited expressiveness of propositional logic, a few attempts to leverage DDs at the SMT level have been presented in the literature. Unfortunately, these techniques still suffer from some limitations: most procedures are theory-specific; some produce theory DDs (T-DDs) which do not univocally represent T-valid formulas or T-inconsistent formulas; none of these techniques provably produces theory-canonical T-DDs, which (under given conditions on the T-atom list) univocally represent T-equivalence classes of formulas. Also, these procedures are not easy to implement, and very few implementations are actually available. In this paper, we present a novel, very general technique to leverage DDs at the SMT level, which has several advantages: it is very easy to implement on top of an AllSMT solver and a DD package, which are used as black boxes; it works for every form of DDs and every theory, or combination thereof, supported by the AllSMT solver; and it produces theory-canonical T-DDs if the propositional DD is canonical. We have implemented a prototype tool for both T-OBDDs and T-SDDs on top of OBDD and SDD packages and the MathSAT SMT solver. Some preliminary empirical evaluation supports the effectiveness of the approach.
https://arxiv.org/abs/2404.16455
Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we show two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, independent of their appearance, shape, size, quantity, and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context, while spatial heterogeneity manifests as a distinct image quality of the patch area that differs from the original clean image due to the independent generation process. Based on these observations, we propose PAD, a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches and is compatible with any pre-trained object detector. Our comprehensive digital and physical experiments involving diverse patch types, such as localized-noise, printable, and naturalistic patches, exhibit notable improvements over state-of-the-art works. Our code is available at this https URL.
https://arxiv.org/abs/2404.16452