Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
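A minimal sketch of how the two components could fit together, assuming frozen CLIP-style encoders: the classifier head ("decoder") is trained on text embeddings alone, and a mean-shift estimated from a few image-text pairs stands in for the modality alignment. Dimensions, data, and the alignment rule are illustrative assumptions, not Surg-FTDA's actual implementation.

```python
# Hedged sketch of the two components, assuming frozen CLIP-style encoders.
# The mean-shift alignment and all dimensions/data are illustrative assumptions,
# not Surg-FTDA's actual implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_classes = 512, 7                    # e.g. 7 surgical phases (illustrative)

text_emb = torch.randn(1000, dim)          # text embeddings for the downstream task
text_lbl = torch.randint(0, n_classes, (1000,))
few_img  = torch.randn(16, dim)            # few-shot selected image embeddings
few_txt  = torch.randn(16, dim)            # their paired text embeddings

# 1) Few-shot selection-based modality alignment: estimate an image->text offset.
gap = (few_txt - few_img).mean(dim=0)

# 2) Text-driven adaptation: the decoder is trained on text embeddings only.
decoder = nn.Linear(dim, n_classes)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(decoder(text_emb), text_lbl)
    loss.backward()
    opt.step()

# 3) Inference: shift image embeddings toward the text space, then decode.
test_img = torch.randn(4, dim)
print(decoder(test_img + gap).argmax(dim=1))
```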
https://arxiv.org/abs/2501.09555
Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model's decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework, which combines supervised contrastive knowledge distillation for improved representation learning and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and a Random Forest classifier to address class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at this https URL.
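As one concrete ingredient, a herding-style exemplar selector is a reasonable way to picture prioritized sample replay: it greedily keeps the few samples whose running mean best matches the class mean. This is a hedged sketch of the general idea, not necessarily SCLIFD's actual selection criterion.

```python
# Herding-style prioritized exemplar selection: greedily keep the samples whose
# running mean best matches the class mean. Illustrative only; not necessarily
# SCLIFD's exact selection criterion.
import numpy as np

def select_exemplars(features: np.ndarray, m: int) -> list:
    """features: (n, d) array for one class; returns indices of m exemplars."""
    mu = features.mean(axis=0)
    chosen, acc = [], np.zeros_like(mu)
    for _ in range(m):
        # distance to the class mean if each candidate were added to the exemplar set
        gains = np.linalg.norm(mu - (acc + features) / (len(chosen) + 1), axis=1)
        gains[chosen] = np.inf               # never pick the same sample twice
        idx = int(np.argmin(gains))
        chosen.append(idx)
        acc += features[idx]
    return chosen

feats = np.random.randn(200, 64)
print(select_exemplars(feats, m=5))
```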
https://arxiv.org/abs/2501.09525
Few-shot class-incremental learning (FSCIL) requires a model to learn new classes from a small number of training instances while retaining knowledge of previously learned classes. Existing frameworks typically freeze the parameters of the previously learned classes during the incorporation of new classes. However, this approach often results in suboptimal separation of the previously learned classes, leading to overlap between old and new classes. Consequently, performance on the old classes degrades as new classes are introduced. To address these challenges, we propose a novel feature-augmentation-driven contrastive learning framework designed to enhance the separation of previously learned classes so as to accommodate new classes. Our approach involves augmenting feature vectors and assigning proxy labels to these vectors. This strategy expands the feature space, ensuring seamless integration of new classes within the expanded space. Additionally, we employ a self-supervised contrastive loss to improve the separation between previous classes. We validate our framework through experiments on three FSCIL benchmark datasets: CIFAR100, miniImageNet, and CUB200. The results demonstrate that our Feature Augmentation driven Contrastive Learning framework significantly outperforms other approaches, achieving state-of-the-art performance.
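The core augmentation idea can be pictured with a small sketch: mix features across classes and assign each mixture a new proxy label, so that the contrastive objective has to keep extra regions of the embedding space free. This is an illustrative reading of the abstract, not the paper's exact recipe.

```python
# Illustrative sketch (not the paper's exact recipe): mix features from different
# classes and give each mixture a proxy label, so a supervised contrastive loss
# spreads the base classes apart and reserves embedding space for future classes.
import torch

def augment_with_proxies(features, labels, num_base_classes):
    """features: (N, D); labels: (N,). Returns augmented features and labels."""
    perm = torch.randperm(features.size(0))
    cross = labels != labels[perm]                   # only mix different classes
    mixed = 0.5 * (features[cross] + features[perm][cross])
    lo = torch.minimum(labels[cross], labels[perm][cross])
    hi = torch.maximum(labels[cross], labels[perm][cross])
    proxy = num_base_classes + lo * num_base_classes + hi   # one proxy id per class pair
    return torch.cat([features, mixed]), torch.cat([labels, proxy])

feats, labs = torch.randn(32, 64), torch.randint(0, 5, (32,))
aug_feats, aug_labs = augment_with_proxies(feats, labs, num_base_classes=5)
print(aug_feats.shape, sorted(aug_labs.unique().tolist()))
# The augmented batch would then feed the contrastive objective.
```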
https://arxiv.org/abs/2501.09361
Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
https://arxiv.org/abs/2501.09294
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal Simultaneous Localization and Mapping (SLAM). Firstly, building on a base DeepLabv3+ segmentation model and incorporating specific refinements based on the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization to unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, crack geometric attributes were measured automatically and directly within the 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.
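To make the "measure directly in 3D" point concrete, here is a much-simplified sketch that estimates a crack's length as the extent of its points along their principal axis; the paper's actual measurement pipeline is considerably more involved, and the synthetic data below is purely illustrative.

```python
# A much-simplified sketch of measuring a crack directly in 3D: estimate its length
# as the extent of the crack-labelled points along their principal axis. This only
# illustrates the idea of working in point-cloud space rather than in 2D images.
import numpy as np

def crack_length(points: np.ndarray) -> float:
    """points: (N, 3) crack points taken from the dense, segmented point cloud."""
    centred = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[0]                  # project onto the principal direction
    return float(proj.max() - proj.min())

rng = np.random.default_rng(0)
pts = np.column_stack([np.linspace(0.0, 0.8, 200),        # a ~0.8 m long synthetic crack
                       0.002 * rng.standard_normal(200),
                       0.001 * rng.standard_normal(200)])
print(round(crack_length(pts), 3))          # ~0.8
```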
https://arxiv.org/abs/2501.09203
In this work, we address the challenge of evaluating large language models (LLMs) on the short answer matching task for the Latvian and Lithuanian languages. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each question-answer pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small but meaningful changes in the text. These generated answers serve as test cases to assess the ability of LLMs to detect subtle deviations from the original answers. A subset of the datasets was manually verified for quality and accuracy. Our results show that while larger LLMs, such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect performance in distinguishing matched and non-matched answers, smaller models show more variance. For instance, LLaMa3.1 8b and EuroLLM 9b benefited from few-shot examples, while Mistral Nemo 12b underperformed on detection of subtle text alterations, particularly in Lithuanian, even with additional examples. QWEN2.5 7b and Mistral 7b obtained strong performance comparable to the larger 70b models in zero-shot and few-shot experiments, although the performance of Mistral 7b was weaker in the few-shot setting.
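A toy illustration of the setup, under stated assumptions: one made-up alteration rule (digit perturbation) produces a non-matching answer, and a simple few-shot prompt asks the model to judge the match. Neither the rule nor the template reproduces the paper's actual design.

```python
# Illustrative only: one toy alteration rule (digit perturbation) and a few-shot
# prompt layout. The paper's actual rules and prompt templates are not reproduced here.
def alter_answer(answer: str) -> str:
    """Produce a non-matching variant by perturbing the first digit, else negating."""
    for i, ch in enumerate(answer):
        if ch.isdigit():
            return answer[:i] + str((int(ch) + 1) % 10) + answer[i + 1:]
    return "not " + answer

def build_prompt(examples, question, reference, candidate):
    shots = "\n".join(
        f"Question: {q}\nReference: {r}\nCandidate: {c}\nMatch: {m}"
        for q, r, c, m in examples
    )
    return (f"{shots}\n\nQuestion: {question}\nReference: {reference}\n"
            f"Candidate: {candidate}\nMatch:")

ref = "Latvia regained independence in 1990."
print(build_prompt(
    [("2+2?", "4", "four", "yes"), ("2+2?", "4", alter_answer("4"), "no")],
    "When did Latvia regain independence?", ref, alter_answer(ref),
))
```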
https://arxiv.org/abs/2501.09164
Vision foundation models have achieved remarkable progress across various image analysis tasks. In the image segmentation task, foundation models like the Segment Anything Model (SAM) enable generalizable zero-shot segmentation through user-provided prompts. However, SAM, which is primarily trained on natural images, lacks domain-specific expertise in medical imaging. This limitation poses challenges when applying SAM to medical image segmentation, including the need for extensive fine-tuning on specialized medical datasets and a dependency on manual prompts, which are both labor-intensive and require intervention from medical experts. This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM), a novel method designed to adapt the advanced Segment Anything Model 2 (SAM2) for 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of SAM2 to enable few-shot adaptation, leveraging a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation, without requiring model fine-tuning. To handle the volumetric nature of medical images, we incorporate a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging datasets and compare it with supervised learning methods, zero-shot SAM approaches, and fine-tuned medical SAM methods. Results show that FATE-SAM delivers robust and accurate segmentation while eliminating the need for large annotated datasets and expert intervention. FATE-SAM provides a practical, efficient solution for medical image segmentation, making it more accessible for clinical applications.
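A generic way to picture slice-to-slice coherence, independent of SAM2's actual modules, is a majority vote over adjacent slice masks; how FATE-SAM implements its Volumetric Consistency mechanism is not specified in the abstract, so treat this purely as an illustration.

```python
# A generic sketch of slice-to-slice (volumetric) consistency, independent of SAM2's
# actual modules: smooth a stack of per-slice binary masks with a 3-slice majority
# vote. This is only an illustration of enforcing coherence across 3D slices, not
# FATE-SAM's mechanism.
import numpy as np

def volumetric_consistency(masks: np.ndarray) -> np.ndarray:
    """masks: (num_slices, H, W) binary array; returns a smoothed stack."""
    padded = np.pad(masks, ((1, 1), (0, 0), (0, 0)), mode="edge")
    votes = padded[:-2] + padded[1:-1] + padded[2:]
    return (votes >= 2).astype(masks.dtype)    # keep voxels that 2 of 3 slices agree on

vol = (np.random.rand(16, 64, 64) > 0.5).astype(np.uint8)
print(volumetric_consistency(vol).shape)
```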
https://arxiv.org/abs/2501.09138
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As an important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for the images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.
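The training-free adapter family this builds on can be sketched compactly: fuse the usual CLIP text-image logits with affinities to a small cache of labelled few-shot image features. The sketch below follows that Tip-Adapter-style pattern under stated assumptions; IDEA's use of per-image textual descriptions is only approximated by a single text prototype per class, and alpha/beta are illustrative hyperparameters.

```python
# A hedged, Tip-Adapter-style sketch of training-free few-shot adaptation: fuse the
# usual text-image similarity with similarity to a small cache of labelled image
# features. IDEA additionally exploits per-image textual descriptions; here a single
# text prototype per class stands in for that signal, and alpha/beta are illustrative.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, n_classes, shots = 512, 10, 4

text_proto = l2norm(rng.normal(size=(n_classes, d)))           # class text embeddings
cache_keys = l2norm(rng.normal(size=(n_classes * shots, d)))   # few-shot image features
cache_vals = np.repeat(np.eye(n_classes), shots, axis=0)       # their one-hot labels

def predict(img_feat, alpha=1.0, beta=5.0):
    img_feat = l2norm(img_feat)
    zero_shot = img_feat @ text_proto.T                         # CLIP-style text logits
    affinity = np.exp(-beta * (1.0 - img_feat @ cache_keys.T))  # cache affinities
    return zero_shot + alpha * (affinity @ cache_vals)          # fused few-shot logits

queries = rng.normal(size=(3, d))
print(predict(queries).argmax(axis=1))
```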
https://arxiv.org/abs/2501.08816
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model's performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o's promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: this https URL
https://arxiv.org/abs/2501.08799
Current fake image detectors trained on large synthetic image datasets perform satisfactorily on the limited set of studied generative models. However, they suffer a notable performance decline on unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images using very few samples. Experiments show FSD achieves state-of-the-art performance, improving average accuracy (ACC) by +7.4% on the GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features of unseen images without further training.
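A minimal prototype-style reading of "a specialized metric space plus very few samples": build real/fake prototypes from the few references and classify queries by distance. The embedding network that FSD actually learns is assumed given; random vectors stand in for its outputs.

```python
# Prototype-based few-shot detection sketch: classify a query by its distance to
# "real" and "fake" prototypes built from very few reference samples. Random vectors
# stand in for the outputs of a learned embedding network.
import numpy as np

def prototypes(embeddings, labels):
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query, protos):
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

rng = np.random.default_rng(1)
support = np.vstack([rng.normal(0, 1, (5, 128)), rng.normal(3, 1, (5, 128))])
labels = np.array(["real"] * 5 + ["fake"] * 5)
print(classify(rng.normal(3, 1, 128), prototypes(support, labels)))   # likely "fake"
```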
https://arxiv.org/abs/2501.08763
While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes, including appearance, motion patterns, and associated risks, LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human driving-learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: this https URL.
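The memory-bank-plus-scene-encoder idea can be pictured as a nearest-neighbour store keyed by compact scene embeddings. The sketch below is illustrative only; names, dimensions, and the cosine-similarity lookup are assumptions, not LeapVAD's actual design.

```python
# Sketch of memory-bank retrieval: store compact scene embeddings together with the
# experience gained there, and fetch the most similar past scenes at decision time.
import numpy as np

class MemoryBank:
    def __init__(self):
        self.keys, self.experiences = [], []

    def add(self, scene_embedding, experience):
        self.keys.append(scene_embedding / np.linalg.norm(scene_embedding))
        self.experiences.append(experience)

    def retrieve(self, query, k=2):
        q = query / np.linalg.norm(query)
        sims = np.array(self.keys) @ q                  # cosine similarity to all keys
        return [self.experiences[i] for i in np.argsort(-sims)[:k]]

rng = np.random.default_rng(0)
bank = MemoryBank()
for note in ["yield at narrow merge", "slow down near crosswalk", "overtake on highway"]:
    bank.add(rng.normal(size=32), note)
print(bank.retrieve(rng.normal(size=32)))
```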
https://arxiv.org/abs/2501.08168
Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.
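The tournament idea for scaling beyond the context window can be sketched generically: compare candidates in small groups and advance group winners. Here `pick_author` stands in for an LLM call that sees the query snippet plus one reference per candidate; the toy token-overlap comparator exists only so the sketch runs end to end.

```python
# Tournament-style attribution sketch: candidates are compared in small groups and
# group winners advance. `pick_author` stands in for an LLM judging call.
from typing import Callable, Dict

def tournament(query: str, references: Dict[str, str],
               pick_author: Callable[[str, Dict[str, str]], str],
               group_size: int = 4) -> str:
    candidates = list(references)
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates), group_size):
            group = {a: references[a] for a in candidates[i:i + group_size]}
            winners.append(pick_author(query, group))
        candidates = winners
    return candidates[0]

def toy_pick(query, group):
    """Toy comparator: pick the author whose reference shares the most tokens with the query."""
    q = set(query.split())
    return max(group, key=lambda a: len(q & set(group[a].split())))

refs = {f"author{i}": f"int foo{i}() {{ return {i}; }}" for i in range(10)}
print(tournament("int foo3() { return 3; }", refs, toy_pick))   # -> author3
```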
https://arxiv.org/abs/2501.08165
Empirical risk minimization (ERM) is not robust to changes in the distribution of data. When the distribution of test data is different from that of training data, the problem is known as out-of-distribution generalization. Recently, two techniques have been developed for addressing out-of-distribution generalization in computer vision: weight averaging (WA) and sharpness-aware minimization (SAM). WA involves training multiple models with different hyperparameters and then averaging the weights of these models, which can significantly improve out-of-distribution generalization performance. SAM optimizes a neural network to find minima in flat regions, which have been proven to perform well under distribution shifts. While these techniques have made great progress, there is still room for improvement and further exploration. In this thesis, we propose increasing the model diversity in WA explicitly by introducing gradient similarity as a loss regularizer to further improve out-of-distribution generalization performance. We also propose combining WA and SAM to solve the problem of few-shot domain adaptation. Our extensive experiments on digits datasets (MNIST, SVHN, USPS, MNIST-M) and other domain adaptation datasets (VLCS, PACS) show that combining WA and SAM leads to improved out-of-distribution generalization performance and significantly increases few-shot domain adaptation accuracy.
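A small sketch of the WA ingredient: average the parameters of several independently trained models. SAM and the gradient-similarity regulariser proposed in the thesis are not shown; the tiny linear models below only stand in for runs trained with different hyperparameters.

```python
# Weight averaging (WA) sketch: average the parameters of several models trained
# with different hyperparameters. SAM and the gradient-similarity regulariser are
# not shown here.
import copy
import torch
import torch.nn as nn

def average_weights(models):
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in avg.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return avg

models = [nn.Linear(8, 2) for _ in range(3)]     # stand-ins for independently trained runs
wa_model = average_weights(models)
print(wa_model.weight.shape)
```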
https://arxiv.org/abs/2501.08361
Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search. Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, why inserting special tokens induces harmful behaviors has only been discussed empirically. In this paper, we take a deeper look into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at this https URL.
https://arxiv.org/abs/2501.07959
Home Energy Management Systems (HEMSs) help households tailor their electricity usage based on power system signals such as energy prices. This technology helps to reduce energy bills and offers greater demand-side flexibility that supports the power system stability. However, residents who lack a technical background may find it difficult to use HEMSs effectively, because HEMSs require well-formatted parameterization that reflects the characteristics of the energy resources, houses, and users' needs. Recently, Large-Language Models (LLMs) have demonstrated an outstanding ability in language understanding. Motivated by this, we propose an LLM-based interface that interacts with users to understand and parameterize their "badly-formatted answers", and then outputs well-formatted parameters to implement an HEMS. We further use the Reason and Act method (ReAct) and few-shot prompting to enhance the LLM performance. Evaluating the interface performance requires multiple user-LLM interactions. To avoid the efforts in finding volunteer users and reduce the evaluation time, we additionally propose a method that uses another LLM to simulate users with varying expertise, ranging from knowledgeable to non-technical. By comprehensive evaluation, the proposed LLM-based HEMS interface achieves an average parameter retrieval accuracy of 88%, outperforming benchmark models without ReAct and/or few-shot prompting.
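One way to picture the "well-formatted parameters" end of the pipeline is a small validation step that checks and casts whatever the LLM extracted before it reaches the HEMS optimiser. The field names below are assumptions for the sketch, not the paper's schema.

```python
# Illustrative only: validate and normalise the parameters an LLM interface might
# extract from a user's free-form answer before handing them to an HEMS optimiser.
# The field names below are assumptions for the sketch, not the paper's schema.
import json

SCHEMA = {"battery_capacity_kwh": float, "ev_departure_hour": int, "comfort_temp_c": float}

def parse_llm_output(raw: str) -> dict:
    data = json.loads(raw)
    params = {}
    for key, cast in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing parameter: {key}")   # would trigger a follow-up question
        params[key] = cast(data[key])
    return params

llm_reply = '{"battery_capacity_kwh": "13.5", "ev_departure_hour": 8, "comfort_temp_c": 21}'
print(parse_llm_output(llm_reply))
```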
https://arxiv.org/abs/2501.07919
Automated code generation using large language models (LLMs) has gained attention due to its efficiency and adaptability. However, real-world coding tasks or benchmarks like HumanEval and StudentEval often lack dedicated training datasets, challenging existing few-shot prompting approaches that rely on reference examples. Inspired by human metamemory, a cognitive process involving recall and evaluation, we present a novel framework (named M^2WF) for improving LLMs' one-time code generation. This approach enables LLMs to autonomously generate, evaluate, and utilize synthetic examples to enhance reliability and performance. Unlike prior methods, it minimizes dependency on curated data and adapts flexibly to various coding scenarios. Our experiments demonstrate significant improvements in coding benchmarks, offering a scalable and robust solution for data-free environments. The code and framework will be publicly available on GitHub and HuggingFace.
https://arxiv.org/abs/2501.07892
We present a novel approach for depth estimation from images captured by structured light systems. Unlike many previous methods that rely on image matching process, our approach uses a density voxel grid to represent scene geometry, which is trained via self-supervised differentiable volume rendering. Our method leverages color fields derived from projected patterns in structured light systems during the rendering process, enabling the isolated optimization of the geometry field. This contributes to faster convergence and high-quality output. Additionally, we incorporate normalized device coordinates (NDC), a distortion loss, and a novel surface-based color loss to enhance geometric fidelity. Experimental results demonstrate that our method outperforms existing matching-based techniques in geometric performance for few-shot scenarios, achieving approximately a 60% reduction in average estimated depth errors on synthetic scenes and about 30% on real-world captured scenes. Furthermore, our approach delivers fast training, with a speed roughly three times faster than previous matching-free methods that employ implicit representations.
https://arxiv.org/abs/2501.07113
Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website this https URL.
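One step the abstract implies, recovering 3D motion from 2D tracks in two views, can be sketched with standard linear (DLT) triangulation; the camera matrices and example point below are made up for the demonstration and are not MT-pi's actual multi-view synthesis.

```python
# Hedged sketch: recover 3D points from 2D motion tracks in two calibrated views via
# linear (DLT) triangulation. The cameras and the example point are illustrative.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) projection matrices; x1, x2: (N, 2) pixel tracks; returns (N, 3)."""
    pts = []
    for (u1, v1), (u2, v2) in zip(x1, x2):
        A = np.stack([u1 * P1[2] - P1[0], v1 * P1[2] - P1[1],
                      u2 * P2[2] - P2[0], v2 * P2[2] - P2[1]])
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        pts.append(X[:3] / X[3])
    return np.array(pts)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])   # second view, offset baseline
X = np.array([[0.1, -0.05, 2.0, 1.0]])                              # homogeneous 3D point
x1 = (P1 @ X.T).T; x1 = x1[:, :2] / x1[:, 2:]
x2 = (P2 @ X.T).T; x2 = x2[:, :2] / x2[:, 2:]
print(triangulate(P1, P2, x1, x2))                                  # ~[0.1, -0.05, 2.0]
```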
https://arxiv.org/abs/2501.06994
This paper reports on learning a reward map for social navigation in dynamic environments where the robot can reason about its path at any time, given agents' trajectories and scene geometry. Humans navigating in dense and dynamic indoor environments often work with several implied social rules. A rule-based approach fails to model all possible interactions between humans, robots, and scenes. We propose a novel Smooth Maximum Entropy Deep Inverse Reinforcement Learning (S-MEDIRL) algorithm that can extrapolate beyond expert demos to better encode scene navigability from few-shot demonstrations. The agent learns to predict the cost maps reasoning on trajectory data and scene geometry. The agent samples a trajectory that is then executed using a local crowd navigation controller. We present results in a photo-realistic simulation environment, with a robot and a human navigating a narrow crossing scenario. The robot implicitly learns to exhibit social behaviors such as yielding to oncoming traffic and avoiding deadlocks. We compare the proposed approach to the popular model-based crowd navigation algorithm ORCA and a rule-based agent that exhibits yielding.
https://arxiv.org/abs/2501.06946