Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded and aligned with the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising efficiency. Training with DiMA results in a 37% reduction in L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
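The training-time-only role of the LLM can be sketched as a standard distillation objective. The function names, the MSE alignment term, and the weighting below are illustrative assumptions, not the paper's actual losses:

```python
# Hedged sketch of the joint objective the abstract describes: a vision-based
# planner is trained on its planning loss plus a distillation term that aligns
# its scene features with those induced by the multi-modal LLM.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dima_style_loss(pred_traj, gt_traj, student_feat, teacher_feat, alpha=0.5):
    """Planning loss + feature-distillation loss; the LLM branch (teacher_feat)
    is only needed at training time, so inference can stay LLM-free."""
    planning_loss = mse(pred_traj, gt_traj)         # L2 trajectory error
    distill_loss = mse(student_feat, teacher_feat)  # align shared scene features
    return planning_loss + alpha * distill_loss
```

Because `teacher_feat` appears only in the loss, the LLM branch can be dropped entirely at inference, which is the efficiency argument the abstract makes.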
https://arxiv.org/abs/2501.09757
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of information-rich content. Specifically, vanilla-retrieved information tends to lack depth and utility and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of a topic. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
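The expansion-and-reflection loop can be illustrated with a minimal sketch. Here `retrieve` stands in for the retrieval step and the novelty filter is a toy proxy for reflection; both are assumptions rather than OmniThink's actual components:

```python
def iterative_expand_and_reflect(seed_notes, retrieve, rounds=3):
    """Sketch of an OmniThink-style loop: each round retrieves candidate
    snippets for the current notes (expansion), then keeps only snippets that
    add new information (reflection), mimicking how a learner deepens
    knowledge while avoiding redundancy. `retrieve` is an assumed callable
    mapping the note list to a list of candidate snippets."""
    notes = list(seed_notes)
    seen = set(notes)
    for _ in range(rounds):
        for snippet in retrieve(notes):   # expansion step
            if snippet not in seen:       # reflection: filter redundancy
                seen.add(snippet)
                notes.append(snippet)
    return notes
```

The novelty filter is what keeps knowledge density up: redundant retrievals never enter the notes, echoing the abstract's critique of vanilla retrieval.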
https://arxiv.org/abs/2501.09751
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom, pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models across all benchmarks. However, they pose issues of accessibility and resource availability. Fine-tuning yielded competitive performance and offers a reliable alternative through domain-specific optimization, but its dependency on training data severely limits scalability. Zero-shot models consistently struggle to identify signals of economic ideology, often producing negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved unreliable. Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and the sensitivity of zero-shot models to prompt content. The assessment details the strengths and limitations of each model and derives best practices for automated analyses of political content.
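Granular-level agreement between model scores and human coding, including the negative associations reported for zero-shot models, is the kind of relationship a Pearson correlation captures. This helper is illustrative of that analysis, not the study's exact metric:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between model-assigned ideology scores and human
    codings: +1 for perfect agreement, negative values for the inverse
    associations the zero-shot models reportedly produced."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```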
https://arxiv.org/abs/2501.09719
Non-traditional students in cybersecurity programs often lack access to advice from peers, family members, and professors, which can hinder their educational experiences. Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues such as content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation, tailored to the needs of these students. We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by an agentic workflow and generative large language models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval, achieving accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real-time, on-demand learning support. Through three use scenarios, we showcased how CyberMentor facilitates knowledge acquisition and career preparation and provides seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to assess the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.
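A minimal sketch of the RAG step the platform relies on, with a toy word-overlap retriever standing in for a real embedding-based one; the retriever, function names, and prompt template are assumptions, not CyberMentor's implementation:

```python
def retrieve_top_k(query, documents, k=2):
    """Toy retrieval scorer for a RAG pipeline: rank documents by word
    overlap with the query (a stand-in for the actual retriever)."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, context_docs):
    """Assemble retrieved context plus the question, as in standard RAG,
    so the generator answers grounded in retrieved material."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"
```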
https://arxiv.org/abs/2501.09709
We present the e-Llama models: 8 billion and 70 billion parameter large language models adapted to the e-commerce domain. These models are meant as foundation models with deep knowledge of e-commerce that form a base for instruction tuning and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce-specific evaluation tasks. We show that, when the training setup is chosen carefully, the Llama 3.1 models can be adapted to the new domain without sacrificing significant performance on general-domain tasks. We also explore the possibility of merging the adapted model and the base model for better control of the performance trade-off between domains.
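The merging idea in the last sentence can be sketched as simple weight interpolation. Treating each model as a dict of flat parameter lists is an illustrative simplification; real merging operates on full checkpoints:

```python
def merge_weights(base, adapted, alpha=0.5):
    """Hedged sketch of model merging: linearly interpolate base-model and
    domain-adapted weights to trade off general vs. e-commerce performance.
    `alpha` is the adapted-model share; alpha=0 recovers the base model."""
    return {
        name: [(1 - alpha) * b + alpha * a
               for b, a in zip(base[name], adapted[name])]
        for name in base
    }
```

Sweeping `alpha` gives the controllable performance trade-off between domains that the abstract mentions.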
https://arxiv.org/abs/2501.09706
Owing to privacy and security concerns, the need to erase unwanted information from pre-trained vision models has become evident. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Under such a setting, selective information is expected to be continuously removed from a pre-trained model while the rest is maintained. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact of the forgetting procedure should be minimal. (iii) In real-world scenarios, training samples may be scarce or partially missing during forgetting. To address these challenges, we first propose Group Sparse LoRA (GS-LoRA). Specifically, toward (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and toward (ii), a simple group-sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection, and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on this https URL.
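The group-sparse regularization toward (ii) can be sketched as a group-lasso penalty over LoRA parameter groups; the flat-list data layout below is an assumption:

```python
import math

def group_sparse_penalty(lora_groups, lam=0.1):
    """Sketch of the group-sparse regularizer described for GS-LoRA: the sum
    of per-group L2 norms (group lasso). This drives whole LoRA groups toward
    exactly zero, so only the groups needed for a forgetting task stay active.
    `lora_groups` is an assumed list of flat parameter lists, one per group."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in lora_groups)
```

Unlike a plain L1 penalty on individual weights, the group norm zeroes out entire LoRA modules at once, which is what enables the automatic group selection the abstract describes.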
https://arxiv.org/abs/2501.09705
Metric learning projects samples into an embedding space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments: probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used to progressively refine the model's representations, exploiting structure in the data that explicit labels do not capture and thereby improving audio-visual embedding learning.
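The cross-modal triplet loss at the core of the architecture follows the standard margin formulation. The sketch below uses plain lists and squared-L2 distances as illustrative choices, not the paper's exact configuration:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss of the kind the architecture builds on:
    pull an (e.g. audio) anchor toward its matching visual embedding and
    push it away from a mismatched one, up to a margin."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d2(anchor, positive) - d2(anchor, negative) + margin)
```

The loss is zero once the negative is farther than the positive by at least the margin, so training effort concentrates on hard cross-modal pairs.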
https://arxiv.org/abs/2501.09608
Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate these problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method built on a newly designed Teacher-Reviewer-Student framework. Specifically, we first mimic the phenomenon that human beings constantly review previous knowledge for consolidation to design our framework, in which the teacher predicts results to guide the student's learning and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap and guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, i.e., Keypoint-Mix, to perturb pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements compared to existing methods.
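The Keypoint-Mix idea can be sketched as randomly drawing each keypoint from one of two poses; the exact mixing rule in the paper may differ, so treat this as an illustration of the augmentation, not its definition:

```python
import random

def keypoint_mix(pose_a, pose_b, mix_ratio=0.5, rng=None):
    """Sketch of a Keypoint-Mix-style augmentation: build a perturbed pose by
    taking each keypoint from one of two poses at random. Poses are lists of
    (x, y) keypoints in the same joint order; `mix_ratio` is the probability
    of drawing from pose_a."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    return [a if rng.random() < mix_ratio else b
            for a, b in zip(pose_a, pose_b)]
```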
https://arxiv.org/abs/2501.09565
Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model's decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation framework for class Incremental Fault Diagnosis (SCLIFD), which proposes supervised contrastive knowledge distillation for improved representation learning and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and a Random Forest classifier to address class imbalance. Extensive experiments on simulated and real-world industrial datasets across various imbalance ratios demonstrate the superiority of SCLIFD over existing approaches. Our code can be found at this https URL.
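Prioritized exemplar selection for sample replay can be illustrated with a herding-style rule (keep the samples nearest the class mean). The paper's actual priority criterion is not specified here, so this rule is an assumption:

```python
def select_exemplars(features, m):
    """Sketch of a replay-buffer selection rule: keep the m feature vectors
    closest to the class mean, so the stored exemplars are representative of
    the fault class and can be replayed to ease catastrophic forgetting.
    `features` is a list of equal-length vectors from one class."""
    n = len(features)
    mean = [sum(f[i] for f in features) / n for i in range(len(features[0]))]
    def dist(f):
        return sum((x, ) and (x - mu) ** 2 for x, mu in zip(f, mean))
    return sorted(features, key=dist)[:m]
```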
https://arxiv.org/abs/2501.09525
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy: 99.97% on COVID-19 image data and 93.73% on speech (cough) data. However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
This paper addresses the challenges of translating case law under Hong Kong's bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the government's and judiciary's sporadic and uncoordinated efforts to translate case law, contrasting them with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciary's position that translating all judgments is unnecessary, unrealistic, and not cost-effective is analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human-machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to a large language model for improved translation accuracy. Furthermore, it evolves from a single-agent system to a multi-agent system, incorporating Translator, Annotator, and Proofreader agents. This multi-agent approach, supported by a grant, aims to facilitate efficient, high-quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.
https://arxiv.org/abs/2501.09444
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.
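The gap-identification step can be sketched as flagging pre-training samples far from every instruction-tuning example in embedding space. The distance metric and threshold rule below are assumptions, not AITP's actual selection criterion:

```python
def coverage_shortfall(pretrain_embs, instruct_embs, threshold):
    """Sketch of identifying coverage shortfalls: flag pre-training samples
    whose nearest instruction-tuning example (by squared-L2 distance) is
    farther than a threshold, marking them as underrepresented and therefore
    candidates for rewriting into instruction-response pairs. Embeddings are
    flat lists of floats."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [p for p in pretrain_embs
            if min(d2(p, q) for q in instruct_embs) > threshold]
```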
https://arxiv.org/abs/2501.09368
Few-shot class-incremental learning requires a model to learn new classes from a small number of training instances while retaining knowledge of previously learned classes. Existing frameworks typically freeze the parameters of the previously learned classes during the incorporation of new classes. However, this approach often results in suboptimal class separation of previously learned classes, leading to overlap between old and new classes. Consequently, performance on old classes degrades as new classes are added. To address these challenges, we propose a novel feature-augmentation-driven contrastive learning framework designed to enhance the separation of previously learned classes to accommodate new classes. Our approach involves augmenting feature vectors and assigning proxy labels to these vectors. This strategy expands the feature space, ensuring seamless integration of new classes within the expanded space. Additionally, we employ a self-supervised contrastive loss to improve the separation between previous classes. We validate our framework through experiments on three FSCIL benchmark datasets: CIFAR100, miniImageNet, and CUB200. The results demonstrate that our feature-augmentation-driven contrastive learning framework significantly outperforms other approaches, achieving state-of-the-art performance.
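The feature augmentation with proxy labels can be sketched as follows. The negation transform is a stand-in for the paper's actual augmentation, and offsetting labels by the class count is one simple proxy-labeling scheme; both are assumptions:

```python
def augment_with_proxy_labels(features, labels, num_classes, scale=-1.0):
    """Sketch of the feature-augmentation idea: create transformed copies of
    feature vectors (here a simple negation as a placeholder transform) and
    assign them proxy labels offset by num_classes. The augmented set expands
    the feature/label space so that old classes separate better and future
    classes fit into the expanded space."""
    aug_feats = [[scale * x for x in f] for f in features]
    proxy_labels = [y + num_classes for y in labels]
    return features + aug_feats, labels + proxy_labels
```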
https://arxiv.org/abs/2501.09361
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
https://arxiv.org/abs/2501.09347
Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets neither shift nor defocus, which is widely used as a feature for MTD. However, the shadows are difficult to distinguish from low-scattering regions in the background, which causes more missed detections and false alarms. It is therefore worth investigating how to enhance the distinction between shadows and background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. The SE-BSFV algorithm is based on low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress background in ViSAR images. First, we use a registration algorithm to register the ViSAR images and utilize a Gaussian mixture distribution (GMD) to model the ViSAR data. Second, the knowledge learned from previous frames is leveraged to estimate the GMD parameters of the current frame, and the expectation-maximization (EM) algorithm is used to estimate the subspace parameters. The foreground matrix of the current frame can then be obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that the SE-BSFV algorithm significantly enhances the shadows' saliency and greatly improves detection performance while ensuring efficiency, compared with several other advanced pre-processing algorithms.
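The proximal step inside ADMM solvers of this kind is the soft-thresholding (shrinkage) operator. The sketch below shows that operator alone, as an illustration of the final cleanup stage, not the paper's full low-rank solver:

```python
def soft_threshold(x, tau):
    """Soft-thresholding (shrinkage) operator: shrink a value toward zero by
    tau and clip small values to exactly zero. In ADMM schemes for sparse
    foreground recovery, applying this elementwise suppresses weak entries
    while strong scatterers can be separated out."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0
```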
https://arxiv.org/abs/2501.09341
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
https://arxiv.org/abs/2501.09321
Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLM) restrict AI agents from effectively solving complex tasks that require long-horizon planning. Second, general-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent a SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision-making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision-making capabilities of AI agents in customer service scenarios based on SOPs.
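Executing an SOP represented as a decision graph can be sketched as a simple traversal in which a callable (standing in for the LLM's judgment at each node) picks the outgoing edge. The dict-based graph encoding is an assumption:

```python
def run_sop(graph, start, answer):
    """Sketch of SOP execution as decision-graph traversal: `graph` maps each
    node to {choice: next_node} (terminal nodes have no outgoing edges), and
    `answer(node)` (a stand-in for the agent's LLM judgment) selects which
    edge to follow. Returns the path of visited nodes."""
    node, path = start, [start]
    while graph.get(node):
        node = graph[node][answer(node)]
        path.append(node)
    return path
```

Because the SOP constrains which decisions are reachable from each node, domain expertise is enforced by the graph rather than left to open-ended LLM planning, which is the framework's core argument.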
https://arxiv.org/abs/2501.09316
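The "SOP as a decision graph" idea above can be sketched as a node table whose edges are condition checks, traversed until a terminal node. Everything here is illustrative: the node names, the customer-service flavor, and the lambda-based dispatch are our assumptions, not the paper's actual SOP format.

```python
# Hypothetical SOP decision graph: each node has an action and a
# condition that selects the next node from the task context.
SOP = {
    "start":        {"action": "greet customer",
                     "next": lambda ctx: "refund" if ctx.get("wants_refund") else "faq"},
    "refund":       {"action": "check order status",
                     "next": lambda ctx: "issue_refund" if ctx.get("order_found") else "escalate"},
    "faq":          {"action": "answer from knowledge base", "next": lambda ctx: None},
    "issue_refund": {"action": "process refund",             "next": lambda ctx: None},
    "escalate":     {"action": "hand off to human agent",    "next": lambda ctx: None},
}

def traverse(sop, ctx, node="start"):
    """Walk the decision graph, collecting the actions the agent executes."""
    actions = []
    while node is not None:
        step = sop[node]
        actions.append(step["action"])
        node = step["next"](ctx)
    return actions

print(traverse(SOP, {"wants_refund": True, "order_found": False}))
# -> ['greet customer', 'check order status', 'hand off to human agent']
```

The point of the graph encoding is that the procedure, not the LLM, owns the long-horizon control flow; the model only needs to ground each local decision.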
This review underscores the critical need for effective strategies to identify and support individuals with suicidal ideation, exploiting technological innovations in ML and DL to further suicide prevention efforts. The study details the application of these technologies in analyzing vast amounts of unstructured social media data to detect linguistic patterns, keywords, phrases, tones, and contextual cues associated with suicidal thoughts. It explores various ML and DL models like SVMs, CNNs, LSTM, neural networks, and their effectiveness in interpreting complex data patterns and emotional nuances within text data. The review discusses the potential of these technologies to serve as a life-saving tool by identifying at-risk individuals through their digital traces. Furthermore, it evaluates the real-world effectiveness, limitations, and ethical considerations of employing these technologies for suicide prevention, stressing the importance of responsible development and usage. The study aims to fill critical knowledge gaps by analyzing recent studies, methodologies, tools, and techniques in this field. It highlights the importance of synthesizing current literature to inform practical tools and suicide prevention efforts, guiding innovation in reliable, ethical systems for early intervention. This research synthesis evaluates the intersection of technology and mental health, advocating for the ethical and responsible application of ML, DL, and NLP to offer life-saving potential worldwide while addressing challenges like generalizability, biases, privacy, and the need for further research to ensure these technologies do not exacerbate existing inequities and harms.
https://arxiv.org/abs/2501.09309
Retrieval-Augmented Generation equips large language models with the capability to retrieve external knowledge, thereby mitigating hallucinations by incorporating information beyond the model's intrinsic abilities. However, most prior works have focused on invoking retrieval deterministically, which makes it unsuitable for tasks such as long-form question answering. Instead, dynamically performing retrieval by invoking it only when the underlying LLM lacks the required knowledge can be more efficient. In this context, we delve deeper into the question, "To Retrieve or Not to Retrieve?" by exploring multiple uncertainty detection methods. We evaluate these methods for the task of long-form question answering, employing dynamic retrieval, and present our comparisons. Our findings suggest that uncertainty detection metrics, such as Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval calls by almost half, with only a slight reduction in question-answering accuracy.
https://arxiv.org/abs/2501.09292
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
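The combination above, classic grid/graph heuristic costs overlaid with semantic knowledge about the site layout, can be sketched with a plain Dijkstra planner whose cell costs are inflated by hyperedges grouping cells into semantic zones. The zone name, penalty value, and grid are invented for illustration; the paper's hypergraph model is richer than this.

```python
import heapq

# Hypothetical semantic overlay: a hyperedge groups grid cells into a
# zone (e.g. a crane's swing area) that the planner should avoid.
HYPEREDGES = {"crane_swing_zone": {(1, 1), (1, 2), (2, 1), (2, 2)}}
PENALTY = {"crane_swing_zone": 5.0}

def cell_cost(cell):
    """Base traversal cost plus penalties from every zone containing the cell."""
    cost = 1.0
    for zone, cells in HYPEREDGES.items():
        if cell in cells:
            cost += PENALTY[zone]
    return cost

def plan(grid_size, start, goal):
    """Dijkstra over a 4-connected grid; semantic penalties steer the path."""
    w, h = grid_size
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, cur = heapq.heappop(pq)
        if cur == goal:
            break
        if d > dist.get(cur, float("inf")):
            continue
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < w and 0 <= nxt[1] < h:
                nd = d + cell_cost(nxt)
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt], prev[nxt] = nd, cur
                    heapq.heappush(pq, (nd, nxt))
    path, node = [], goal            # reconstruct goal -> start
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return path[::-1]

print(plan((4, 4), (0, 0), (3, 3)))  # routes around the crane zone
```

Declarative knowledge ("the crane swings over these cells") thus changes the planner's behavior without touching the planning algorithm itself, which is the appeal of keeping the semantic layer separate from the grid search.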