Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can significantly boost reasoning accuracy further. Together, train-time and test-time scaling reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.
长期以来,人们一直认为语言是人类推理的必要工具。大型语言模型(LLMs)的重大突破激发了利用这些模型来解决复杂推理任务的研究兴趣。研究人员已经超越了简单的自回归令牌生成,引入了“思维”这一概念——一系列代表推理过程中间步骤的令牌序列。这种创新的方法使LLM能够模仿复杂的类人类推理过程,如树搜索和反思思考。最近,一种新兴的学习推理趋势是应用强化学习(RL)来训练LLM掌握推理过程。这种方法通过试错算法实现了高质量推理轨迹的自动生成,并且通过提供大量训练数据显著扩展了LLM的推理能力。此外,近期研究表明,在测试时鼓励LLMs使用更多令牌进行“思考”可以进一步大幅提高推理准确性。因此,结合训练时间和测试时间上的扩展展示了一个新的研究前沿——通向大型推理模型的道路。OpenAI推出的o1系列标志着这一研究方向的一个重要里程碑。在这份综述中,我们将介绍近期在LLM推理方面的重大进展。我们首先引入LLMs的基础背景知识,然后探讨驱动大规模推理模型发展的关键技术组件,重点在于自动化数据构建、学习推理技术以及测试时间扩展。此外,我们还会分析一些热门的开源项目在构建大型推理模型中的应用,并最终提出开放性挑战和未来研究方向。
https://arxiv.org/abs/2501.09686
While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and for unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collection and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.
尽管大型语言模型(LLMs)在支持众多现实世界应用和带来积极社会影响方面展现出巨大潜力,但它们仍然面临着固有的隐私泄露风险、幻觉输出以及价值不一致等重大挑战,并且在被破解后可能会恶意用于生成有毒内容和不符合伦理的目的。因此,在本次综述中,我们全面回顾了近期为减轻这些问题而取得的进展,这些进展按照LLMs开发和使用过程中的四个阶段进行组织:数据收集与预训练、微调与对齐、提示与推理以及后期处理与审计。我们将详细阐述最近在增强LLMs隐私保护性能、幻觉减少、价值一致性和消除毒性以及防破解措施方面的进展。 与之前专注于负责任的LLMs单一维度的综述不同,本综述提出了一种涵盖这些多样维度的统一框架,为如何通过多种方式提升LLMs性能以更好地服务于现实世界应用提供了全面的看法。
https://arxiv.org/abs/2501.09431
Image segmentation, a key task in computer vision, has traditionally relied on convolutional neural networks (CNNs), yet these models struggle to capture complex spatial dependencies and contextual information, to handle objects at varying scales, and to avoid the need for manually crafted architecture components. This paper explores the shortcomings of CNN-based models and the shift towards transformer architectures to overcome those limitations. This work reviews state-of-the-art transformer-based segmentation models, addressing segmentation-specific challenges and their solutions. The paper discusses current challenges in transformer-based segmentation and outlines promising future trends, such as lightweight architectures and enhanced data efficiency. This survey serves as a guide for understanding the impact of transformers in advancing segmentation capabilities and overcoming the limitations of traditional models.
图像分割,作为计算机视觉中的一个关键任务,长期以来一直依赖于卷积神经网络(CNN)。然而,这些模型在捕捉复杂的空间依赖关系、处理不同尺度的对象以及利用手工设计的架构组件和上下文信息方面存在困难。本文探讨了基于 CNN 的模型的不足之处,并转向基于变压器架构的趋势以克服这些限制。这项工作回顾了最新的基于变压器的分割模型,针对特定于分割的挑战及其解决方案进行了讨论。 论文还讨论了当前基于变压器的分割面临的挑战,并概述了一些有前景的发展趋势,例如轻量级架构和增强的数据效率。这篇综述旨在帮助理解变压器在提升分割能力以及克服传统模型局限性方面的影响。
https://arxiv.org/abs/2501.09372
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multistep reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multiagent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision-making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.
大型语言模型(LLMs)通过实现类似人类的文本生成和自然语言理解彻底革新了人工智能领域。然而,它们依赖于静态训练数据的特性限制了其应对动态、实时查询的能力,导致输出过时或不准确。检索增强生成(RAG)作为一种解决方案应运而生,它通过将实时数据检索集成到LLM中来提供上下文相关和最新的响应。尽管RAG系统展现出巨大潜力,但传统的RAG系统受限于静态工作流程,并且缺乏多步推理和复杂任务管理所需的适应性。 代理式检索增强生成(Agentic RAG)则超越了这些限制,在RAG管道中嵌入自主AI代理。这些代理利用代理式设计模式(反思、规划、工具使用及多代理协作),以动态管理检索策略、迭代细化上下文理解,并调整工作流程以满足复杂任务需求。这种集成使得Agentic RAG系统能够在各种应用中提供无与伦比的灵活性、可扩展性和上下文感知能力。 本文综述全面探索了Agentic RAG,从其基础原理和RAG范式的演进开始。它详细介绍了Agentic RAG架构的分类,并强调了在医疗保健、金融和教育等行业中的关键应用,同时探讨了实用实现策略。此外,该综述还讨论了扩展这些系统、确保伦理决策以及优化实际应用性能所面临的挑战,并提供了关于实施Agentic RAG的框架和工具的详细见解。
https://arxiv.org/abs/2501.09136
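The retrieve, reflect, and refine cycle that the Agentic RAG abstract describes can be pictured as a small control loop. The sketch below is illustrative only and is not taken from the survey: the function `agentic_rag` and its callable parameters (`retrieve`, `is_sufficient`, `refine_query`, `generate`) are hypothetical stand-ins for a retriever, a reflection step, a planning step, and an LLM call.

```python
def agentic_rag(question, retrieve, is_sufficient, refine_query, generate, max_steps=3):
    """Toy agentic RAG loop; all callables are hypothetical placeholders.

    retrieve(query)              -> list of passages
    is_sufficient(question, ctx) -> bool, the "reflection" step
    refine_query(question, ctx)  -> new query string, the "planning" step
    generate(question, ctx)      -> final answer
    """
    query = question
    context = []
    for _ in range(max_steps):
        # Retrieve with the current query and accumulate the evidence.
        context.extend(retrieve(query))
        # Reflect: stop retrieving once the gathered context looks adequate.
        if is_sufficient(question, context):
            break
        # Plan: rewrite the query before the next retrieval round.
        query = refine_query(question, context)
    return generate(question, context)
```

A static RAG pipeline corresponds to `max_steps=1` with no reflection; the loop is what lets the retrieval strategy adapt to the task.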
We present Mantis Shrimp, a multi-survey deep learning model for photometric redshift estimation that fuses ultraviolet (GALEX), optical (PanSTARRS), and infrared (UnWISE) imagery. Machine learning is now an established approach for photometric redshift estimation, with generally acknowledged higher performance than template-based methods in areas with a high density of spectroscopically identified galaxies. Multiple works have shown that image-based convolutional neural networks can outperform tabular-based color/magnitude models. In comparison to tabular models, image models have additional design complexities: it is largely unknown how to fuse inputs from different instruments which have different resolutions or noise properties. The Mantis Shrimp model estimates the conditional density of redshift using cutout images. The density estimates are well calibrated, and the point estimates perform well within the distribution of available spectroscopically confirmed galaxies, with bias (1e-2), scatter (NMAD = 2.44e-2), and catastrophic outlier rate ($\eta$ = 17.53%). We find that early fusion approaches (e.g., resampling and stacking images from different instruments) match the performance of late fusion approaches (e.g., concatenating latent space representations), so that the design choice ultimately is left to the user. Finally, we study how the models learn to use information across bands, finding evidence that our models successfully incorporate information from all surveys. The applicability of our model to the analysis of large populations of galaxies is limited by the speed of downloading cutouts from external servers; however, our model could be useful in smaller studies such as generating priors over redshift for stellar population synthesis.
我们介绍了"Mantis Shrimp",这是一种用于光谱红移估计的多调查深度学习模型,它融合了紫外线(GALEX)、光学(PanSTARRS)和红外线(UnWISE)图像。机器学习现在已成为一种公认的光谱红移估计方法,在高密度光谱识别星系区域中通常比基于模板的方法性能更高。多个研究工作表明,基于图像的卷积神经网络可以优于基于表格的颜色/亮度模型。与表格模型相比,图像模型具有额外的设计复杂性:如何融合来自不同仪器(它们有不同的分辨率或噪声特性)的输入在很大程度上仍不清楚。 Mantis Shrimp 模型利用剪切图来估计红移的概率密度函数。这些概率密度估计被很好地校准了,并且点估计值在可用的光谱确认星系分布中表现良好,偏差为1e-2,散度(NMAD)为2.44e-2,灾难性异常率为17.53%。 我们发现早期融合方法(例如,重采样和堆叠来自不同仪器的图像)可以与晚期融合方法(例如,连接潜在空间表示)匹配性能,因此最终的设计选择留给用户决定。最后,我们研究了模型如何学习跨波段使用信息,并找到了证据证明我们的模型成功地整合了所有调查中的信息。 我们模型在大规模星系群体分析中的适用性受到从外部服务器下载剪切图速度的限制;然而,在诸如生成红移先验用于恒星光谱合成等小型研究中,该模型可能非常有用。
https://arxiv.org/abs/2501.09112
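The Mantis Shrimp abstract reports three point-estimate metrics (bias, NMAD scatter, and catastrophic outlier rate) without defining them. The sketch below uses definitions that are common in the photometric redshift literature. These are assumptions, not necessarily the paper's exact choices: the residual is taken as (z_phot - z_spec) / (1 + z_spec), the bias as its mean, and the outlier threshold `outlier_cut=0.05` is a hypothetical default.

```python
from statistics import median

def photoz_metrics(z_phot, z_spec, outlier_cut=0.05):
    """Return (bias, nmad, outlier_rate) for paired redshift estimates.

    Assumed definitions (not confirmed by the abstract):
      dz   = (z_phot - z_spec) / (1 + z_spec)
      bias = mean(dz)
      nmad = 1.4826 * median(|dz - median(dz)|)
      outlier_rate = fraction of objects with |dz| > outlier_cut
    """
    if len(z_phot) != len(z_spec) or not z_phot:
        raise ValueError("inputs must be non-empty and of equal length")
    dz = [(p - s) / (1.0 + s) for p, s in zip(z_phot, z_spec)]
    bias = sum(dz) / len(dz)
    med = median(dz)
    # 1.4826 scales the median absolute deviation to match a Gaussian sigma.
    nmad = 1.4826 * median([abs(d - med) for d in dz])
    outlier_rate = sum(abs(d) > outlier_cut for d in dz) / len(dz)
    return bias, nmad, outlier_rate
```

Because NMAD is built from medians, a single catastrophic outlier barely moves it, which is why the outlier rate is reported as a separate number.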
Facial recognition models are increasingly employed by commercial enterprises, government agencies, and cloud service providers for identity verification, consumer services, and surveillance. These models are often trained using vast amounts of facial data processed and stored in cloud-based platforms, raising significant privacy concerns. Users' facial images may be exploited without their consent, leading to potential data breaches and misuse. This survey presents a comprehensive review of current methods aimed at preserving facial image privacy in cloud-based services. We categorize these methods into two primary approaches: image obfuscation-based protection and adversarial perturbation-based protection. We provide an in-depth analysis of both categories, offering qualitative and quantitative comparisons of their effectiveness. Additionally, we highlight unresolved challenges and propose future research directions to improve privacy preservation in cloud computing environments.
面部识别模型被商业企业、政府机构和云服务提供商广泛用于身份验证、消费者服务和监控。这些模型通常使用大量存储在云端平台中的面部数据进行训练,这引发了重大的隐私问题。未经用户同意的情况下,用户的面部图像可能会被滥用,从而导致潜在的数据泄露和误用。本综述全面回顾了当前旨在保护云服务中面部图像隐私的方法,并将这些方法分为两大类:基于图像模糊化的方法和基于对抗扰动的方法。我们对这两类方法进行了深入分析,并提供了定性和定量的有效性比较。此外,我们还指出了未解决的挑战并提出了未来的研究方向,以改善云计算环境中的隐私保护措施。
https://arxiv.org/abs/2501.08665
The anonymity offered by social media has increased the popularity of these platforms among all age groups. The availability of public Wi-Fi networks has made a vast variety of online content, including social media applications, easy to access. Although anonymity and ease of access make these platforms a convenient means of communication for their users, they also make it difficult to manage the platforms and protect vulnerable users against sexual predators. An automated identification system that can attribute predators to their text would make a solution more attainable. In this survey, we provide a review of the methods of pedophile attribution used in social media platforms. We examine the effect of the size of the suspect set and the length of the text on the task of attribution. Moreover, we review the most-used datasets, features, classification techniques and performance measures for attributing sexual predators. We found that a few studies have proposed tools to mitigate the risk of online sexual predators, but none of them can provide suspect attribution. Finally, we list several open research problems.
社交媒体中依赖匿名性的现象增加了各年龄段用户在这些平台上的流行度。公共Wi-Fi网络的可用性促进了大量在线内容的发展,包括社交应用程序。虽然匿名性和易访问性可以为用户提供方便的交流手段,但很难管理和保护其脆弱用户免受性侵者侵害。使用自动化识别系统来将性侵者与其文本关联起来会使解决方案更加可行。在这项调查中,我们回顾了在社交媒体平台上用于确定恋童癖者的各种方法。我们分析了嫌疑对象集合大小和文本长度对归因任务的影响,并且审查了最常用的数据库、特征分类技术以及评估属性性侵者表现的测量标准。我们发现虽然有一些研究提出了减少网上性侵犯风险的工具,但没有一个可以提供嫌疑人归属信息。最后,我们列出了一些开放的研究问题。
https://arxiv.org/abs/2501.08296
Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable alternative to traditional human-centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI-assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.
大型语言模型的迅速进步在处理和总结非结构化文本数据方面展现出了显著的能力。这为分析诸如调查回复等丰富、开放式的数据集提供了重要的意义,其中LLM(大型语言模型)有望高效地提炼关键主题和情感。然而,随着组织越来越多地依赖这些强大的AI系统来理解文字反馈,一个至关重要的问题出现了:我们能否信任LLM准确地代表文本数据集中包含的观点?虽然LLM在生成类人类摘要方面表现出色,但它们的输出可能会无意中偏离原始回复的真实内容。LLM生成的内容与实际数据中存在的主题之间的差异可能导致决策失误,并对组织产生深远的影响。 这项研究探讨了将LLM作为评判模型来评估其他LLM生成总结的主题一致性有效性的问题。我们使用Anthropic Claude模型从开放式调查回应中生成主题摘要,而亚马逊的Titan Express、Nova Pro和Meta的Llama则用作评判LLM。通过Cohen's kappa、Spearman's rho以及Krippendorff's alpha与人类评估方法进行比较,验证了这一基于LLM的评判方式作为一种可扩展的人类中心评估方法替代方案的有效性。 我们的研究结果表明,虽然作为评判模型的LLM提供了一种与人类评价员相当的可扩展解决方案,但人类在检测细微且特定于上下文的差异方面仍然可能更胜一筹。这项研究为AI辅助文本分析的现有知识体系做出了贡献。我们讨论了该领域的局限性,并提供了对未来研究的建议,强调了在不同场景和应用中泛化LLM评判模型时需要谨慎考虑的重要性。
https://arxiv.org/abs/2501.08167
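Of the agreement statistics named in the LLM-as-judge abstract, Cohen's kappa is the simplest to compute by hand. It corrects the raw agreement rate between two raters for the agreement expected by chance. The sketch below implements the standard two-rater formula kappa = (p_o - p_e) / (1 - p_e). The function name and the toy ratings in the usage note are illustrative and not taken from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with hypothetical human ratings `[1, 1, 0, 0]` and judge ratings `[1, 1, 0, 1]`, the raters agree on 3 of 4 items (p_o = 0.75), chance agreement is 0.5, and kappa is 0.5.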
Large Language Models (LLMs), which train hundreds of millions or more parameters on large amounts of text data to understand and generate natural language, have attracted a lot of attention in various fields due to their superior performance. As this performance becomes apparent, LLMs are increasingly being applied to knowledge graph embedding (KGE) related tasks to improve the processing results. As deep learning models in the field of Natural Language Processing (NLP), they learn from large amounts of textual data to predict the next word or generate content related to a given text. Recently, LLMs have been invoked to varying degrees in different types of KGE-related scenarios, such as multi-modal KGE and open KGE, according to the characteristics of each task. In this paper, we investigate a wide range of approaches for performing LLM-related tasks in different types of KGE scenarios. To better compare the various approaches, we summarize each KGE scenario in a classification. In addition to the categorization, we provide a tabular overview of the methods and their source code links for a more direct comparison. We also discuss the applications in which the methods are mainly used and suggest several forward-looking directions for the development of this new research area.
大型语言模型(LLMs)由于其卓越的性能,在多个领域吸引了广泛关注。这些模型旨在通过在大量文本数据上训练数百亿甚至更多的参数,来理解和生成自然语言。随着LLMs优越性能的显现,它们越来越多地被应用于知识图谱嵌入(KGE)相关任务中,以提高处理结果的质量。作为自然语言处理(NLP)领域的深度学习模型,LLMs通过学习大量文本数据来预测下一个单词或生成与给定文本相关的其他内容。然而,最近根据不同的任务特性,LLMs不同程度地被应用于多种类型的KGE场景,例如多模态KGE和开放型KGE等。 在本文中,我们研究了在不同类型KGE场景下执行LLM相关任务的各种方法。为了更好地对比这些不同方法,我们将每种KGE场景进行分类总结,并提供了一览表来概述各种方法及其源代码链接,以便直接比较它们的特性。此外,在文章中,我们也讨论了这些方法的主要应用场景,并提出了一些对这一新兴研究领域未来发展方向的看法和建议。
https://arxiv.org/abs/2501.07766
Deep Neural Networks (DNNs) have grown increasingly large to achieve state-of-the-art performance across a wide range of tasks. However, their high computational requirements make them less suitable for resource-constrained applications. Also, real-world datasets often consist of a mixture of easy and complex samples, necessitating adaptive inference mechanisms that account for sample difficulty. Early exit strategies offer a promising solution by enabling adaptive inference, where simpler samples are classified using the initial layers of the DNN, thereby accelerating the overall inference process. By attaching classifiers at different layers, early exit methods not only reduce inference latency but also improve model robustness against adversarial attacks. This paper presents a comprehensive survey of early exit methods and their applications in NLP.
深度神经网络(DNN)为了在广泛的任务中实现最先进的性能,其规模已经变得越来越大。然而,它们的高计算需求使其不太适合资源受限的应用场景。此外,在现实世界的数据集中往往包含简单和复杂样本的混合,这就需要能够根据样本难度进行自适应推断的机制。早期退出策略提供了一种有前景的解决方案:通过利用DNN的初始层对简单的样本进行分类,从而加速整个推断过程。通过在不同层级附加分类器,早期退出方法不仅可以减少推理延迟,还能提高模型对抗攻击的鲁棒性。本文全面回顾了早期退出方法及其在自然语言处理(NLP)中的应用。
https://arxiv.org/abs/2501.07670
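The early exit mechanism described in this abstract can be sketched as a loop over layers, each paired with its own classifier, that stops as soon as a classifier is confident enough. The sketch below assumes a max-probability confidence rule with a hypothetical default `threshold=0.9`; the survey covers other exit criteria as well. `layers` and `exit_heads` are placeholder callables, not the API of any particular library.

```python
def early_exit_predict(x, layers, exit_heads, threshold=0.9):
    """Run layers in order and return (predicted_class, exit_depth).

    layers[i](hidden)     -> next hidden representation (placeholder callable)
    exit_heads[i](hidden) -> list of class probabilities (placeholder callable)
    Exits at the first depth whose top probability reaches `threshold`;
    the final layer always produces a prediction.
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")
    hidden = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)
        probs = head(hidden)
        confidence = max(probs)
        # Easy samples clear the threshold early and skip the remaining layers.
        if confidence >= threshold or depth == len(layers):
            return probs.index(confidence), depth
```

The returned depth is a direct measure of how much computation a sample consumed, so averaging it over a dataset gives the expected inference saving for a chosen threshold.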
Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and personalization. Powered by modern AI technologies such as multimodal large language models and world models, Embodied AI (EmAI) represents a transformative frontier, offering enhanced autonomy and the ability to interact with the physical world to address these challenges. As an interdisciplinary and rapidly evolving research domain, "EmAI in healthcare" spans diverse fields such as algorithms, robotics, and biomedicine. This complexity underscores the importance of timely reviews and analyses to track advancements, address challenges, and foster cross-disciplinary collaboration. In this paper, we provide a comprehensive overview of the "brain" of EmAI for healthcare, wherein we introduce foundational AI algorithms for perception, actuation, planning, and memory, and focus on presenting the healthcare applications spanning clinical interventions, daily care & companionship, infrastructure support, and biomedical research. Despite its promise, the development of EmAI for healthcare is hindered by critical challenges such as safety concerns, gaps between simulation platforms and real-world applications, the absence of standardized benchmarks, and uneven progress across interdisciplinary domains. We discuss the technical barriers and explore ethical considerations, offering a forward-looking perspective on the future of EmAI in healthcare. A hierarchical framework of intelligent levels for EmAI systems is also introduced to guide further development. By providing systematic insights, this work aims to inspire innovation and practical applications, paving the way for a new era of intelligent, patient-centered healthcare.
全球范围内的医疗保健系统面临着效率、可及性和个性化等方面的持续挑战。借助现代人工智能技术,如多模态大型语言模型和世界模型,具身AI(EmAI)代表了变革性的前沿领域,能够通过增强自主性并交互物理世界来应对这些挑战。作为一门跨学科且快速发展的研究领域,“医疗保健中的EmAI”涵盖了算法、机器人技术和生物医学等多样化的领域。这种复杂性强调了及时进行综述和分析以追踪进展、解决难题以及促进跨学科合作的重要性。 本文提供了一种关于“医疗保健中EmAI的‘大脑’”的全面概述,其中包括介绍感知、行动、规划和记忆等方面的基础人工智能算法,并重点介绍了跨越临床干预、日常护理与陪伴、基础设施支持及生物医学研究等领域的健康应用。尽管前景广阔,但由于安全问题、仿真平台与实际应用场景之间的差距、缺乏标准化基准以及跨学科领域进展不均等问题,EmAI在医疗保健中的发展受到了阻碍。 本文还讨论了技术障碍,并探讨了伦理考量,为EmAI在医疗保健的未来提供了前瞻性视角。此外,我们引入了一种智能层级框架来指导EmAI系统进一步的发展。通过提供系统的见解,这项工作旨在激发创新与实际应用,开启一个以患者为中心、智能化的新时代医疗服务。
https://arxiv.org/abs/2501.07468
Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g., pruning, quantization) neglect the fact that different inputs have different complexities and thus require different amounts of computation. Dynamic Neural Networks allow the amount of computation to be conditioned on the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction.
模型压缩对于在嵌入式设备上部署大型计算机视觉模型至关重要。然而,静态优化技术(例如剪枝、量化等)忽略了这样一个事实:不同的输入具有不同的复杂性,因此需要不同量的计算。动态神经网络允许根据特定输入调整计算量。目前关于该主题的研究非常广泛且碎片化。我们提供了一个全面的综述,将现有的动态神经网络研究在计算机视觉领域的发现进行综合和统一。此外,我们基于网络中哪个部分是自适应的(输出、计算图或输入)提供了逻辑分类法。并且我们认为,在传感器融合的背景下,动态神经网络特别有益于更好的适应性、噪声减少和信息优先级处理。我们在这一方向上提出了初步工作。
https://arxiv.org/abs/2501.07451
We describe the development of a one-credit course to promote AI literacy at The University of Texas at Austin. In response to a call for the rapid deployment of a class to serve a broad audience in Fall 2023, we designed a 14-week seminar-style course that incorporated an interdisciplinary group of speakers who lectured on topics ranging from the fundamentals of AI to societal concerns including disinformation and employment. University students, faculty, and staff, and even community members outside of the University, were invited to enroll in this online offering: The Essentials of AI for Life and Society. We collected feedback from course participants through weekly reflections and a final survey. Satisfyingly, we found that attendees reported gains in their AI literacy. We sought critical feedback through quantitative and qualitative analysis, which uncovered challenges in designing a course for this general audience. We utilized the course feedback to design a three-credit version of the course that is being offered in Fall 2024. The lessons we learned and our plans for this new iteration may serve as a guide to instructors designing AI courses for a broad audience.
我们描述了在德克萨斯大学奥斯汀分校开发一门一学分课程以促进人工智能素养的过程。为了响应于2023年秋季迅速部署此类课程的需求,为广泛受众服务,我们设计了一门为期14周的研讨式课程,并邀请了一个跨学科讲者团队来讲解从人工智能基础知识到包括虚假信息和就业在内的社会关注问题等主题。该课程向大学学生、教职工以及校外社区成员开放:《面向生活与社会的人工智能基础》。我们通过每周反思和最终调查收集了学员们的反馈。令人欣慰的是,参与者报告说他们的AI素养有所提高。我们还通过定量和定性分析寻求关键的批评反馈,这些反馈揭示了为这样广泛的受众设计课程所面临的挑战。我们利用课程反馈来设计一门将于2024年秋季提供的三学分版本课程。我们从这一过程中汲取的经验教训以及对于新版本课程的设计计划可能成为面向广泛受众开设AI课程的教师们的指南。
https://arxiv.org/abs/2501.07392
Lifelong learning, also known as continual or incremental learning, is a crucial component for advancing Artificial General Intelligence (AGI) by enabling systems to continuously adapt in dynamic environments. While large language models (LLMs) have demonstrated impressive capabilities in natural language processing, existing LLM agents are typically designed for static systems and lack the ability to adapt over time in response to new challenges. This survey is the first to systematically summarize the potential techniques for incorporating lifelong learning into LLM-based agents. We categorize the core components of these agents into three modules: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for grounded interactions with the dynamic environment. We highlight how these pillars collectively enable continuous adaptation, mitigate catastrophic forgetting, and improve long-term performance. This survey provides a roadmap for researchers and practitioners working to develop lifelong learning capabilities in LLM agents, offering insights into emerging trends, evaluation metrics, and application scenarios. Relevant literature and resources are available at this https URL.
终身学习,也被称为连续或增量学习,是推动通用人工智能(AGI)发展的关键组成部分,它使系统能够持续适应动态环境。虽然大型语言模型(LLM)在自然语言处理方面表现出色,但现有的LLM代理通常为静态系统设计,并且缺乏随着时间应对新挑战而进行自我调整的能力。这项调查首次系统地总结了将终身学习纳入基于LLM的代理中的潜在技术。我们将这些代理的核心组件分为三个模块:感知模块用于多模态输入整合、记忆模块用于存储和检索不断演变的知识,以及行动模块用于与动态环境进行具体交互。 我们强调这三个支柱如何共同实现持续适应性,缓解灾难性遗忘(即在学习新任务时忘记旧任务的能力),并提高长期性能。这项调查为致力于开发基于LLM代理终身学习能力的研究人员和实践者提供了一份路线图,并提供了关于新兴趋势、评估指标和应用场景的见解。 相关文献和资源可以在[\href{this url}{此链接}]获取。
https://arxiv.org/abs/2501.07278
Considering the lack of a unified framework for image description and deep cultural analysis at the subject level in the field of Ancient Chinese Paintings (ACP), this study utilized the Beijing Palace Museum's ACP collections to develop a semantic model integrating the iconological theory with a new workflow for term extraction and mapping. Our findings underscore the model's effectiveness. SDM can be used to support further art-related knowledge organization and cultural exploration of ACPs.
鉴于在古代中国绘画(ACP)领域缺乏统一的图像描述和深度文化分析框架,本研究利用北京故宫博物院的ACP收藏开发了一种语义模型,该模型结合了图像学理论,并设计了一个新的术语提取和映射工作流程。我们的研究成果强调了该模型的有效性。SDM可以用于支持进一步的艺术相关知识组织以及对ACP的文化探索。
https://arxiv.org/abs/2501.08352
Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days through major breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such as dataset bias, model interpretability, and the need for common-sense reasoning. Lastly, we discuss the emerging trends in large multimodal language models and the integration of external knowledge, offering insights into the future directions of VQA. This paper aims to provide a comprehensive overview of the evolution of VQA, highlighting both its current state and potential advancements.
视觉问答(Visual Question Answering,VQA)是一个跨学科领域,它将计算机视觉(CV)和自然语言处理(NLP)相结合,使人工智能系统能够回答关于图像的问题。自2015年首次提出以来,VQA迅速发展,受到深度学习、注意力机制以及基于变压器模型的推动。本文回顾了VQA从早期阶段到重大突破的发展历程,包括注意力机制的应用、组合推理能力的提升及视觉-语言预训练方法的兴起。文中强调了一些关键模型、数据集和技巧在VQA系统发展中所起的作用,并特别指出了Transformer架构以及多模态预训练对近期进展的重要性。此外,本文还探讨了VQA技术在医疗保健等领域的专业应用及其面临的挑战,如数据集偏差问题、模型可解释性和常识推理需求。最后,文章讨论了大型多模态语言模型及外部知识集成的新兴趋势,并为未来的发展方向提供了见解。本论文旨在全面概述VQA领域的发展历程,不仅总结当前现状还展望未来发展潜力。
https://arxiv.org/abs/2501.07109
The global AI surge demands crowdworkers from diverse languages and cultures. They are pivotal in labeling data for enabling global AI systems. Despite global significance, research has primarily focused on understanding the perspectives and experiences of US and India crowdworkers, leaving a notable gap. To bridge this, we conducted a survey with 100 crowdworkers across 16 Latin American and Caribbean countries. We discovered that these workers exhibited pride and respect for their digital labor, with strong support and admiration from their families. Notably, crowd work was also seen as a stepping stone to financial and professional independence. Surprisingly, despite wanting more connection, these workers also felt isolated from peers and doubtful of others' labor quality. They resisted collaboration and gender-based tools, valuing gender-neutrality. Our work advances HCI understanding of Latin American and Caribbean crowdwork, offering insights for digital resistance tools for the region.
全球人工智能热潮需要来自各种语言和文化背景的众包工作者。他们对于标注数据、推动全球化的人工智能系统至关重要。尽管在全球范围内具有重要性,但相关研究主要集中在了解美国和印度众包工作者的观点和经历上,这留下了一个明显的空白。为了填补这一空白,我们对来自拉丁美洲及加勒比海地区16个国家的100名众包工作者进行了调查。 我们发现这些工人对自己的数字劳动感到自豪,并且得到了家庭成员的强大支持与敬仰。值得注意的是,众包工作也被视为迈向经济和职业独立的重要一步。令人惊讶的是,尽管渴望更多的交流联系,这些工人却感觉与其他同事隔离并且怀疑他人的工作质量。他们抵制合作以及基于性别的工具,重视性别中立性。 我们的研究推进了人机交互领域对拉丁美洲及加勒比海地区众包工作的理解,并为该地区的数字抵抗工具有所贡献,提供了有价值的见解。
https://arxiv.org/abs/2501.06981
Large language models (LLMs) have revolutionized scientific research with their exceptional capabilities and transformed various fields. Among their practical applications, LLMs have been playing a crucial role in mitigating threats to human life, infrastructure, and the environment. Despite growing research in disaster LLMs, there remains a lack of systematic review and in-depth analysis of LLMs for natural disaster management. To address the gap, this paper presents a comprehensive survey of existing LLMs in natural disaster management, along with a taxonomy that categorizes existing works based on disaster phases and application scenarios. By collecting public datasets and identifying key challenges and opportunities, this study aims to guide the professional community in developing advanced LLMs for disaster management to enhance the resilience against natural disasters.
大型语言模型(LLMs)凭借其卓越的能力革新了科学研究,并在各个领域产生了深远影响。在实际应用中,LLMs 在缓解对人类生命、基础设施和环境的威胁方面发挥了关键作用。尽管灾难领域的 LLM 研究正在不断增长,但对于自然灾害管理中的 LLM 的系统性回顾和深入分析仍存在不足。为了解决这一缺口,本文提供了一份关于现有 LLMS 在自然灾害管理中应用情况的全面综述,并根据灾害阶段和应用场景对现有研究进行了分类。通过收集公共数据集并识别关键挑战与机遇,本研究旨在指导专业社区开发先进的 LLM 以增强对自然灾害的抵御能力。
https://arxiv.org/abs/2501.06932
Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
在基于众包标签训练的模型可能无法反映更广泛人群的观点,尤其是在标注者群体不具备代表性的情况下。由于收集具代表性的标签具有挑战性,我们提出了一种名为Population-Aligned Instance Replication (PAIR)的方法来通过统计调整解决这种偏差问题。通过模拟仇恨言论和冒犯性语言检测的研究,我们创建了两种不同倾向的标注员类型,并生成了不同比例类型的语料库。基于不平衡标注者群体训练的模型与基于代表性数据训练的模型相比,校准效果较差。然而,PAIR 通过复制少数派标注组的标签来匹配总体比例,从而显著减少了偏差,而无需收集新的数据。 这些结果表明,即使在无法获得具代表性的标注者群体的情况下,调查研究中的统计技术也可以帮助使模型训练与目标人群对齐。我们最后提出了三项实用建议以提高训练数据的质量。
https://arxiv.org/abs/2501.06826
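The core idea of PAIR as the abstract states it, duplicating labels from underrepresented annotator groups until group shares match the population, can be illustrated with a few lines of resampling code. The sketch below is one plausible reading and not the authors' implementation: the function name, the `(group, label)` record layout, and the rounding of replication factors to whole copies are all assumptions.

```python
from collections import Counter

def replicate_to_population(records, target_share):
    """Duplicate records so annotator-group shares approximate target_share.

    records      : list of (group, label) tuples (hypothetical layout)
    target_share : dict mapping group -> desired population proportion
    The most overrepresented group is kept as-is (factor 1); other groups
    are copied a whole number of times, so the match is approximate.
    """
    counts = Counter(group for group, _ in records)
    total = len(records)
    # Ratio > 1 means the group is underrepresented relative to the target.
    ratio = {g: target_share[g] / (counts[g] / total) for g in counts}
    base = min(ratio.values())
    factor = {g: max(1, round(ratio[g] / base)) for g in counts}
    replicated = []
    for group, label in records:
        replicated.extend([(group, label)] * factor[group])
    return replicated
```

No new labels are collected: the adjustment only reweights the annotations already in hand, which is the same logic as post-stratification weighting in survey research.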
This study introduces a novel method that employs tag annotation coupled with the ChatGPT language model to analyze student learning behaviors and generate personalized feedback. Central to this approach is the conversion of complex student data into an extensive set of tags, which are then decoded through tailored prompts to deliver constructive feedback that encourages rather than discourages students. This methodology focuses on accurately feeding student data into large language models and crafting prompts that enhance the constructive nature of feedback. The effectiveness of this approach was validated through surveys conducted with over 20 mathematics teachers, who confirmed the reliability of the generated reports. This method can be seamlessly integrated into intelligent adaptive learning systems or provided as a tool to significantly reduce the workload of teachers, providing accurate and timely feedback to students. By transforming raw educational data into interpretable tags, this method supports the provision of efficient and timely personalized learning feedback that offers constructive suggestions tailored to individual learner needs.
这项研究介绍了一种新颖的方法,该方法结合了标签标注和ChatGPT语言模型来分析学生的学习行为,并生成个性化反馈。此方法的核心在于将复杂的学生成绩数据转化为一系列广泛的标签,然后通过定制的提示语解码这些标签以提供鼓励性的而非否定性的反馈。这一方法的重点是准确地将学生数据输入大型语言模型,并设计能够增强反馈建设性的提示。 这种方法的有效性已通过一项针对20多名数学教师的调查得到了验证,调查显示生成的报告具有可靠性。此方法可以无缝集成到智能自适应学习系统中或作为工具提供给老师使用,以显著减轻他们的工作负担,为学生提供准确及时的反馈。 通过将原始教育数据转化为可解读的标签,这种方法支持高效且适时地提供个性化的学习反馈,这些建议能够根据每个学习者的具体需求量身定制。
https://arxiv.org/abs/2501.06819