This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.
https://arxiv.org/abs/2404.12349
Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
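To make the trajectory parameterization concrete, below is a minimal sketch (not the authors' code) of the core idea: each keypoint follows a cubic Bézier curve whose control points are the optimization variables, and a gradient-based objective drives them. The smoothness/anchoring term here is only a stand-in for the VSDS loss, which in AniClipart is computed by rendering frames through the differentiable ARAP warp and scoring them with a pretrained text-to-video diffusion model.

```python
import torch

def cubic_bezier(ctrl, t):
    """Evaluate cubic Bezier curves at times t.
    ctrl: (K, 4, 2) control points for K keypoints; t: (T,) values in [0, 1].
    Returns (K, T, 2) keypoint positions over T frames."""
    t = t.view(1, -1, 1)                                       # (1, T, 1)
    p0, p1, p2, p3 = (ctrl[:, i:i + 1, :] for i in range(4))   # each (K, 1, 2)
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

K, T = 8, 16
start = torch.rand(K, 2)                                       # keypoints detected on the clipart
ctrl = start.unsqueeze(1).repeat(1, 4, 1).clone().requires_grad_(True)
opt = torch.optim.Adam([ctrl], lr=1e-2)
t = torch.linspace(0.0, 1.0, T)

for step in range(200):
    traj = cubic_bezier(ctrl, t)                               # (K, T, 2) keypoint trajectories
    # Placeholder objective: in the paper, frames rendered from these keypoints
    # via differentiable ARAP deformation are scored with the VSDS loss instead.
    smooth = (traj[:, 2:] - 2 * traj[:, 1:-1] + traj[:, :-2]).pow(2).mean()
    anchor = (traj[:, 0] - start).pow(2).mean()
    loss = smooth + anchor
    opt.zero_grad(); loss.backward(); opt.step()
```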
https://arxiv.org/abs/2404.12347
Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited-field-of-view cameras under 180-degree rotations, has proven challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing-viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state of the art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.
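The double distance matrix idea can be illustrated with a small sketch (illustrative only, using toy brute-force matching rather than the paper's matcher): one descriptor distance matrix is built against the reference sequence in its original order (similar-direction hypothesis) and one against the reversed sequence (opposing-direction hypothesis), and sequence matching is run over both so no prior knowledge of the viewpoint is needed.

```python
import numpy as np

def distance_matrix(query, reference):
    """Pairwise Euclidean distances between query (Nq, D) and reference (Nr, D) descriptors."""
    return np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)

def sequence_cost(D, i, j, L):
    """Summed distance along a length-L diagonal starting at (i, j)."""
    idx = np.arange(L)
    return D[i + idx, j + idx].sum()

def match(query, reference, L=5):
    D_sim = distance_matrix(query, reference)         # similar-direction hypothesis
    D_opp = distance_matrix(query, reference[::-1])   # opposing-direction hypothesis
    best = (np.inf, None, None)
    for D, direction in ((D_sim, "similar"), (D_opp, "opposing")):
        Nq, Nr = D.shape
        for i in range(Nq - L + 1):
            for j in range(Nr - L + 1):
                cost = sequence_cost(D, i, j, L)
                if cost < best[0]:
                    best = (cost, direction, j)
    return best  # (cost, inferred direction, matched reference start index)

query = np.random.rand(20, 64)       # stand-ins for structure-based descriptors from stereo VO
reference = np.random.rand(200, 64)
print(match(query, reference))
```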
https://arxiv.org/abs/2404.12339
Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images, and videos is appealing, but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically captured in the text descriptions. Since the user queries are not known a priori, developing a system for multimodal-to-text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of large corpora of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal-to-text conversion times, overcomes information loss issues by doing on-demand, query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that the quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.
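The incremental workflow described above can be sketched roughly as follows; `embed`, `cheap_caption`, and `detailed_extract` are placeholders (not iRAG's actual components) standing in for a text embedder, a fast ingest-time captioner, and an expensive query-time vision-language extractor, respectively.

```python
import numpy as np

def embed(text):                        # placeholder text-embedding model
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(128)

def cheap_caption(segment):             # placeholder: fast, coarse captioner run once at ingest time
    return f"coarse caption of {segment}"

def detailed_extract(segment, query):   # placeholder: expensive extraction run only on demand
    return f"details of {segment} relevant to '{query}'"

class IncrementalIndex:
    def __init__(self, segments):
        self.segments = segments
        # One-time, cheap pass over the whole corpus (fast ingestion).
        self.vectors = np.stack([embed(cheap_caption(s)) for s in segments])

    def query(self, question, k=3):
        q = embed(question)
        sims = self.vectors @ q
        top = np.argsort(-sims)[:k]
        # Incremental step: rich, query-specific extraction only on retrieved segments.
        context = [detailed_extract(self.segments[i], question) for i in top]
        return "\n".join(context)        # context passed to the LLM for answer generation

index = IncrementalIndex([f"video_segment_{i:03d}" for i in range(100)])
print(index.query("who enters the room after the alarm sounds?"))
```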
https://arxiv.org/abs/2404.12309
This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
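A rough sketch of the augmentation step follows; `call_llm` is a placeholder for any promptable LLM, and the prompt wording and the `[EMOTION]` separator are illustrative assumptions rather than the paper's exact template.

```python
EMOTION_PROMPT = (
    "Text: {text}\n"
    "In one short sentence, describe the speaker's underlying emotion and any "
    "mismatch between the literal sentiment and the intended sentiment."
)

def call_llm(prompt):                    # placeholder for a prompted LLM
    return "surface enthusiasm masking frustration"

def augment_for_irony(text):
    cues = call_llm(EMOTION_PROMPT.format(text=text))
    # The augmented input (original text plus emotional cues) is what the
    # BERT/T5/GPT-2 irony classifiers are fine-tuned on.
    return f"{text} [EMOTION] {cues}"

print(augment_for_irony("Great, another Monday morning meeting."))
```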
https://arxiv.org/abs/2404.12291
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs present potential, they face critical risks of data leakage due to the need to transmit data to external servers and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning to the respective downstream tasks and mitigating uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and RougeL-score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
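The referee-consortium idea can be sketched as below (illustrative only): each participant's personalized LLM scores a candidate answer locally, and the federation aggregates the scores so that no single referee, and no external evaluation service, is relied upon.

```python
from statistics import mean

def referee_score(referee, question, answer):
    # Placeholder: each participant runs its own personalized LLM locally, so
    # neither the data nor the answers leave the federation for evaluation.
    return referee(question, answer)          # expected to return a score in [0, 1]

def federated_evaluate(referees, question, answer):
    scores = [referee_score(r, question, answer) for r in referees]
    return mean(scores), scores

# Toy referees with different judging tendencies, standing in for personalized LLMs.
referees = [lambda q, a: 0.8, lambda q, a: 0.7, lambda q, a: 0.9]
aggregate, per_referee = federated_evaluate(referees, "Summarize the clause.", "The clause limits liability ...")
print(aggregate, per_referee)
```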
https://arxiv.org/abs/2404.12273
Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability, because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics-integrated generative model. To this end, we use a variational autoencoder as the generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics and the trainable data-driven components using planar normalizing flows. The normalizing-flow-based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of the generative model against noise injected into the model, we propose a modification in the encoder part of the normalizing-flow-based VAE. We design the encoder to incorporate scaled dot-product attention-based contextual information in the noisy latent vector, which mitigates the adverse effect of noise in the latent vector and makes the model more robust. We empirically evaluated our models on a human locomotion dataset [33], and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected into the model.
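For reference, a minimal planar normalizing-flow layer in the standard formulation, f(z) = z + u * tanh(w·z + b), is sketched below; this is the textbook building block rather than the paper's full physics-integrated model, and the invertibility constraint on u is omitted for brevity.

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar flow layer: f(z) = z + u * tanh(w.z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                   # z: (batch, dim)
        lin = z @ self.w + self.b                           # (batch,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)      # transformed sample
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return f, log_det

# Stacking layers maps the Gaussian posterior sample z0 to zK; the summed
# log-determinants enter the ELBO. The paper applies such flows to both the
# physics latent and the data-driven latent components.
layers = [PlanarFlow(8) for _ in range(4)]
z, total_log_det = torch.randn(32, 8), 0.0
for layer in layers:
    z, log_det = layer(z)
    total_log_det = total_log_det + log_det
```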
https://arxiv.org/abs/2404.12267
Facial expression recognition is a pivotal component in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.
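A schematic of the pseudo-rehearsal step with the quality gate is given below (placeholders only, not the paper's code): replayed GAN images are kept only when the previously trained classifier still recognizes them with high confidence, and the surviving images are mixed with the new task's data.

```python
import numpy as np

def generate(gan, label, n):                  # placeholder conditional-GAN sampler
    return np.random.rand(n, 48, 48, 1), np.full(n, label)

def predict_proba(model, images):             # placeholder: old-task CNN class probabilities
    return np.random.dirichlet(np.ones(7), size=len(images))

def build_replay_set(gan, old_model, old_labels, per_class=200, min_conf=0.5):
    keep_x, keep_y = [], []
    for label in old_labels:
        x, y = generate(gan, label, per_class)
        conf = predict_proba(old_model, x)[np.arange(len(x)), y]
        mask = conf >= min_conf               # quality-assurance filter on generated images
        keep_x.append(x[mask]); keep_y.append(y[mask])
    return np.concatenate(keep_x), np.concatenate(keep_y)

# The filtered replay set is shuffled together with the new expression dataset,
# so the CNN learns the new task while rehearsing the old classes.
replay_x, replay_y = build_replay_set(gan=None, old_model=None, old_labels=range(7))
```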
https://arxiv.org/abs/2404.12260
Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level, open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, all manually annotated based on a pre-defined schema for the military domain comprising 8 event types and 11 argument role types. We designed a two-stage, multi-turn annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE are noticeably lower than those on datasets from other domains, which demonstrates that event extraction for the military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from this https URL.
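To make the annotation target concrete, here is a purely illustrative record layout; the field names and the event-type/role labels below are hypothetical stand-ins (the actual 8 event types and 11 argument roles are defined in CMNEE's schema), and real CMNEE documents are in Chinese.

```python
example_document = {
    "doc_id": "example_000001",
    "text": "A certain unit reportedly held a joint military exercise in a coastal area on Tuesday.",
    "events": [
        {
            "event_type": "Exercise",                 # hypothetical label; CMNEE defines 8 types
            "trigger": {"text": "exercise"},
            "arguments": [                            # hypothetical roles; CMNEE defines 11 role types
                {"role": "Subject", "text": "A certain unit"},
                {"role": "Location", "text": "a coastal area"},
                {"role": "Date", "text": "Tuesday"},
            ],
        }
    ],
}
```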
https://arxiv.org/abs/2404.12242
Lexicon-based retrieval has gained significant popularity in text retrieval due to its efficient and robust performance. To further enhance the performance of lexicon-based retrieval, researchers have been diligently incorporating state-of-the-art methodologies like neural retrieval and text-level contrastive learning approaches. Nonetheless, despite the promising outcomes, current lexicon-based retrieval methods have paid limited attention to exploring the potential benefits of feature context representations and term-level knowledge guidance. In this paper, we introduce an innovative method built on FEature Context and TErm-level Knowledge modules (FecTek). To effectively enrich the feature context representations of term weights, the Feature Context Module (FCM) is introduced, which leverages the power of BERT's representation to determine dynamic weights for each element in the embedding. Additionally, we develop a term-level knowledge guidance module (TKGM) for effectively utilizing term-level knowledge to intelligently guide the modeling process of term weights. Evaluation of the proposed method on the MS Marco benchmark demonstrates its superiority over previous state-of-the-art approaches.
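The idea of context-dependent term weights can be probed with a short sketch (an illustration, not FecTek's actual FCM): token representations from BERT are mapped through a small projection to non-negative weights, which a lexicon-based retriever could use in place of static term frequencies. The linear projection and softplus activation here are assumptions made for the sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
to_weight = torch.nn.Linear(encoder.config.hidden_size, 1)   # assumed projection head

def term_weights(text):
    inputs = tokenizer(text, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state              # (1, seq_len, 768) contextual features
    weights = torch.nn.functional.softplus(to_weight(hidden)).squeeze(-1)  # (1, seq_len), non-negative
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, weights[0].tolist()))

# Each token now carries a context-dependent weight that a sparse (lexicon-based)
# retriever can use in place of static term frequencies.
print(term_weights("what causes northern lights"))
```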
https://arxiv.org/abs/2404.12152
Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,401 character decision points from 395 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available.
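In the spirit of persona-based memory retrieval, a rough sketch is shown below; the embedder and prompt template are placeholders, not the paper's CHARMAP implementation.

```python
import numpy as np

def embed(text):                                   # placeholder sentence embedder
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(256)

def retrieve_memories(character_passages, decision_context, k=3):
    q = embed(decision_context)
    sims = [float(embed(p) @ q) for p in character_passages]
    top = np.argsort(sims)[::-1][:k]
    return [character_passages[i] for i in top]

def build_prompt(character, memories, decision_context, options):
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        f"You are role-playing {character}.\n"
        f"Relevant memories from the preceding story:\n{memory_block}\n\n"
        f"Situation: {decision_context}\n"
        f"Options: {options}\nWhich option does {character} choose, and why?"
    )

# The filled prompt is what the role-playing LLM answers; accuracy is measured
# against the character's actual decision in the novel.
```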
https://arxiv.org/abs/2404.12138
Knowledge Tracing (KT) aims to trace changes in students' knowledge states throughout their entire learning process by analyzing their historical learning data and predicting their future learning performance. Existing forgetting-curve-theory-based knowledge tracing models only consider the general forgetting caused by time intervals, ignoring the individualization of students and the causal relationships within the forgetting process. To address these problems, we propose a Concept-driven Personalized Forgetting knowledge tracing model (CPF) which integrates hierarchical relationships between knowledge concepts and incorporates students' personalized cognitive abilities. First, we integrate the students' personalized capabilities into both the learning and forgetting processes to explicitly distinguish students' individual learning gains and forgetting rates according to their cognitive abilities. Second, we take into account the hierarchical relationships between knowledge points and design a precursor-successor knowledge concept matrix to simulate the causal relationship in the forgetting process, while also integrating the potential impact of forgetting prior knowledge points on subsequent ones. The proposed personalized forgetting mechanism can be applied not only to the learning of specific knowledge concepts but also to the life-long learning process. Extensive experimental results on three public datasets show that our CPF outperforms current forgetting-curve-theory-based methods in predicting student performance, demonstrating that CPF can better simulate changes in students' knowledge states through the personalized forgetting mechanism.
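A toy sketch of the two mechanisms described above, personalized forgetting rates and precursor-successor propagation, is given below; the functional forms and constants are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

def step(mastery, practiced, ability, dt, precursor):
    """mastery: (C,) per-concept mastery in [0, 1]; practiced: (C,) bool mask of
    concepts exercised this step; ability: scalar in (0, 1]; dt: time gap;
    precursor: (C, C) matrix, precursor[i, j] = strength of concept i as a
    prerequisite of concept j."""
    # Personalized exponential forgetting: lower-ability students forget faster.
    decay = np.exp(-(1.0 - 0.5 * ability) * dt)
    mastery = mastery * decay
    # Forgetting a prerequisite drags down its successor concepts.
    mastery = mastery - 0.1 * (precursor.T @ (1.0 - mastery)) * (1.0 - mastery)
    # Personalized learning gain on the concepts practiced this step.
    mastery = np.where(practiced, mastery + ability * (1.0 - mastery), mastery)
    return np.clip(mastery, 0.0, 1.0)

C = 5
mastery = np.full(C, 0.6)
precursor = np.triu(np.full((C, C), 0.2), k=1)   # toy prerequisite structure
practiced = np.array([True, False, False, True, False])
print(step(mastery, practiced, ability=0.7, dt=1.0, precursor=precursor))
```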
https://arxiv.org/abs/2404.12127
Global conflicts and trouble spots have thrown the world into turmoil. Intelligence services have never been as necessary as they are today when it comes to providing political decision-makers with concrete, accurate, and up-to-date decision-making knowledge. This requires close cooperation, a common working language, and a mutual understanding of each other. The best way to create this "intelligence community" is through harmonized intelligence education. In this paper, we show how joint intelligence education can succeed. We draw on the experience of Germany, where all intelligence services and the Bundeswehr are academically educated together in a single degree program that lays the foundations for a common working language. We also show how these experiences have been successfully transferred to a European level, namely to ICE, the Intelligence College in Europe. Our experience has shown that three aspects are particularly important: first, interdisciplinarity or, better, transdisciplinarity; second, the integration of IT know-how; and third, the development and learning of methodological skills. Using the example of the cyber intelligence module, with its special focus on data-driven decision support and its many points of reference to numerous other academic modules, we show how the analytic methodology presented is embedded in our specific European teaching context.
https://arxiv.org/abs/2404.12125
Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping, and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes, and spans 339 km$^2$ across 449 distinct monospecific forests, and is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, the recommended evaluation methodology, and establish a baseline performance from both 3D and 2D modalities.
https://arxiv.org/abs/2404.12064
We introduce RAM, an innovative RAG-based framework with an ever-improving memory. Inspired by humans' pedagogical process, RAM utilizes recursively reasoning-based retrieval and experience reflections to continually update the memory and learn from users' communicative feedback, namely communicative learning. Extensive experiments with both simulated and real users demonstrate significant improvements over traditional RAG and self-knowledge methods, particularly excelling in handling false premise and multi-hop questions. Furthermore, RAM exhibits promising adaptability to various feedback and retrieval method chain types, showcasing its potential for advancing AI capabilities in dynamic knowledge acquisition and lifelong learning.
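The communicative-learning loop can be sketched schematically as follows (placeholders throughout; RAM's recursive, reasoning-based retrieval is reduced here to a trivial memory lookup): answer with retrieved memory, then reflect the user's feedback into a new memory entry so that later queries benefit from it.

```python
def call_llm(prompt):                      # placeholder LLM call
    return "draft answer / reflection"

class Memory:
    def __init__(self):
        self.notes = []

    def retrieve(self, query, k=3):
        # Placeholder relevance: RAM uses recursive, reasoning-based retrieval here.
        return self.notes[-k:]

    def add(self, note):
        self.notes.append(note)

def interact(memory, question, user_feedback):
    context = "\n".join(memory.retrieve(question))
    answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    # Reflection: turn the user's communicative feedback into a reusable memory entry.
    reflection = call_llm(
        f"Question: {question}\nAnswer given: {answer}\n"
        f"User feedback: {user_feedback}\nWrite the corrected fact to remember:"
    )
    memory.add(reflection)
    return answer

memory = Memory()
interact(memory, "Which year did the treaty enter into force?", "Actually it was ratified a year later.")
```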
https://arxiv.org/abs/2404.12045
Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security and privacy, and transmission restrictions. Although existing methods exploiting DFKD have achieved inspiring results in coarse-grained classification, in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories, sub-optimal results are obtained. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization (FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.
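As background, the generic data-free adversarial distillation loop that such methods build on can be sketched as below; the paper's attention generator, mixed high-order attention distillation, and semantic feature contrast terms are omitted, and `teacher`, `student`, and `generator` are assumed to be ordinary PyTorch modules.

```python
import torch
import torch.nn.functional as F

def dfkd_step(teacher, student, generator, opt_s, opt_g, z_dim=128, batch=64):
    # (1) Generator tries to synthesize images on which student and teacher disagree.
    z = torch.randn(batch, z_dim)
    fake = generator(z)
    loss_g = -F.l1_loss(student(fake), teacher(fake).detach())
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # (2) Student imitates the teacher on freshly generated images (no real data used).
    with torch.no_grad():
        fake = generator(torch.randn(batch, z_dim))
        t_logits = teacher(fake)
    s_logits = student(fake)
    loss_s = F.kl_div(F.log_softmax(s_logits, dim=1),
                      F.softmax(t_logits, dim=1), reduction="batchmean")
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    return loss_g.item(), loss_s.item()
```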
https://arxiv.org/abs/2404.12037
Micro-expressions (MEs) are involuntary movements revealing people's hidden feelings, and they have attracted considerable interest for their objectivity in emotion detection. However, despite its wide applications in various scenarios, micro-expression recognition (MER) remains a challenging problem in real life for three reasons: (i) data level: lack of data and imbalanced classes; (ii) feature level: subtle, rapidly changing, and complex features of MEs; and (iii) decision-making level: impact of individual differences. To address these issues, we propose a dual-branch meta-auxiliary learning method, called LightmanNet, for fast and robust micro-expression recognition. Specifically, LightmanNet learns general MER knowledge from limited data through a dual-branch bi-level optimization process: (i) In the first level, it obtains task-specific MER knowledge by learning in two branches, where the first branch learns MER features via primary MER tasks, while the other branch guides the model to obtain discriminative features via auxiliary tasks, i.e., image alignment between micro-expressions and macro-expressions, given their resemblance in both spatial and temporal behavioral patterns. The two branches jointly constrain the model to learn meaningful task-specific MER knowledge while avoiding noise or superficial connections between MEs and emotions that may damage its generalization ability. (ii) In the second level, LightmanNet further refines the learned task-specific knowledge, improving model generalization and efficiency. Extensive experiments on various benchmark datasets demonstrate the superior robustness and efficiency of LightmanNet.
https://arxiv.org/abs/2404.12024
Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts according to the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support only a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.
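A naive probe of the claim that CLIP already carries affordance signal might look like the sketch below (an assumption-laden illustration, not AffordanceCLIP itself, which additionally trains a small number of extra parameters): every ViT patch token is scored against a free-form action prompt and the scores are reshaped into a coarse map.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def affordance_map(image: Image.Image, action_prompt: str):
    inputs = processor(text=[action_prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        patches = model.vision_model(pixel_values=inputs["pixel_values"]).last_hidden_state
        patches = model.visual_projection(patches[:, 1:, :])   # drop CLS, project to the joint space
        text = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    patches = patches / patches.norm(dim=-1, keepdim=True)
    text = text / text.norm(dim=-1, keepdim=True)
    sims = (patches @ text.T).squeeze(-1).squeeze(0)            # one score per image patch
    side = int(sims.numel() ** 0.5)                             # 7x7 grid for ViT-B/32
    return sims.reshape(side, side)                             # coarse "where to {action}" map

# e.g. affordance_map(Image.open("mug.jpg"), "a photo of the part to hold")
```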
https://arxiv.org/abs/2404.12015
The digital divide describes disparities in access to and usage of digital tooling between social and economic groups. Emerging generative artificial intelligence tools, which strongly affect productivity, could magnify the impact of these divides. However, the affordability, multi-modality, and multilingual capabilities of these tools could also make them more accessible to diverse users in comparison with previous forms of digital tooling. In this study, we characterize spatial differences in U.S. residents' knowledge of a new generative AI tool, ChatGPT, through an analysis of state- and county-level search query data. In the first six months after the tool's release, we observe the highest rates of users searching for ChatGPT in West Coast states and persistently low rates of search in Appalachian and Gulf states. Counties with the highest rates of search are relatively more urbanized and have proportionally more educated, more economically advantaged, and more Asian residents in comparison with other counties or with the U.S. average. In multilevel models adjusting for socioeconomic and demographic factors as well as industry makeup, education is the strongest positive predictor of rates of search for generative AI tooling. Although generative AI technologies may be novel, early differences in uptake appear to be following familiar paths of digital marginalization.
https://arxiv.org/abs/2404.11988
Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. Cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and the interconnections among them within their instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations constrain their event reasoning abilities to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named the event quadruple, which captures the structure and semantics of events and provides a complete event representation. We then design event-relation learning based on these structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruples from a large-scale corpus. Finally, we fine-tune a Llama model with our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate that EvIT achieves competitive performance on event reasoning.
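A sketch of turning mined event structures into instruction-tuning examples is shown below; the four fields used for the quadruple here are placeholders chosen for illustration, since the paper defines its own quadruple structure, and the instruction wording is likewise assumed.

```python
# Hypothetical mined structure; the actual quadruple fields are defined in the paper.
quadruple = {
    "event": "the company recalled the vehicles",
    "arguments": {"agent": "the company", "object": "the vehicles"},
    "relation": "cause",
    "related_event": "brake defects were reported",
}

def to_instruction_example(q):
    return {
        "instruction": "Given the context event, infer a plausible event that is "
                       f"connected to it by the relation '{q['relation']}'.",
        "input": q["related_event"],
        "output": q["event"],
    }

# Examples like this, mined at scale from a corpus, are the data the Llama model
# is fine-tuned on so that event structure and relations are modeled explicitly.
print(to_instruction_example(quadruple))
```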
https://arxiv.org/abs/2404.11978