This paper addresses the problem of object-goal navigation for autonomous inspections in real-world environments. Object-goal navigation is crucial for effective inspections in various settings, often requiring the robot to identify the target object within a large search space. Current object inspection methods fall short of human efficiency because, unlike humans, they typically cannot bootstrap prior and common-sense knowledge. In this paper, we introduce a framework that enables robots to use semantic knowledge of the environment's prior spatial configuration together with common-sense semantic knowledge. We propose SEEK (Semantic Reasoning for Object Inspection Tasks), which combines semantic prior knowledge with the robot's observations to search for and navigate toward target objects more efficiently. SEEK maintains two representations: a Dynamic Scene Graph (DSG) and a Relational Semantic Network (RSN). The RSN is a compact and practical model that estimates the probability of finding the target object across the spatial elements of the DSG. We propose a novel probabilistic planning framework that searches for the object using relational semantic knowledge. Our simulation analyses demonstrate that SEEK outperforms the classical planning and Large Language Model (LLM)-based methods examined in this study in terms of efficiency on object-goal inspection tasks. We validated our approach on a physical legged robot in urban environments, showcasing its practicality and effectiveness in real-world inspection scenarios.
https://arxiv.org/abs/2405.09822
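To make the planning idea concrete, here is a minimal sketch (not the authors' implementation) of how semantic priors can drive the search: each candidate spatial element carries an RSN-style probability of containing the target and a travel cost, and the planner greedily inspects the element with the best probability-to-cost ratio. The room names, probabilities, and costs below are invented for illustration.

```python
# Hypothetical illustration: the probabilities stand in for what the paper's
# RSN would estimate over DSG nodes; costs stand in for travel distances.

def next_inspection_target(candidates):
    """candidates: dict mapping element name -> (P(target here), travel cost)."""
    return max(candidates, key=lambda name: candidates[name][0] / candidates[name][1])

rooms = {
    "kitchen": (0.55, 10.0),   # likely, but far away
    "hallway": (0.30, 4.0),    # less likely, but cheap to check first
    "office":  (0.15, 12.0),
}
print(next_inspection_target(rooms))  # hallway: 0.30/4 > 0.55/10
```

A full planner would re-estimate the probabilities after each observation and replan; this one-step greedy rule only illustrates the probability-cost trade-off.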
Continual learning, which involves sequential training on diverse tasks, often faces catastrophic forgetting. While knowledge distillation-based approaches exhibit notable success in preventing forgetting, we pinpoint a limitation in their ability to distill the cumulative knowledge of all the previous tasks. To remedy this, we propose Dense Knowledge Distillation (DKD). DKD uses a task pool to track the model's capabilities. It partitions the output logits of the model into dense groups, each corresponding to a task in the task pool, and then distills all tasks' knowledge using all groups. However, since using all the groups can be computationally expensive, we also suggest random group selection in each optimization step. Moreover, we propose an adaptive weighting scheme that balances the learning of new classes and the retention of old classes based on the count and similarity of the classes. Our DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios. Empirical analysis underscores DKD's ability to enhance model stability, promote flatter minima for improved generalization, and remain robust across various memory budgets and task orders. Moreover, it seamlessly integrates with other CL methods to boost performance and proves versatile in offline scenarios such as model compression.
https://arxiv.org/abs/2405.09820
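The group-partitioned distillation with random group selection can be sketched as below. This is a hedged, minimal illustration in plain Python (a real implementation would operate on tensors): logits are partitioned into per-task groups from the task pool, a random subset of groups is sampled each step, and a KL distillation loss is averaged over the sampled groups. The logit values and group layout are invented.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q); q from softmax is always positive
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dkd_loss(student_logits, teacher_logits, task_groups, n_sampled=2, rng=None):
    """Partition logits into per-task groups and distill a random subset of them."""
    rng = rng or random.Random(0)
    sampled = rng.sample(task_groups, min(n_sampled, len(task_groups)))
    loss = 0.0
    for group in sampled:
        loss += kl_div(softmax([teacher_logits[i] for i in group]),
                       softmax([student_logits[i] for i in group]))
    return loss / len(sampled)

task_groups = [[0, 1], [2, 3], [4, 5]]   # task pool: 3 tasks, 2 classes each
teacher = [2.0, 0.1, 1.5, 0.5, 0.2, 2.2]
student = [1.8, 0.3, 1.0, 1.0, 0.1, 2.0]
print(dkd_loss(student, teacher, task_groups))
```

Sampling groups rather than using all of them trades a noisier gradient for a cheaper step, which is the compute-saving choice the abstract describes.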
Large Language Models (LLMs) have recently gained significant attention in scientific discovery for their extensive knowledge and advanced reasoning capabilities. However, they encounter challenges in effectively simulating observational feedback and grounding it with language to propel advancements in physical scientific discovery. Conversely, human scientists undertake scientific discovery by formulating hypotheses, conducting experiments, and revising theories through observational analysis. Inspired by this, we propose to enhance the knowledge-driven, abstract reasoning abilities of LLMs with the computational strength of simulations. We introduce Scientific Generative Agent (SGA), a bilevel optimization framework: LLMs act as knowledgeable and versatile thinkers, proposing scientific hypotheses and reasoning about discrete components, such as physics equations or molecule structures; meanwhile, simulations function as experimental platforms, providing observational feedback and optimizing the differentiable continuous parts, such as physical parameters. We conduct extensive experiments to demonstrate our framework's efficacy in constitutive law discovery and molecular design, unveiling novel solutions that differ from conventional human expectations yet remain coherent upon analysis.
https://arxiv.org/abs/2405.09783
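The bilevel loop can be illustrated with a toy constitutive-law example. In this sketch the discrete level (which the paper delegates to an LLM) is reduced to a fixed list of candidate model forms, and the continuous level is a least-squares fit of each form's single parameter by finite-difference gradient descent, standing in for a differentiable simulator. The data and candidates are fabricated: the observations come from y = 3x^2.

```python
# Toy bilevel optimization: outer loop over discrete candidate equations,
# inner loop fitting the continuous parameter of each candidate.

def fit_parameter(model, xs, ys, theta=1.0, lr=0.02, steps=400, h=1e-5):
    def loss(t):
        return sum((model(x, t) - y) ** 2 for x, y in zip(xs, ys))
    for _ in range(steps):
        # central finite difference, standing in for simulator gradients
        grad = (loss(theta + h) - loss(theta - h)) / (2 * h)
        theta -= lr * grad
    return theta, loss(theta)

candidates = {                      # "hypotheses" a proposer might emit
    "linear":    lambda x, t: t * x,
    "quadratic": lambda x, t: t * x * x,
}
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.75, 3.0, 6.75, 12.0]        # generated by y = 3 * x^2
results = {name: fit_parameter(f, xs, ys) for name, f in candidates.items()}
best = min(results, key=lambda n: results[n][1])
print(best, round(results[best][0], 2))  # quadratic 3.0
```

The outer selection by fitted loss mirrors the hypothesis-revision step; in SGA the proposer would also use the feedback to generate new candidate forms.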
Temporal knowledge graphs (TKGs) can effectively model the ever-evolving nature of real-world knowledge, and they can be completed and enriched by inferring new events from existing ones. However, reasoning accuracy is adversely impacted by an imbalance between new and recurring events in the datasets. To achieve more accurate TKG reasoning, we propose an attention masking-based contrastive event network (AMCEN) with local-global temporal patterns for the two-stage prediction of future events. In the network, historical and non-historical attention mask vectors are designed to control the attention bias towards historical and non-historical entities, acting as the key to alleviating the imbalance. A local-global message-passing module is proposed to comprehensively capture multi-hop structural dependencies and local-global temporal evolution for in-depth exploration of the latent impact factors of different event types. A contrastive event classifier classifies events more accurately by incorporating local-global temporal patterns into contrastive learning. AMCEN thus refines the prediction scope with the results of the contrastive event classification and then uses attention masking-based decoders to finalize the specific outcomes. The results of our experiments on four benchmark datasets highlight the superiority of AMCEN. In particular, the considerable improvements in Hits@1 prove that AMCEN can make more precise predictions about future occurrences.
https://arxiv.org/abs/2405.10346
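The role of the historical and non-historical mask vectors can be sketched as follows. This is a simplified, hypothetical illustration of the two-stage idea: once the first stage decides whether the future event is historical (involves a previously seen entity) or not, a mask restricts the final argmax to the corresponding entity pool. Entity names and scores are invented.

```python
def masked_prediction(scores, historical_entities, predict_historical):
    """Stage 2 of a two-stage sketch: mask out entities from the other pool
    before taking the argmax over decoder scores."""
    allowed = {e: s for e, s in scores.items()
               if (e in historical_entities) == predict_historical}
    return max(allowed, key=allowed.get)

scores = {"france": 0.9, "japan": 0.7, "brazil": 0.4}   # hypothetical scores
seen_before = {"france", "japan"}                        # entities in past events

print(masked_prediction(scores, seen_before, predict_historical=True))   # france
print(masked_prediction(scores, seen_before, predict_historical=False))  # brazil
```

The point of the mask is visible in the second call: even though unseen entities score lower overall, restricting attention to them lets recurring-event bias be overridden when the classifier predicts a genuinely new event.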
Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate, since they were mainly designed for factual or situated reasoning and rarely involve broader real-world knowledge. Our work aims to delve deeper into reasoning evaluations, specifically within dynamic, open-world, and structured context knowledge. We propose a new benchmark, SOK-Bench, consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset, we propose an automatic and scalable generation method that produces question-answer pairs, knowledge graphs, and rationales by instructing combinations of LLMs and MLLMs. Concretely, we first extract observable situated entities, relations, and processes from videos for situated knowledge, and then extend to open-world knowledge beyond the visible content. Task generation proceeds through multiple rounds of dialogue and is subsequently corrected and refined by our designed self-promptings and demonstrations. With a corpus of both explicit situated facts and implicit commonsense, we generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance. We evaluated recent mainstream large vision-language models on the benchmark and found several insightful conclusions. For more information, please refer to our benchmark at this http URL.
https://arxiv.org/abs/2405.09713
Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon the real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark.
https://arxiv.org/abs/2405.09711
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. While UDA methods for synthetic-to-real domains (synth-to-real) show remarkable performance in tasks such as semantic segmentation and object detection, very few have been proposed for instance segmentation. In this paper, we introduce UDA4Inst, a synth-to-real UDA model for instance segmentation in autonomous driving. We propose a novel cross-domain bidirectional data mixing method at the instance level to fully leverage the data from both source and target domains. Rare-class balancing and category module training are also employed to further improve performance. Notably, we are the first to report results on two new synth-to-real instance segmentation benchmarks, with 39.0 mAP on UrbanSyn->Cityscapes and 35.7 mAP on Synscapes->Cityscapes. UDA4Inst also achieves the state-of-the-art result on SYNTHIA->Cityscapes with 31.3 mAP, +15.6 mAP higher than the latest approach. Our code will be released.
https://arxiv.org/abs/2405.09682
The ability to build and leverage world models is essential for a general-purpose AI agent. Testing such capabilities is hard, in part because the building blocks of world models are ill-defined. We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models by testing their ability to use knowledge of a concept to match a target text with a plausible or implausible context. EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans, ranging from social interactions (help/hinder) to spatial relations (left/right). Both contexts and targets are minimal pairs. Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) across a battery of evaluation paradigms, along with a human norming study comprising 12,480 measurements. The overall performance of all tested models is worse than human performance, with results varying drastically across domains. These data highlight simple cases where even large models fail and present rich avenues for targeted research on LLM world modeling capabilities.
https://arxiv.org/abs/2405.09605
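A minimal-pair item can be scored as sketched below. This is an illustrative stand-in: the benchmark evaluates LMs (e.g., via the target's likelihood under each context), whereas here a trivial word-overlap scorer is used just to show the plausible-vs-implausible comparison. The example item is invented, loosely in the spirit of the left/right spatial domain.

```python
def overlap_score(context, target):
    """Stand-in scorer; a real evaluation would use an LM's log-likelihood
    of the target given each context."""
    return len(set(context.lower().split()) & set(target.lower().split()))

def item_correct(plausible_ctx, implausible_ctx, target, score=overlap_score):
    # An item counts as correct when the plausible context scores higher.
    return score(plausible_ctx, target) > score(implausible_ctx, target)

item = {
    "plausible":   "the cup is left of the plate",
    "implausible": "the cup is right of the plate",
    "target":      "she reached left for the cup",
}
print(item_correct(item["plausible"], item["implausible"], item["target"]))  # True
```

Accuracy over a dataset is then the fraction of items where the plausible context wins, which is the comparison the human norming study also supports.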
Objective: Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments due to numerous hurdles impeding its translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles, leaving a significant knowledge gap. Despite efforts to implement real-world FL, there is also a notable lack of comprehensive assessments comparing FL to less complex alternatives. Materials & Methods: We extensively reviewed the FL literature, categorizing insights, along with our own findings, by their nature and by the phase of establishing an FL initiative, and summarized them into a comprehensive guide. We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. We extensively evaluated FL against less complex alternatives in three distinct evaluation scenarios. Results: The proposed guide outlines essential steps, identifies hurdles, and proposes solutions for establishing successful FL initiatives and conducting real-world experiments. Our experimental results show that FL outperforms less complex alternatives in all evaluation scenarios, justifying the effort required to translate FL into real-world applications. Discussion & Conclusion: Our proposed guide aims to help future FL researchers circumvent pitfalls and accelerate the translation of FL into radiological applications. Our results underscore the value of this effort by demonstrating advantageous performance over alternatives, and emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings.
https://arxiv.org/abs/2405.09409
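As background for readers unfamiliar with FL, the core aggregation step can be sketched with generic federated averaging (FedAvg). This is not RACOON's infrastructure, just a minimal illustration: each site trains locally and the server combines the resulting weights, weighted by local dataset size. All numbers are invented.

```python
def fed_avg(site_weights, site_sizes):
    """Server-side FedAvg: weighted average of per-site model weights,
    so the raw data never leaves the sites."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(dim)]

hospital_models = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]  # hypothetical local weights
cases_per_site = [100, 300, 100]                         # hypothetical dataset sizes
print(fed_avg(hospital_models, cases_per_site))
```

In practice this round repeats many times, with each site resuming local training from the aggregated model; the paper's "less complex alternatives" would instead train on one site's data or on naively pooled statistics.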
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model, via next-word prediction, on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, the small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when trained from scratch using a tokenizer specifically trained on neuroscience text and when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained even by small LLMs through domain-specific, auto-regressive training approaches.
https://arxiv.org/abs/2405.09395
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as device-related difficulties such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, together with the weights of the trained YOLOv7 model, is available at: this https URL.
https://arxiv.org/abs/2405.09355
Existing debiasing methods inevitably make unreasonable or undesired predictions because they are designed and evaluated to achieve parity across different social groups while leaving individual facts aside, resulting in modifications to existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and additionally constructed datasets; it systematically assesses debiasing performance with complementary metrics on fairness, specificity, and generalization. Meanwhile, we propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while not hampering the overall model capability for knowledge preservation, highlighting the prospect of fine-grained debiasing strategies for editable fairness in LLMs.
https://arxiv.org/abs/2405.09341
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, however, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
https://arxiv.org/abs/2405.09335
Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges. Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings. Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment. Results: The results demonstrated a significant difference in performance between the two models (p > 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users. Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
https://arxiv.org/abs/2405.09300
Medical image interpretation using deep learning has shown promise but often requires extensive expert-annotated datasets. To reduce this annotation burden, we develop an Image-Graph Contrastive Learning framework that pairs chest X-rays with structured report knowledge graphs automatically extracted from radiology notes. Our approach uniquely encodes the disconnected graph components via a relational graph convolution network and transformer attention. In experiments on the CheXpert dataset, this novel graph encoding strategy enabled the framework to outperform existing methods that use image-text contrastive learning in 1% linear evaluation and few-shot settings, while achieving comparable performance to radiologists. By exploiting unlabeled paired images and text, our framework demonstrates the potential of structured clinical insights to enhance contrastive learning for medical images. This work points toward reducing demands on medical experts for annotations, improving diagnostic precision, and advancing patient care through robust medical image understanding.
https://arxiv.org/abs/2405.09594
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. However, at present, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset that mainly consists of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
https://arxiv.org/abs/2405.09286
This paper focuses on the integration of generative techniques into spatial-temporal data mining, considering the significant growth and diverse nature of spatial-temporal data. With the advancements in RNNs, CNNs, and other non-generative techniques, researchers have explored their application in capturing temporal and spatial dependencies within spatial-temporal data. However, the emergence of generative techniques such as LLMs, SSL, Seq2Seq and diffusion models has opened up new possibilities for enhancing spatial-temporal data mining further. The paper provides a comprehensive analysis of generative technique-based spatial-temporal methods and introduces a standardized framework specifically designed for the spatial-temporal data mining pipeline. By offering a detailed review and a novel taxonomy of spatial-temporal methodology utilizing generative techniques, the paper enables a deeper understanding of the various techniques employed in this field. Furthermore, the paper highlights promising future research directions, urging researchers to delve deeper into spatial-temporal data mining. It emphasizes the need to explore untapped opportunities and push the boundaries of knowledge to unlock new insights and improve the effectiveness and efficiency of spatial-temporal data mining. By integrating generative techniques and providing a standardized framework, the paper contributes to advancing the field and encourages researchers to explore the vast potential of generative techniques in spatial-temporal data mining.
https://arxiv.org/abs/2405.09592
Graph neural networks have proven to be an efficient machine learning technique in real-life applications. Handwriting recognition is one such useful area, where both offline and online handwriting recognition are required. Chain codes as a feature extraction technique have shown significant results in the literature, and we have been able to use chain codes together with graph neural networks. To the best of our knowledge, this work presents for the first time a novel combination of handwritten trajectory features as chain codes with graph neural networks. The handwritten trajectories for offline handwritten text are evaluated using recovery of the drawing order, whereas online handwritten trajectories are used directly with chain codes. Our results show that the present combination surpasses previous results and minimizes the error rate in only a few epochs.
https://arxiv.org/abs/2405.09247
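For readers unfamiliar with chain codes, the sketch below shows a common Freeman 8-directional encoding of a pen trajectory (the paper's exact encoding may differ): each step between consecutive points is quantized to one of eight directions, yielding a compact symbolic feature sequence.

```python
# Freeman 8-directional chain code: direction 0 is east, increasing
# counter-clockwise in 45-degree steps.
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(points):
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx = (x1 > x0) - (x1 < x0)   # sign of the step in x
        dy = (y1 > y0) - (y1 < y0)   # sign of the step in y
        if (dx, dy) != (0, 0):       # skip repeated points
            codes.append(DIRS[(dx, dy)])
    return codes

stroke = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]   # a made-up pen trajectory
print(chain_code(stroke))  # [0, 1, 2, 4]
```

For online handwriting the point sequence comes directly from the pen trace; for offline text, as the abstract notes, a drawing order must first be recovered before such a code can be extracted.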
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
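The query-expansion step maps a free-text user query onto the system's known, detectable concepts, so users need no prior knowledge of the concept vocabulary. The abstract does not specify the mapping; the sketch below is a minimal, hypothetical illustration using token-overlap (Jaccard) similarity against hand-written concept vocabularies — a real system would more likely use word embeddings or a learned mapping. All concept names and term sets here are made up.

```python
# Hypothetical sketch of query expansion: rank detectable concepts by
# Jaccard (token-overlap) similarity to the user's free-text query.
# Concept names and vocabularies are illustrative, not from the paper.

def expand_query(query, concept_vocab):
    """Return detectable concepts ranked by Jaccard overlap with the query."""
    q_terms = set(query.lower().split())
    scored = []
    for concept, terms in concept_vocab.items():
        overlap = len(q_terms & terms)
        if overlap:
            # Jaccard similarity: |intersection| / |union|
            scored.append((overlap / len(q_terms | terms), concept))
    return [concept for _, concept in sorted(scored, reverse=True)]

concept_vocab = {
    "eyeglasses": {"glasses", "eyeglasses", "spectacles", "eyewear"},
    "moustache":  {"moustache", "mustache", "facial", "hair"},
    "beach":      {"beach", "sand", "sea", "coast"},
}
print(expand_query("person wearing glasses", concept_vocab))  # ['eyeglasses']
```

Because the ranking is over concept detectors the system already has, an out-of-vocabulary query simply returns an empty list rather than failing.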
https://arxiv.org/abs/2405.09194
Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM's distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE's efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
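Concretely, HRE's per-question evaluation can be sketched as a rank correlation between the LM's log-likelihoods for the candidate answers and the human scores for those answers. The sketch below uses Spearman's rho with hypothetical per-answer log-likelihoods; the paper's exact correlation measure, aggregation across questions, and likelihood normalization may differ.

```python
# Sketch of HumanRankEval-style scoring for a single question.
# Each candidate answer gets a (hypothetical) total log-likelihood under the
# LM; the item score is the rank correlation with the human scores.

def average_ranks(values):
    """1-based ranks of a list, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho between two equal-length score lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical numbers for three answers to one question:
lm_loglik = [-10.2, -12.5, -15.1]  # higher (less negative) = preferred by LM
human     = [5, 3, 1]              # higher = preferred by humans
print(spearman(lm_loglik, human))  # 1.0: LM and human rankings fully agree
```

A model whose likelihoods order the answers the same way humans do scores +1; a model that inverts the human ranking scores -1, so the metric is responsive to exactly the preference shifts instruction-tuning is meant to produce.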
https://arxiv.org/abs/2405.09186