The objective of BioCreative8 Track 3 is to extract key phenotypic medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
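As an illustration of the normalization step, the sketch below ranks HPO candidates for an extracted mention by embedding similarity over each term's name and synonyms. The encoder and the toy HPO subset are stand-ins; the paper's synonym-marginalization model is more involved.

```python
# Minimal sketch: map an extracted mention to an HPO term by scoring it
# against every surface form (name + synonyms) of each candidate term.
from sentence_transformers import SentenceTransformer, util

# Toy HPO subset (id -> surface forms); illustrative only.
hpo_terms = {
    "HP:0001250": ["Seizure", "Seizures", "Epileptic seizure"],
    "HP:0002240": ["Hepatomegaly", "Enlarged liver"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def normalize(mention: str) -> str:
    best_id, best_score = None, float("-inf")
    mention_emb = encoder.encode(mention)
    for hpo_id, surface_forms in hpo_terms.items():
        # Keep the best-matching synonym so diverse surface forms of the
        # same finding can still reach the right HPO term.
        score = util.cos_sim(mention_emb, encoder.encode(surface_forms)).max().item()
        if score > best_score:
            best_id, best_score = hpo_id, score
    return best_id

print(normalize("liver enlargement"))  # expected: HP:0002240
```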
https://arxiv.org/abs/2501.09744
The pivotal shift from traditional paper-based records to sophisticated Electronic Health Records (EHR) enabled systematic collection and analysis of patient data through descriptive statistics, providing insight into patterns and trends across patient populations. This evolution continued toward predictive analytics, allowing healthcare providers to anticipate patient outcomes and potential complications before they occur. This progression from basic digital record-keeping to sophisticated predictive modelling and digital twins reflects healthcare's broader evolution toward more integrated, patient-centred approaches that combine data-driven insights with personalized care delivery. This chapter explores the evolution and significance of healthcare information systems, beginning with an examination of the implementation of EHR in the UK and the USA. It provides a comprehensive overview of the International Classification of Diseases (ICD) system, tracing its development from ICD-9 to ICD-10. Central to this discussion is the MIMIC-III database, a landmark achievement in healthcare data sharing and arguably the most comprehensive critical care database freely available to researchers worldwide. MIMIC-III has democratized access to high-quality healthcare data, enabling unprecedented opportunities for research and analysis. The chapter examines its structure, clinical outcome analysis capabilities, and practical applications through case studies, with a particular focus on mortality and length of stay metrics, vital signs extraction, and ICD coding. Through detailed entity-relationship diagrams and practical examples, the text illustrates MIMIC's complex data structure and demonstrates how different querying approaches can lead to subtly different results, emphasizing the critical importance of understanding the database's architecture for accurate data extraction.
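To make the point about query sensitivity concrete, here is a small pandas sketch, assuming a CSV export of the standard MIMIC-III ICUSTAYS table, in which two reasonable ways of deriving ICU length of stay can disagree.

```python
# Sketch: two derivations of ICU length of stay from MIMIC-III-style data.
# Column names follow the published ICUSTAYS schema (INTIME, OUTTIME, LOS).
import pandas as pd

icustays = pd.read_csv("ICUSTAYS.csv", parse_dates=["INTIME", "OUTTIME"])

# Approach 1: trust the precomputed LOS column (fractional days).
los_precomputed = icustays["LOS"]

# Approach 2: recompute from timestamps. Stays with a missing OUTTIME yield
# NaN here but may still carry a LOS value, so summary statistics from the
# two approaches can differ subtly.
los_recomputed = (icustays["OUTTIME"] - icustays["INTIME"]).dt.total_seconds() / 86400

print(los_precomputed.describe())
print(los_recomputed.describe())
```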
https://arxiv.org/abs/2501.09640
In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. For this, we build on the CONFORM framework, which uses contrastive learning to improve the accuracy of generated images containing multiple objects. However, the depiction of actions involving multiple different objects still has large room for improvement. To improve this, we employ semantically hypergraphic contrastive adjacency learning, an enhanced contrastive structure combined with a "contrast but link" technique. We further amend Stable Diffusion's understanding of actions with InteractDiffusion. As evaluation metrics, we use CLIP image-text similarity and TIFA; in addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands only mediocrely. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: this https URL
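As a rough sketch of the contrastive idea, the toy loss below pulls together cross-attention maps of prompt tokens that describe the same object and pushes apart maps of different objects. It simplifies away the hypergraphic adjacency structure and is not the exact objective used here.

```python
# Toy multi-positive InfoNCE over cross-attention maps of prompt tokens.
import torch
import torch.nn.functional as F

def contrastive_attention_loss(attn_maps, groups, temperature=0.07):
    """attn_maps: (T, H, W) cross-attention maps for T prompt tokens.
    groups: length-T list; tokens with the same value describe the same object."""
    feats = F.normalize(attn_maps.flatten(1), dim=-1)    # (T, H*W)
    sim = feats @ feats.t() / temperature                # (T, T)
    groups = torch.as_tensor(groups)
    eye = torch.eye(len(groups), dtype=torch.bool)
    losses = []
    for i in range(len(groups)):
        positives = (groups == groups[i]) & ~eye[i]
        others = ~eye[i]
        if positives.any():
            # Pull same-object maps together, push other objects' maps away.
            losses.append(-torch.log(
                sim[i, positives].exp().sum() / sim[i, others].exp().sum()))
    return torch.stack(losses).mean()

# Toy usage: tokens 0-1 describe object A, tokens 2-3 object B.
maps = torch.rand(4, 16, 16, requires_grad=True)
contrastive_attention_loss(maps, [0, 0, 1, 1]).backward()
```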
https://arxiv.org/abs/2501.09055
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval in both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
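For intuition on the page-level task, a minimal retrieval sketch: embed the question and every page, then rank pages by cosine similarity. The text encoder here is a stand-in; the visual retrievers evaluated in the paper are more capable.

```python
# Sketch: page-level retrieval by cosine similarity between a question
# embedding and per-page embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def rank_pages(question: str, page_texts: list[str], top_k: int = 3):
    q = encoder.encode(question, normalize_embeddings=True)
    pages = encoder.encode(page_texts, normalize_embeddings=True)
    scores = pages @ q                        # cosine similarity (normalized)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

pages = ["Table 3 reports ablation results ...", "Figure 2 shows the pipeline ..."]
print(rank_pages("Which page contains the ablation table?", pages))
```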
https://arxiv.org/abs/2501.08828
We describe the construction of a publicly available Yiddish OCR Corpus, and present and evaluate the open-source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN for glyph recognition. It attains a CER of 1.5% on our test corpus, far outperforming all other existing public models for Yiddish. We analyzed the full 660M-word Yiddish Book Center collection with Jochre 3 OCR, and the new OCR layer is searchable through the Yiddish Book Center OCR search engine.
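For reference, the character error rate (CER) quoted above is the Levenshtein distance between the OCR output and the reference transcription, divided by the reference length; a small sketch:

```python
# Sketch: character error rate (CER) from a plain dynamic-programming
# Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

print(cer("yidish bukh", "yidishe bukh"))  # one dropped character -> ~0.083
```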
https://arxiv.org/abs/2501.08442
This study explores the fine-tuning (FT) of the Open Pre-trained Transformer (OPT-125M) for grammatical acceptability tasks using the CoLA dataset. By comparing Vanilla Fine-Tuning (VFT), Pattern-Based Fine-Tuning (PBFT), and Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA), we demonstrate significant improvements in computational efficiency while maintaining high accuracy. Our experiments reveal that while VFT achieves the highest accuracy (81.2%), LoRA enhances FT by reducing memory usage and iteration time by more than 50%, and it increases accuracy in the PBFT case. Context Distillation (CD), though computationally efficient, underperformed with accuracy around 31%. Our findings contribute to democratizing access to large language models (LLMs) by reducing computational barriers.
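As an illustration of the LoRA setup on OPT-125M, a minimal sketch with the Hugging Face peft library; the hyperparameters are illustrative, not necessarily those used in the study.

```python
# Sketch: wrapping OPT-125M for CoLA-style acceptability classification with
# LoRA, so only the low-rank adapters (and the classification head) train.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-125m", num_labels=2)            # unacceptable / acceptable
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

lora_cfg = LoraConfig(
    task_type="SEQ_CLS",
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],           # OPT attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                 # a small fraction of 125M

batch = tokenizer("The cat sat on the mat.", return_tensors="pt")
print(model(**batch).logits)                       # two acceptability logits
```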
https://arxiv.org/abs/2501.07853
Indoor localization in challenging non-line-of-sight (NLOS) environments often yields mediocre accuracy with traditional approaches. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, especially floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers to indoor localization can be computationally intensive and exhibit limitations in accuracy. To address these challenges, in this work, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP) and enhances attention mechanisms by effectively capturing multi-variate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and the dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. The proposed tokenization method enables the Vanilla Transformer to achieve a 90th percentile positioning error of 0.388 m in a highly NLOS indoor factory, surpassing conventional tokenization methods. The L-SwiGLU ViT further reduces the error to 0.355 m, achieving an 8.51% improvement. Additionally, the proposed model outperforms a 14.1 times larger model with a 46.13% improvement, underscoring its computational efficiency.
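For readers unfamiliar with the gating, a minimal SwiGLU feed-forward block of the kind the L-SwiGLU Transformer builds on; this is the generic layer only, not the paper's full model.

```python
# Sketch: a SwiGLU (Swish-gated linear unit) feed-forward block, the drop-in
# replacement for the standard Transformer MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)   # gated branch
        self.up = nn.Linear(dim, hidden, bias=False)     # value branch
        self.down = nn.Linear(hidden, dim, bias=False)   # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W_gate) gates the value branch elementwise.
        return self.down(F.silu(self.gate(x)) * self.up(x))

tokens = torch.randn(2, 16, 64)       # (batch, sensor-snapshot tokens, dim)
print(SwiGLU(64, 128)(tokens).shape)  # torch.Size([2, 16, 64])
```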
https://arxiv.org/abs/2501.07774
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
https://arxiv.org/abs/2501.07730
This article explores the evolution of constructionism as an educational framework, tracing its relevance and transformation across three pivotal eras: the advent of personal computing, the networked society, and the current era of generative AI. Rooted in Seymour Papert's constructionist philosophy, this study examines how constructionist principles align with the expanding role of digital technology in personal and collective learning. We discuss the transformation of educational environments from hierarchical instructionism to constructionist models that emphasize learner autonomy and interactive, creative engagement. Central to this analysis is the concept of an expanded personality, wherein digital tools and AI integration fundamentally reshape individual self-perception and social interactions. By integrating constructionism into the paradigm of smart education, we propose it as a foundational approach to personalized and democratized learning. Our findings underscore constructionism's enduring relevance in navigating the complexities of technology-driven education, providing insights for educators and policymakers seeking to harness digital innovations to foster adaptive, student-centered learning experiences.
https://arxiv.org/abs/2501.07486
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
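As a sketch of the TrOCR fine-tuning route, the Hugging Face snippet below prepares a pretrained checkpoint for a new script; dataset plumbing and training arguments are omitted and the example is illustrative.

```python
# Sketch: preparing a pretrained TrOCR checkpoint for fine-tuning on line
# images of a new language. Data loading and the Trainer are omitted.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Generation/config glue required before training.
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

def encode_example(image_path: str, transcription: str) -> dict:
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    labels = processor.tokenizer(transcription, return_tensors="pt").input_ids
    return {"pixel_values": pixel_values.squeeze(0), "labels": labels.squeeze(0)}

# Each encoded example then feeds a standard seq2seq training loop, optionally
# mixed with machine-annotated and synthetic text-image pairs.
```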
https://arxiv.org/abs/2501.07300
The centralization of Large Language Models (LLMs) development has created significant barriers to AI advancement, limiting the democratization of these powerful technologies. This centralization, coupled with the scarcity of high-quality training data and mounting complexity of maintaining comprehensive expertise across rapidly expanding knowledge domains, poses critical challenges to the continued growth of LLMs. While solutions like Retrieval-Augmented Generation (RAG) offer potential remedies, maintaining up-to-date expert knowledge across diverse domains remains a significant challenge, particularly given the exponential growth of specialized information. This paper introduces LLMs Networks (LLM-Net), a blockchain-based framework that democratizes LLMs-as-a-Service through a decentralized network of specialized LLM providers. By leveraging collective computational resources and distributed domain expertise, LLM-Net incorporates fine-tuned expert models for various specific domains, ensuring sustained knowledge growth while maintaining service quality through collaborative prompting mechanisms. The framework's robust design includes blockchain technology for transparent transaction and performance validation, establishing an immutable record of service delivery. Our simulation, built on top of state-of-the-art LLMs such as Claude 3.5 Sonnet, Llama 3.1, Grok-2, and GPT-4o, validates the effectiveness of the reputation-based mechanism in maintaining service quality by selecting high-performing respondents (LLM providers), thereby demonstrating the potential of LLM-Net to sustain AI advancement through the integration of decentralized expertise and blockchain-based accountability.
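To illustrate the reputation mechanism in isolation, a toy sketch of reputation-weighted provider selection; the blockchain ledger and the collaborative prompting workflow are abstracted away, and the update rule is an assumption rather than the paper's.

```python
# Toy sketch: reputation-weighted selection of an LLM provider, with
# reputations updated from rated responses.
import random

class ReputationRegistry:
    def __init__(self, providers):
        self.scores = {p: 1.0 for p in providers}   # neutral prior reputation

    def select(self) -> str:
        # Sample providers in proportion to reputation (keeps some exploration).
        providers, weights = zip(*self.scores.items())
        return random.choices(providers, weights=weights, k=1)[0]

    def update(self, provider: str, rating: float, lr: float = 0.1) -> None:
        # Exponential moving average of quality ratings in [0, 1].
        self.scores[provider] = (1 - lr) * self.scores[provider] + lr * rating

registry = ReputationRegistry(["claude-3.5-sonnet", "llama-3.1", "grok-2", "gpt-4o"])
chosen = registry.select()
registry.update(chosen, rating=0.9)   # rating would come from response evaluation
print(registry.scores)
```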
https://arxiv.org/abs/2501.07288
TTS (Text-to-Speech) document readers from Microsoft, Adobe, Apple, and OpenAI are in service worldwide. They provide relatively good TTS results for general plain text, but sometimes skip content or produce unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. However, traditional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at this https URL.
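A hedged sketch of the OCR-to-speech flow: a seq2seq model rewrites a LaTeX formula from the OCR layer into speakable English before it is handed to a TTS engine. The task prefix is hypothetical and the base checkpoint below is only a stand-in for a model fine-tuned on (formula, spoken form) pairs, so it will not produce useful output on its own.

```python
# Sketch of the OCR -> seq2seq -> TTS flow for mathematical formulas.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # stand-in for a
# checkpoint fine-tuned on (LaTeX formula, spoken English) pairs.

def formula_to_speech_text(latex: str) -> str:
    # Hypothetical task prefix; a fine-tuned model would map e.g.
    # "\\frac{a}{b}" -> "a divided by b".
    inputs = tokenizer("rewrite LaTeX as spoken English: " + latex,
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

spoken = formula_to_speech_text(r"\frac{a}{b}")  # then passed to any TTS engine
```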
https://arxiv.org/abs/2501.07088
Despite its importance, studying economic behavior across diverse, non-WEIRD (Western, Educated, Industrialized, Rich, and Democratic) populations presents significant challenges. We address this issue by introducing a novel methodology that uses Large Language Models (LLMs) to create synthetic cultural agents (SCAs) representing these populations. We subject these SCAs to classic behavioral experiments, including the dictator and ultimatum games. Our results demonstrate substantial cross-cultural variability in experimental behavior. Notably, for populations with available data, SCAs' behaviors qualitatively resemble those of real human subjects. For unstudied populations, our method can generate novel, testable hypotheses about economic behavior. By integrating AI into experimental economics, this approach offers an effective and ethical method to pilot experiments and refine protocols for hard-to-reach populations. Our study provides a new tool for cross-cultural economic studies and demonstrates how LLMs can help experimental behavioral research.
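As an illustration of how a synthetic cultural agent can be elicited, a minimal dictator-game prompt; the wording and the client call are illustrative assumptions, not the study's protocol.

```python
# Sketch: eliciting a dictator-game allocation from a synthetic cultural agent.
# Assumes the `openai` Python client and an available chat model.
from openai import OpenAI

client = OpenAI()

def dictator_game_offer(culture_description: str, endowment: int = 100) -> str:
    system = (f"You are a member of the following community: {culture_description}. "
              "Answer as that person would, with a single number.")
    user = (f"You have been given {endowment} tokens. You may give any amount to "
            "an anonymous stranger and keep the rest. How many tokens do you give?")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content

print(dictator_game_offer("a small-scale horticulturalist community in the Amazon basin"))
```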
https://arxiv.org/abs/2501.06834
This paper explores the synergy between human cognition and Large Language Models (LLMs), highlighting how generative AI can drive personalized learning at scale. We discuss parallels between LLMs and human cognition, emphasizing both the promise and new perspectives on integrating AI systems into education. After examining challenges in aligning technology with pedagogy, we review AutoTutor, one of the earliest Intelligent Tutoring Systems (ITS), and detail its successes, limitations, and unfulfilled aspirations. We then introduce the Socratic Playground, a next-generation ITS that uses advanced transformer-based models to overcome AutoTutor's constraints and provide personalized, adaptive tutoring. To illustrate its evolving capabilities, we present a JSON-based tutoring prompt that systematically guides learner reflection while tracking misconceptions. Throughout, we underscore the importance of placing pedagogy at the forefront, ensuring that technology's power is harnessed to enhance teaching and learning rather than overshadow it.
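A hypothetical example of the kind of JSON tutoring prompt described, shown as a Python dict; the field names are illustrative, not the Socratic Playground's actual schema.

```python
# Hypothetical JSON-style tutoring prompt: it scripts Socratic guidance and
# tracks misconceptions across turns.
import json

tutoring_prompt = {
    "role": "socratic_tutor",
    "learning_objective": "Explain why heavier objects do not fall faster in a vacuum",
    "strategy": {
        "style": "guided_questioning",   # ask, don't tell
        "max_hint_level": 3,
        "reflection_prompt": "What evidence would change your mind?",
    },
    "misconception_tracking": [
        {"id": "mass_implies_speed",
         "description": "Believes heavier objects always fall faster",
         "status": "active"},
    ],
    "next_question": "If air resistance were removed, what would differ between the two drops?",
}

print(json.dumps(tutoring_prompt, indent=2))
```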
https://arxiv.org/abs/2501.06682
Creating end-to-end bioinformatics workflows requires diverse domain expertise, which poses challenges for both junior and senior researchers as it demands a deep understanding of both genomics concepts and computational techniques. While large language models (LLMs) provide some assistance, they often fall short in providing the nuanced guidance needed to execute complex bioinformatics tasks, and require expensive computing resources to achieve high performance. We thus propose a multi-agent system built on small language models, fine-tuned on bioinformatics data, and enhanced with retrieval augmented generation (RAG). Our system, BioAgents, enables local operation and personalization using proprietary data. We observe performance comparable to human experts on conceptual genomics tasks, and suggest next steps to enhance code generation capabilities.
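As a rough sketch of the multi-agent idea, a toy router that sends a question to either a conceptual or a code-oriented agent and prepends retrieved context; every component is a stand-in, not the BioAgents implementation.

```python
# Toy sketch: route a bioinformatics question to a specialised agent and build
# a retrieval-augmented prompt for a small fine-tuned language model.
def looks_like_code_task(question: str) -> bool:
    keywords = ("script", "pipeline", "command", "nextflow", "snakemake", "code")
    return any(word in question.lower() for word in keywords)

def build_prompt(question: str, knowledge_base: dict) -> str:
    # Toy retrieval: keep knowledge-base entries whose key appears in the question.
    context = [text for key, text in knowledge_base.items() if key in question.lower()]
    agent = "code-agent" if looks_like_code_task(question) else "concept-agent"
    return f"[{agent}] context: {' '.join(context)} question: {question}"

kb = {"vcf": "VCF files store genomic variants, one record per position."}
print(build_prompt("What does a VCF file contain?", kb))
# The resulting prompt would be answered locally by the selected small model.
```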
https://arxiv.org/abs/2501.06314
Foundation models for computational pathology have shown great promise for specimen-level tasks and are increasingly accessible to researchers. However, specimen-level models built on these foundation models remain largely unavailable, hindering their broader utility and impact. To address this gap, we developed SpinPath, a toolkit designed to democratize specimen-level deep learning by providing a zoo of pretrained specimen-level models, a Python-based inference engine, and a JavaScript-based inference platform. We demonstrate the utility of SpinPath in metastasis detection tasks across nine foundation models. SpinPath may foster reproducibility, simplify experimentation, and accelerate the adoption of specimen-level deep learning in computational pathology research.
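As a generic sketch of specimen-level inference on top of foundation-model patch embeddings: attention-pool the patches into one specimen vector and apply a specimen-level head. This illustrates the pattern such a toolkit packages; it does not reproduce SpinPath's API.

```python
# Generic sketch: patch embeddings from a pathology foundation model are pooled
# into a specimen-level prediction (e.g. metastasis vs. no metastasis).
import torch
import torch.nn as nn

class SpecimenClassifier(nn.Module):
    def __init__(self, feature_dim: int, num_classes: int = 2):
        super().__init__()
        self.attention = nn.Linear(feature_dim, 1)       # attention pooling
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, feature_dim) from a foundation model.
        weights = torch.softmax(self.attention(patch_features), dim=0)
        specimen_vector = (weights * patch_features).sum(dim=0)
        return self.head(specimen_vector)

patches = torch.randn(512, 1024)           # e.g. 512 patches, 1024-d embeddings
logits = SpecimenClassifier(1024)(patches)
print(logits.softmax(-1))                   # class probability estimate
```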
https://arxiv.org/abs/2501.05945
With the rapid pace of technological innovation, traditional methods of policy formation and legislating are becoming conspicuously anachronistic. The need for regulatory choices to be made to counter the deadening effect of regulatory lag is more important to developing markets and fostering growth than achieving one-off regulatory perfection. This article advances scholarship on innovation policy and the regulation of technological innovation in the European Union. It does so by considering what building an agile yet robust anticipatory governance regulatory culture involves. It systematically excavates a variety of tools and elements that are being put into use in inventive ways and argues that these need to be more cohesively and systemically integrated into the regulatory toolbox. Approaches covered include strategic foresight, the critical embrace of iterative policy development and regulatory learning in the face of uncertainty, and the embrace of bottom-up approaches to the co-creation of policy, such as Policy Labs, and the testing and regulatory learning through pilot regulation and experimentation. The growing use of regulatory sandboxes as an EU policy tool to boost innovation and navigate regulatory complexity, as seen in the EU AI Act, is also probed.
https://arxiv.org/abs/2501.05921
The rise of misinformation and fake news in online political discourse poses significant challenges to democratic processes and public engagement. While debunking efforts aim to counteract misinformation and foster fact-based dialogue, these discussions often involve language toxicity and emotional polarization. We examined over 86 million debunking tweets and more than 4 million Reddit debunking comments to investigate the relationship between language toxicity, pessimism, and social polarization in debunking efforts. Focusing on discussions of the 2016 and 2020 U.S. presidential elections and the QAnon conspiracy theory, our analysis reveals three key findings: (1) peripheral participants (1-degree users) play a disproportionate role in shaping toxic discourse, driven by lower community accountability and emotional expression; (2) platform mechanisms significantly influence polarization, with Twitter amplifying partisan differences and Reddit fostering higher overall toxicity due to its structured, community-driven interactions; and (3) a negative correlation exists between language toxicity and pessimism, with increased interaction reducing toxicity, especially on Reddit. We show that platform architecture affects the informational complexity of user interactions, with Twitter promoting concentrated, uniform discourse and Reddit encouraging diverse, complex communication. Our findings highlight the importance of user engagement patterns, platform dynamics, and emotional expressions in shaping polarization in debunking discourse. This study offers insights for policymakers and platform designers to mitigate harmful effects and promote healthier online discussions, with implications for understanding misinformation, hate speech, and political polarization in digital environments.
https://arxiv.org/abs/2501.06274
We introduce LLMQuoter, a lightweight, distillation-based model designed to enhance Retrieval Augmented Generation (RAG) by extracting the most relevant textual evidence for downstream reasoning tasks. Built on the LLaMA-3B architecture and fine-tuned with Low-Rank Adaptation (LoRA) on a 15,000-sample subset of HotpotQA, LLMQuoter adopts a "quote-first-then-answer" strategy, efficiently identifying key quotes before passing curated snippets to reasoning models. This workflow reduces cognitive overhead and outperforms full-context approaches like Retrieval-Augmented Fine-Tuning (RAFT), achieving over 20-point accuracy gains across both small and large language models. By leveraging knowledge distillation from a high-performing teacher model, LLMQuoter achieves competitive results in a resource-efficient fine-tuning setup. It democratizes advanced RAG capabilities, delivering significant performance improvements without requiring extensive model retraining. Our results highlight the potential of distilled quote-based reasoning to streamline complex workflows, offering a scalable and practical solution for researchers and practitioners alike.
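A minimal sketch of the quote-first-then-answer workflow; the quoter call below is a stand-in for LLMQuoter, and the prompts and client usage are illustrative.

```python
# Sketch: extract supporting quotes first, then answer from the quotes only.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    out = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content

def quote_then_answer(question: str, retrieved_context: str) -> str:
    quotes = ask("gpt-4o-mini",                    # stand-in for the quoter model
                 "Copy only the sentences needed to answer the question.\n"
                 f"Question: {question}\nContext:\n{retrieved_context}")
    return ask("gpt-4o",                           # downstream reasoning model
               "Answer the question using only these quotes.\n"
               f"Quotes:\n{quotes}\nQuestion: {question}")
```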
https://arxiv.org/abs/2501.05554
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously, (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
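As a sketch of how a training mix with a fixed non-English share might be constructed, following the 25-50% finding; the languages and the uniform split are illustrative.

```python
# Sketch: give English a fixed share of the training mix and split the
# remaining budget uniformly over the other training languages.
def build_language_mix(languages: list[str], non_english_share: float = 0.5) -> dict:
    others = [lang for lang in languages if lang != "en"]
    mix = {"en": 1.0 - non_english_share}
    mix.update({lang: non_english_share / len(others) for lang in others})
    return mix

languages = ["en", "de", "zh", "ar", "sw", "hi"]   # stand-in for ~100 languages
print(build_language_mix(languages))                # weights sum to 1.0
```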
https://arxiv.org/abs/2501.05122