Handwritten font generation is important for preserving cultural heritage and creating personalized designs. It adds an authentic and expressive touch to printed materials, making them visually appealing and establishing a stronger connection with the audience. This paper aims to design a framework for generating handwritten fonts in the Gujarati script, mimicking the variation of human handwriting. The proposed font generation model consists of a learning phase and a generation phase. In the learning phase, Gujarati scripts are analyzed, and rules for designing each character are formulated. This ruleset specifies how strokes are concatenated, ensuring visual consistency in the resulting glyphs. In the generation phase, the user provides a small subset of characters, and the system automatically generates the remaining character glyphs based on extracted strokes and learned rules, resulting in handwritten Gujarati fonts. The resulting character glyphs are converted into an OpenType font using the FontForge tool, making them compatible with any Gujarati editor. Both subjective and objective evaluations are conducted to assess the synthesized images and fonts. Subjective evaluation through user studies provides feedback on quality and visual appeal, achieving an overall accuracy of 84.84%; notably, eleven characters demonstrated a success ratio above 90%. Objective evaluation using an existing recognition system achieves an overall accuracy of 84.28% in OCR evaluation; notably, fifteen characters had a success ratio of 80% or higher.
https://arxiv.org/abs/2404.03277
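The stroke-concatenation idea in the abstract can be sketched as composing a glyph from a reusable stroke inventory plus per-character placement rules. The stroke names, point lists, and rule format below are purely illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of stroke-based glyph composition. Stroke shapes,
# names, and the rule table are invented for illustration.

def compose_glyph(rule, strokes):
    """Concatenate stroke point lists, shifting each stroke by its rule offset."""
    glyph = []
    for name, (dx, dy) in rule:
        glyph.extend((x + dx, y + dy) for x, y in strokes[name])
    return glyph

# Toy stroke inventory: each stroke is a list of (x, y) points.
strokes = {
    "vertical_bar": [(0, 0), (0, 1), (0, 2)],
    "top_curve": [(0, 0), (1, 1), (2, 0)],
}

# A "rule" places strokes at offsets to form one character glyph.
rule = [("top_curve", (0, 2)), ("vertical_bar", (1, 0))]
glyph = compose_glyph(rule, strokes)
```

In the actual system, strokes would be extracted from the user-provided character subset and the resulting outlines imported into FontForge to emit the OpenType font.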
Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This could present a significant obstacle for language community members and linguists to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe various tools and APIs that are currently available and how developers can easily add new models/functionality to the framework. Code is available at this https URL along with a live demo at https://cmulab.dev
https://arxiv.org/abs/2404.02408
Efforts on the research and development of OCR systems for low-resource languages are relatively new. Low-resource languages have little training data available for training machine translation or other systems. Even though a vast amount of text has been digitized and made available on the internet, this text is still in PDF and image formats, which are not instantly accessible. This paper discusses text recognition for two scripts, Bengali and Nepali, which have about 300 million and 40 million speakers respectively. In this study, a model was developed using encoder-decoder transformers, and its efficacy was assessed on a collection of optical text images, both handwritten and printed. The results signify that the suggested technique is competitive with current approaches and achieves high precision in recognizing Bengali and Nepali text. This study can pave the way for advanced and accessible linguistic research in South East Asia.
https://arxiv.org/abs/2404.02375
AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.
https://arxiv.org/abs/2404.01657
Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, catastrophic forgetting under continual pretraining (while pretraining from scratch is computationally expensive), and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at this https URL .
https://arxiv.org/abs/2404.00399
The Instruction-Driven Game Engine (IDGE) project aims to democratize game development by enabling a large language model (LLM) to follow free-form game rules and autonomously generate game-play processes. The IDGE allows users to create games by issuing simple natural language instructions, which significantly lowers the barrier for game development. We approach the learning process for IDGEs as a Next State Prediction task, wherein the model autoregressively predicts in-game states given player actions. It is a challenging task because the computation of in-game states must be precise; otherwise, slight errors could disrupt the game-play. To address this, we train the IDGE in a curriculum manner that progressively increases the model's exposure to complex scenarios. Our initial progress lies in developing an IDGE for Poker, a universally cherished card game. The engine we've designed not only supports a wide range of poker variants but also allows for high customization of rules through natural language inputs. Furthermore, it also favors rapid prototyping of new games from minimal samples, proposing an innovative paradigm in game development that relies on minimal prompt and data engineering. This work lays the groundwork for future advancements in instruction-driven game creation, potentially transforming how games are designed and played.
https://arxiv.org/abs/2404.00276
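The Next State Prediction framing above can be sketched as a pure function mapping (state, action) to the next state, rolled out autoregressively so each prediction conditions on the previous state. The toy "betting" rules below are invented for illustration and are not the paper's IDGE:

```python
# Minimal sketch of the Next State Prediction task. The game rules here
# are hypothetical; in the IDGE an LLM predicts the next state.

def next_state(state, action):
    """Apply one player action to the current in-game state."""
    state = dict(state)  # treat states as immutable snapshots
    if action == "bet":
        state["pot"] += 10
        state["turn"] += 1
    elif action == "fold":
        state["done"] = True
    return state

def rollout(state, actions):
    """Autoregressive rollout: each step conditions on the last state."""
    trajectory = [state]
    for a in actions:
        state = next_state(state, a)
        trajectory.append(state)
    return trajectory

traj = rollout({"pot": 0, "turn": 0, "done": False}, ["bet", "bet", "fold"])
```

The curriculum described in the abstract would correspond to training first on short, simple action sequences and progressively introducing longer rollouts and more complex rule variants.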
The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder, and an LLM for various visual prompts (points, bounding boxes, and free-form shapes) and language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.
https://arxiv.org/abs/2403.20271
Artificial Intelligence Generated Content (AIGC) technology has facilitated the creation of rumors containing misinformation, impacting societal, economic, and political ecosystems and challenging democracy. Current rumor detection efforts fall short by merely labeling potential misinformation (a classification task), which inadequately addresses the issue, and it is unrealistic to expect authoritative institutions to debunk every piece of information on social media. Our proposed comprehensive debunking process not only detects rumors but also provides explanatory generated content to refute the authenticity of the information. The Expert-Citizen Collective Wisdom (ECCW) module we designed ensures high-precision assessment of the credibility of information, and the retrieval module is responsible for retrieving relevant knowledge from a real-time updated debunking database based on information keywords. By using prompt engineering techniques, we feed the results and knowledge into an LLM (Large Language Model), achieving satisfactory discrimination and explanatory effects while eliminating the need for fine-tuning, saving computational costs, and contributing to debunking efforts.
https://arxiv.org/abs/2403.20204
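The prompt-engineering step described above amounts to assembling the ECCW credibility score and the retrieved debunking knowledge into a single LLM prompt. The template and field names below are hypothetical; the abstract does not specify the actual prompt format:

```python
# Hedged sketch of combining credibility assessment and retrieved evidence
# into one LLM prompt. Template wording is an assumption.

def build_debunk_prompt(claim, credibility, evidence):
    """Assemble the inputs the abstract describes into one prompt string."""
    evidence_block = "\n".join(f"- {e}" for e in evidence)
    return (
        f"Claim: {claim}\n"
        f"Crowd credibility assessment (ECCW): {credibility:.2f}\n"
        f"Retrieved debunking evidence:\n{evidence_block}\n"
        "Task: judge whether the claim is a rumor and write a short rebuttal."
    )

prompt = build_debunk_prompt(
    "Drinking hot water cures the flu.",
    0.12,
    ["No clinical evidence supports temperature-based flu cures."],
)
```

Because the discrimination and rebuttal both come from prompting a frozen LLM, no fine-tuning is required, which is the source of the computational savings the abstract claims.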
We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at this https URL; code to reproduce the baselines will be available shortly.
https://arxiv.org/abs/2403.20101
SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.
https://arxiv.org/abs/2403.17933
Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale dataset with 485K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from the raw, noisy content; answering questions from a cleaner, corrected version of the content; and answering questions from scanned images of newspaper pages. This, and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets, makes it quite a unique and useful resource.
https://arxiv.org/abs/2403.17859
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions to aid caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality across multiple critical aspects, such as helpfulness, OCR mention, key takeaways, and visual properties reference. Users can directly edit captions in SciCapenter, resubmit for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants' feedback further offers valuable design insights for future systems aiming to enhance caption writing.
https://arxiv.org/abs/2403.17784
In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
https://arxiv.org/abs/2403.17342
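The mention-based filtering idea above, keeping only the text that references a given figure, can be illustrated with a simple stand-in. The paper queries LLaMA for this; the regex filter below is a simplified sketch of the goal, not the actual method:

```python
import re

# Simplified stand-in for mention-based filtering: keep only paragraphs
# that explicitly reference a given figure. (The paper uses LLaMA queries.)

def paragraphs_for_figure(paragraphs, figure_id):
    """Return the paragraphs mentioning 'Figure N' or 'Fig. N'."""
    pattern = re.compile(rf"\b(?:Figure|Fig\.)\s*{figure_id}\b", re.IGNORECASE)
    return [p for p in paragraphs if pattern.search(p)]

paras = [
    "Figure 1 shows the model architecture.",
    "Results in Figure 2 confirm the trend.",
    "We train for 10 epochs.",
]
relevant = paragraphs_for_figure(paras, 2)
```

Filtering out text about other figures reduces the noise the abstract describes before the caption summarizer sees the input.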
Text remains a relevant form of representation for information. Text documents are created either on digital-native platforms or through the conversion of other media files such as images and speech. While digital-native text is invariably obtained through physical or virtual keyboards, technologies such as OCR and speech recognition are utilized to transform images and speech signals into text content. All of these text-generation mechanisms also introduce errors into the captured text. This project aims at analyzing the different kinds of errors that occur in text documents. The work employs two advanced deep neural network-based language models, namely BART and MarianMT, to rectify the anomalies present in the text. Transfer learning of these models with available datasets is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models can bring down the number of erroneous sentences by more than 20%, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).
https://arxiv.org/abs/2403.16655
Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
https://arxiv.org/abs/2403.16516
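The multi-segment scheme above can be sketched as splitting a long token sequence into fixed-size segments, with each segment carrying a short prefix of the previous one as context. The segment and prefix sizes below are illustrative assumptions; the abstract does not give ViTLP's exact scheme:

```python
# Hedged sketch of multi-segment processing for word-intensive documents.
# seg_len / prefix_len values are assumptions, not from the paper.

def split_segments(tokens, seg_len, prefix_len):
    """Split tokens into segments, each prefixed by context from the previous one."""
    segments, start = [], 0
    while start < len(tokens):
        ctx_start = max(0, start - prefix_len)  # overlap with prior segment
        segments.append(tokens[ctx_start:start + seg_len])
        start += seg_len
    return segments

segs = split_segments(list(range(10)), seg_len=4, prefix_len=2)
```

The overlap lets generation in each segment condition on the tail of the previous one, so documents of any length fit within the Transformer's context window.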
Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23\% and 13.5\% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
https://arxiv.org/abs/2403.16422
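Two of the metrics named above, OCR word F1 and edit distance, can be sketched directly. The exact definitions used by the benchmark may differ (e.g. in tokenization or normalization); this version scores word F1 via multiset overlap and uses the classic character-level Levenshtein distance:

```python
from collections import Counter

# Illustrative metric sketches; benchmark-specific details are assumptions.

def word_f1(pred, ref):
    """Word-level F1 via multiset overlap of predicted vs. reference words."""
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())  # & takes the minimum of each count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def edit_distance(a, b):
    """Classic Levenshtein distance over characters (one-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Here OCR is run on the generated image and its output is compared against the text the image was asked to render, so lower edit distance and higher F1 mean more faithful visual text.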
Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at this https URL.
https://arxiv.org/abs/2403.15724
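A "make synthetic look real" transformation of the kind described above can be illustrated with salt-and-pepper pixel noise on a binary glyph image. The paper's actual transformations are not specified in the abstract; this is one generic example of mimicking real-world artifacts:

```python
import random

# Illustrative augmentation only; not the paper's actual transformation set.

def salt_and_pepper(image, flip_prob, rng):
    """Randomly flip pixels in a 2D 0/1 image with probability flip_prob."""
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]

clean = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
noisy = salt_and_pepper(clean, flip_prob=0.1, rng=random.Random(0))
```

Applying such transforms during training narrows the gap between clean synthetic renders and scanned real-world records.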
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches resort to treating relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document, posing distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameters Efficient Fine Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results, surpassing TAG by 10.03% and 9.03% respectively on the dev and test set.
https://arxiv.org/abs/2403.14888
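The RHF (Relation-Head-Facts) paradigm can be read as a decoding order: first predict the relations present in the document, then the head entities for each relation, then the completed facts. The control flow below is a sketch of that reading; the three predictor callables stand in for LLM queries, and their interfaces are assumptions, not the paper's actual API:

```python
# Hedged sketch of an RHF-style extraction loop. The predictor functions
# are hypothetical stand-ins for LLM calls.

def rhf_extract(doc, predict_relations, predict_heads, predict_facts):
    """Relations first, then heads per relation, then tails completing each fact."""
    triplets = []
    for rel in predict_relations(doc):
        for head in predict_heads(doc, rel):
            for tail in predict_facts(doc, rel, head):
                triplets.append((head, rel, tail))
    return triplets

# Toy stand-ins for the three prediction steps.
doc = "Marie Curie was born in Warsaw."
rels = lambda d: ["born_in"]
heads = lambda d, r: ["Marie Curie"]
facts = lambda d, r, h: ["Warsaw"]
result = rhf_extract(doc, rels, heads, facts)
```

Because the relation set is predicted rather than enumerated in the prompt, this structure matches the abstract's point that AutoRE does not assume known relation options.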
General purpose AI, such as ChatGPT, seems to have lowered the barriers for the public to use AI and harness its power. However, the governance and development of AI still remain in the hands of a few, and the pace of development is accelerating without proper assessment of risks. As a first step towards democratic governance and risk assessment of AI, we introduce Particip-AI, a framework to gather current and future AI use cases and their harms and benefits from non-expert public. Our framework allows us to study more nuanced and detailed public opinions on AI through collecting use cases, surfacing diverse harms through risk assessment under alternate scenarios (i.e., developing and not developing a use case), and illuminating tensions over AI development through making a concluding choice on its development. To showcase the promise of our framework towards guiding democratic AI, we gather responses from 295 demographically diverse participants. We find that participants' responses emphasize applications for personal life and society, contrasting with most current AI development's business focus. This shows the value of surfacing diverse harms that are complementary to expert assessments. Furthermore, we found that perceived impact of not developing use cases predicted participants' judgements of whether AI use cases should be developed, and highlighted lay users' concerns of techno-solutionism. We conclude with a discussion on how frameworks like Particip-AI can further guide democratic AI governance and regulation.
https://arxiv.org/abs/2403.14791
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale, two-week-long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of a quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organizations, the sharing and liking of news content, tweeting about politics, and the liking of political content. We find that the treated users followed more news accounts and that users in the female-bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
https://arxiv.org/abs/2403.13362