SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.
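As a rough illustration of the latent layout described above (not the authors' code; channel counts, resolution, and modules are assumptions), the following sketch shows agents and lanes occupying separate channel groups of one rasterized latent map, so agent channels can be resampled while lane channels stay fixed for lane-conditioned generation:

```python
# Minimal sketch of the idea behind the RVAE latent: agents and lanes occupy
# separate channel groups of one rasterized latent map, so a diffusion model
# can regenerate agent channels conditioned on lane channels.
# All shapes and module choices here are illustrative assumptions.
import torch
import torch.nn as nn

LANE_CH, AGENT_CH = 4, 4          # hypothetical channel split
LATENT_HW = 16                    # hypothetical latent resolution

class ToyRVAEEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # one conv stem per entity type; outputs are concatenated channel-wise
        self.lane_enc = nn.Conv2d(1, LANE_CH, 3, padding=1)
        self.agent_enc = nn.Conv2d(1, AGENT_CH, 3, padding=1)

    def forward(self, lane_raster, agent_raster):
        z_lane = self.lane_enc(lane_raster)
        z_agent = self.agent_enc(agent_raster)
        return torch.cat([z_lane, z_agent], dim=1)   # (B, 8, H, W) latent map

encoder = ToyRVAEEncoder()
lane_raster = torch.rand(1, 1, LATENT_HW, LATENT_HW)
agent_raster = torch.rand(1, 1, LATENT_HW, LATENT_HW)
z = encoder(lane_raster, agent_raster)

# Lane-conditioned agent generation: keep lane channels fixed, resample agents.
z_lane, z_agent = z[:, :LANE_CH], z[:, LANE_CH:]
z_agent_new = torch.randn_like(z_agent)              # stand-in for a DiT sample
z_conditioned = torch.cat([z_lane, z_agent_new], dim=1)
print(z_conditioned.shape)                            # torch.Size([1, 8, 16, 16])
```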
https://arxiv.org/abs/2403.17933
Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale dataset with 485K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from a cleaner, corrected version of the content, and answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.
https://arxiv.org/abs/2403.17859
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions to aid caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality across multiple critical aspects, such as helpfulness, OCR mention, key takeaways, and visual properties reference. Users can directly edit captions in SciCapenter, resubmit for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants' feedback further offers valuable design insights for future systems aiming to enhance caption writing.
https://arxiv.org/abs/2403.17784
In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
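The mention-based filtering step can be sketched as follows; the regex, prompt wording, and the idea that the resulting prompt is then sent to LLaMA are illustrative assumptions rather than the authors' exact pipeline:

```python
# Hedged sketch of the filtering idea: keep only paragraphs that mention the
# target figure, then ask an instruction-tuned model (LLaMA in the paper) to
# distill the figure-specific content. The prompt and helpers are assumptions.
import re

def paragraphs_mentioning(figure_id: str, paragraphs: list[str]) -> list[str]:
    pattern = re.compile(rf"\b(Fig\.|Figure)\s*{figure_id}\b", re.IGNORECASE)
    return [p for p in paragraphs if pattern.search(p)]

def build_prompt(figure_id: str, relevant: list[str]) -> str:
    context = "\n\n".join(relevant)
    return (
        f"The following paragraphs mention Figure {figure_id}.\n\n{context}\n\n"
        f"Summarize only the information needed to caption Figure {figure_id}."
    )

paper_text = [
    "Figure 2 shows the overall architecture of our model.",
    "We train for 10 epochs with a learning rate of 1e-4.",
    "As shown in Fig. 2, the encoder precedes the decoder.",
]
relevant = paragraphs_mentioning("2", paper_text)
print(build_prompt("2", relevant))  # this prompt would then be sent to LLaMA
```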
https://arxiv.org/abs/2403.17342
Text remains a relevant form of information representation. Text documents are created either on digital-native platforms or through the conversion of other media files such as images and speech. While digital-native text is invariably obtained through physical or virtual keyboards, technologies such as OCR and speech recognition are utilized to transform images and speech signals into text content. All of these text-generation mechanisms also introduce errors into the captured text. This project aims at analyzing the different kinds of errors that occur in text documents. The work employs two advanced deep neural network-based language models, namely BART and MarianMT, to rectify the anomalies present in the text. Transfer learning of these models with the available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models can reduce the number of erroneous sentences by more than 20%, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).
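A minimal sketch of the transfer-learning setup, assuming a Hugging Face seq2seq checkpoint and placeholder hyperparameters (a real run needs the full dataset, batching, and many training steps):

```python
# Sketch of fine-tuning a seq2seq model such as BART on (noisy, corrected)
# sentence pairs. Model choice and learning rate are assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-base"          # MarianMT would be swapped in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

noisy = ["Thiss sentense has speling erors."]
clean = ["This sentence has spelling errors."]

inputs = tokenizer(noisy, return_tensors="pt", padding=True, truncation=True)
labels = tokenizer(clean, return_tensors="pt", padding=True, truncation=True)

outputs = model(**inputs, labels=labels["input_ids"])  # teacher-forced loss
outputs.loss.backward()
optimizer.step()

# After fine-tuning, correction is plain generation:
corrected_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(corrected_ids, skip_special_tokens=True))
```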
https://arxiv.org/abs/2403.16655
Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents with Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, enabling ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts in document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
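One way to picture the interleaved text-and-layout sequence is the toy construction below; the token format, coordinate quantization, and page size are assumptions, not ViTLP's actual scheme:

```python
# Illustrative sketch of interleaving text with layout: each word is followed
# by its quantized bounding-box coordinates, giving a single sequence that a
# generative model can emit. Token names and bin count are assumptions.
def interleave(words, boxes, bins=1000, page_w=612, page_h=792):
    seq = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        seq.append(word)
        seq += [f"<loc_{int(v * bins / s)}>"
                for v, s in ((x0, page_w), (y0, page_h), (x1, page_w), (y1, page_h))]
    return seq

words = ["Invoice", "Total:", "$42.00"]
boxes = [(72, 70, 160, 90), (72, 700, 130, 715), (140, 700, 200, 715)]
print(interleave(words, boxes))
```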
https://arxiv.org/abs/2403.16516
Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both the LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23\% and 13.5\% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
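The word-level OCR metrics named above can be sketched directly; exact matching rules in LenCom-Eval and MARIO-Eval may differ from this simplified version:

```python
# Sketch of word-level OCR precision/recall/F1 and edit distance between the
# OCR output of a generated image and the target text.
from collections import Counter

def word_prf1(ocr_text: str, target_text: str):
    ocr_words, tgt_words = Counter(ocr_text.split()), Counter(target_text.split())
    overlap = sum((ocr_words & tgt_words).values())
    p = overlap / max(sum(ocr_words.values()), 1)
    r = overlap / max(sum(tgt_words.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(word_prf1("OPEN 24 HOURS", "OPEN 24 HOURS DAILY"))
print(edit_distance("OPEN 24 HOURS", "OPEN 24 HOURS DAILY"))
```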
https://arxiv.org/abs/2403.16422
Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits a strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprise a significant portion of the academic literature and are the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at this https URL.
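A sketch of artifact-style transformations of the kind described, with assumed parameter values (skew range, blur radius, noise level) standing in for whatever the paper actually uses:

```python
# Sketch of degrading a clean synthetic render so it looks more like a scanned,
# real-world record: small skew, scanner blur, and sensor noise.
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image, seed: int = 0) -> Image.Image:
    rng = np.random.default_rng(seed)
    img = img.convert("L")
    img = img.rotate(rng.uniform(-2, 2), fillcolor=255)     # small skew
    img = img.filter(ImageFilter.GaussianBlur(radius=1.0))  # scanner blur
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0, 8, arr.shape)                      # sensor noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

clean = Image.new("L", (320, 64), color=255)  # stand-in for a rendered equation
noisy = degrade(clean)
noisy.save("degraded_sample.png")
```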
https://arxiv.org/abs/2403.15724
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches resort to treating relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document and pose distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameter-Efficient Fine-Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results and surpassing TAG by 10.03% and 9.03% on the dev and test sets, respectively.
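A configuration sketch of QLoRA-style parameter-efficient fine-tuning; the base checkpoint, LoRA rank, and target modules below are illustrative assumptions, not AutoRE's released settings:

```python
# Sketch of a QLoRA setup: 4-bit quantized base model plus trainable LoRA
# adapters. Checkpoint name and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(                 # 4-bit quantized base model
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto"
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)               # only LoRA adapters are trained
model.print_trainable_parameters()

# The RHF paradigm (relations -> heads -> facts) would then be realized as
# three successive prompts to this fine-tuned model.
```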
https://arxiv.org/abs/2403.14888
General-purpose AI, such as ChatGPT, seems to have lowered the barriers for the public to use AI and harness its power. However, the governance and development of AI still remain in the hands of a few, and the pace of development is accelerating without proper assessment of risks. As a first step towards democratic governance and risk assessment of AI, we introduce Particip-AI, a framework to gather current and future AI use cases and their harms and benefits from the non-expert public. Our framework allows us to study more nuanced and detailed public opinions on AI through collecting use cases, surfacing diverse harms through risk assessment under alternate scenarios (i.e., developing and not developing a use case), and illuminating tensions over AI development through a concluding choice on whether a use case should be developed. To showcase the promise of our framework towards guiding democratic AI, we gather responses from 295 demographically diverse participants. We find that participants' responses emphasize applications for personal life and society, contrasting with the business focus of most current AI development. This shows the value of surfacing diverse harms that are complementary to expert assessments. Furthermore, we found that the perceived impact of not developing a use case predicted participants' judgements of whether the use case should be developed, and highlighted lay users' concerns about techno-solutionism. We conclude with a discussion of how frameworks like Particip-AI can further guide democratic AI governance and regulation.
https://arxiv.org/abs/2403.14791
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified, quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale, two-week-long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of a quality news organization and an encouragement to follow its Twitter account. To further test differential effects by bot gender, treated users were randomly assigned to receive responses from bots presented as female or male. We examine whether our over-time intervention enhances users' following of news media organizations, their sharing and liking of news content, their tweeting about politics, and their liking of political content. We find that treated users followed more news accounts, and users in the female-bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
https://arxiv.org/abs/2403.13362
The increasing complexity of deep neural networks poses significant barriers to democratizing them to resource-limited edge devices. To address this challenge, split federated learning (SFL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while enabling parallel training among edge devices. However, although system optimization substantially influences the performance of SFL under resource-constrained systems, the problem remains largely uncharted. In this paper, we provide a convergence analysis of SFL that quantifies the impact of model splitting (MS) and client-side model aggregation (MA) on the learning performance, serving as a theoretical foundation. Then, we propose AdaptSFL, a novel resource-adaptive SFL framework, to expedite SFL under resource-constrained edge computing systems. Specifically, AdaptSFL adaptively controls client-side MA and MS to balance communication-computing latency and training convergence. Extensive simulations across various datasets validate that our proposed AdaptSFL framework takes considerably less time to achieve a target accuracy than benchmarks, demonstrating the effectiveness of the proposed strategies.
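The two knobs analyzed here, the model split (MS) point and client-side model aggregation (MA), can be sketched as follows; layer sizes, the cut point, and the aggregation scheme are assumptions, not the AdaptSFL implementation:

```python
# Illustrative sketch of split federated learning: clients run the model up to
# a cut layer, the server completes forward/backward, and client-side
# submodels are periodically averaged (FedAvg).
import copy
import torch
import torch.nn as nn

full_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64),
                           nn.ReLU(), nn.Linear(64, 10))
CUT = 2  # model splitting: first CUT modules run on the client, rest on the server

server_part = full_model[CUT:]
clients = [copy.deepcopy(full_model[:CUT]) for _ in range(3)]

def local_round(client_net, x, y):
    act = client_net(x)                      # client forward up to the cut layer
    act = act.detach().requires_grad_(True)  # "transmitted" smashed data
    loss = nn.functional.cross_entropy(server_part(act), y)
    loss.backward()                          # server backward
    act_grad = act.grad                      # gradient sent back to the client
    client_net(x).backward(act_grad)         # client backward from the cut
    return loss.item()

for client in clients:                       # one parallel round across clients
    local_round(client, torch.randn(8, 32), torch.randint(0, 10, (8,)))

# Client-side model aggregation (FedAvg over client submodels)
with torch.no_grad():
    for params in zip(*(c.parameters() for c in clients)):
        mean = torch.stack(params).mean(dim=0)
        for p in params:
            p.copy_(mean)
```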
https://arxiv.org/abs/2403.13101
Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at this https URL.
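A rough sketch of the H-Reducer idea as described: horizontally adjacent patches are merged by a convolution so the visual token sequence shrinks while row/column layout is preserved. The 1x4 kernel and feature sizes are assumptions:

```python
# Sketch of horizontal patch merging: a (1, 4) convolution shortens each row of
# the patch grid by 4x before the features are flattened into LLM tokens.
import torch
import torch.nn as nn

B, D, H, W = 1, 1024, 32, 32            # ViT patch grid reshaped to a 2-D map
h_reducer = nn.Conv2d(D, D, kernel_size=(1, 4), stride=(1, 4))

patch_grid = torch.randn(B, D, H, W)
reduced = h_reducer(patch_grid)          # (1, 1024, 32, 8): 4x fewer columns
visual_tokens = reduced.flatten(2).transpose(1, 2)   # (1, 256, 1024) for the LLM
print(visual_tokens.shape)
```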
https://arxiv.org/abs/2403.12895
Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied to the PaLI3-5B VLM by \citet{chen2023pali3}, while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by \citet{liu2023deplot}. We then propose constructing a dataset 20x larger than the original training set. To improve general reasoning capabilities and numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by \citet{hsieh2023distilling}. Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt \cite{chen2023program}, our model outperforms the recently introduced Gemini Ultra and GPT-4V.
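The trace-synthesis step can be pictured with a toy example; the template and table format are assumptions, not the paper's generation procedure:

```python
# Sketch of synthesizing a numerical reasoning trace from a chart's table
# representation, the kind of data used to improve reasoning and arithmetic.
def synthesize_trace(table: dict[str, float]) -> str:
    best = max(table, key=table.get)
    total = sum(table.values())
    share = 100 * table[best] / total
    return (f"The largest value is {table[best]} for '{best}'. "
            f"The sum over all categories is {total}, so '{best}' accounts for "
            f"{share:.1f}% of the total.")

chart_as_table = {"2019": 12.0, "2020": 18.0, "2021": 30.0}
print(synthesize_trace(chart_as_table))
```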
https://arxiv.org/abs/2403.12596
Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.
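A hedged sketch of the fusion idea: an embedding of OCR-extracted statistical parameters is combined with CNN image features through a learned gate. Dimensions and the gating form are assumptions, not the paper's exact design:

```python
# Sketch of gated fusion between pooled CNN image features and an embedding of
# OCR-parsed statistical (machine-parameter) features for defect classification.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, img_dim=512, stat_dim=16, hidden=128):
        super().__init__()
        self.stat_proj = nn.Sequential(nn.Linear(stat_dim, hidden), nn.ReLU())
        self.img_proj = nn.Linear(img_dim, hidden)
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        self.classifier = nn.Linear(hidden, 2)   # defect vs. normal

    def forward(self, img_feat, stat_feat):
        i, s = self.img_proj(img_feat), self.stat_proj(stat_feat)
        g = self.gate(torch.cat([i, s], dim=-1))  # how much to trust each modality
        fused = g * i + (1 - g) * s
        return self.classifier(fused)

model = GatedFusion()
img_feat = torch.randn(4, 512)   # pooled CNN backbone features
stat_feat = torch.randn(4, 16)   # e.g., machine parameters parsed by OCR
print(model(img_feat, stat_feat).shape)   # torch.Size([4, 2])
```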
https://arxiv.org/abs/2403.11536
The maintenance, archiving, and usage of design drawings in physical form is cumbersome across industries, especially over long periods. It is hard to extract information by simply scanning drawing sheets. Converting them to digital formats such as Computer-Aided Design (CAD), together with the needed knowledge extraction, can solve this problem. The conversion of these machine drawings to digital form is a crucial challenge that requires advanced techniques. This research proposes an innovative methodology utilizing Deep Learning methods. The approach employs object detection models, such as YOLOv7 and Faster R-CNN, to detect the physical drawing objects present in the images, followed by edge detection algorithms such as the Canny filter to extract and refine the identified lines from the drawing region, and curve detection techniques to detect circles. Ornaments (complex shapes) within the drawings are also extracted. To ensure comprehensive conversion, an Optical Character Recognition (OCR) tool is integrated to identify and extract the text elements from the drawings. The extracted data, which includes lines, shapes, and text, is consolidated and stored in a structured comma-separated values (.csv) file format. The accuracy and efficiency of the conversion are evaluated. Through this, conversion can be automated to help organizations enhance their productivity, facilitate seamless collaboration, and preserve valuable design information in an easily accessible digital format. Overall, this study contributes to the advancement of CAD conversion, providing accurate results from the translation process. Future research can focus on handling diverse drawing types and enhancing the accuracy of shape and line detection and extraction.
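A condensed sketch of the described pipeline under assumed parameters and placeholder filenames: Canny edge detection, Hough line and circle extraction, OCR of text labels, and export to a .csv file (the YOLOv7 / Faster R-CNN detection stage, which would first crop drawing regions, is omitted):

```python
# Sketch: edges -> lines and circles via Hough transforms -> OCR -> CSV export.
# "drawing_sheet.png" and all thresholds are placeholder assumptions.
import csv
import cv2
import numpy as np
import pytesseract

img = cv2.imread("drawing_sheet.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=5)
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1.2, minDist=20,
                           param1=100, param2=40, minRadius=5, maxRadius=100)
text = pytesseract.image_to_string(img)

with open("drawing_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["type", "data"])
    for (x1, y1, x2, y2) in (lines[:, 0] if lines is not None else []):
        writer.writerow(["line", f"{x1},{y1},{x2},{y2}"])
    for (cx, cy, r) in (circles[0] if circles is not None else []):
        writer.writerow(["circle", f"{cx},{cy},{r}"])
    writer.writerow(["text", text.strip()])
```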
https://arxiv.org/abs/2403.11291
Misinformation undermines public trust in science and democracy, particularly on social media where inaccuracies can spread rapidly. Experts and laypeople have been shown to be effective in correcting misinformation by manually identifying and explaining inaccuracies. Nevertheless, this approach is difficult to scale, a concern as technologies like large language models (LLMs) make misinformation easier to produce. LLMs also have versatile capabilities that could accelerate misinformation correction; however, they struggle due to a lack of recent information, a tendency to produce plausible but false content and references, and limitations in addressing multimodal information. To address these issues, we propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving contextual evidence and refutations, MUSE can provide accurate and trustworthy explanations and references. It also describes visuals and conducts multimodal searches for correcting multimodal misinformation. We recruit fact-checking and journalism experts to evaluate corrections to real social media posts across 13 dimensions, ranging from the factuality of the explanation to the relevance of references. The results demonstrate MUSE's ability to correct misinformation promptly after it appears on social media; overall, MUSE outperforms GPT-4 by 37% and even high-quality corrections from laypeople by 29%. This work underscores the potential of LLMs to combat real-world misinformation effectively and efficiently.
https://arxiv.org/abs/2403.11169
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model on scene recognition benchmarks and the paradigm of text block detection, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even Large Language Models (LLMs).
https://arxiv.org/abs/2403.10047
This paper introduces semantic features as a general conceptual framework for fully explainable neural network layers. A well-motivated proof-of-concept model for a relevant subproblem of MNIST consists of 4 such layers with a total of 4.8K learnable parameters. The model is easily interpretable, achieves human-level adversarial test accuracy with no form of adversarial training, requires little hyperparameter tuning, and can be quickly trained on a single CPU. The general nature of the technique bears promise for a paradigm shift towards radically democratised and truly generalizable white-box neural networks. The code is available at this https URL.
https://arxiv.org/abs/2403.09863
Multimodal large language models (MLLMs) have shown impressive reasoning abilities but are also more vulnerable to jailbreak attacks than their LLM predecessors. Although MLLMs remain capable of detecting unsafe responses, we observe that the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.
https://arxiv.org/abs/2403.09572