Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling the generation inference of LLMs to such long contexts incurs significant additional computation load, and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static-sized cache methods, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units along with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret, where the experimental results show that Locret outperforms recent competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated content -- Locret achieves over a 20x and 8x KV cache compression ratio compared to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality, and requiring little additional system optimization.
Large language models (LLMs) have made remarkable progress in supporting long-context understanding and processing tasks. However, scaling LLM generation inference to such long contexts incurs considerable computational overhead and requires substantial GPU memory to hold the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, hit memory bottlenecks as context length grows, while static-sized cache methods, such as eviction, suffer from inefficient policies. These limitations make it challenging to deploy LLMs on consumer-grade devices such as a single Nvidia 4090 GPU. To overcome them, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, enabling more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using only a small amount of data from standard long-context SFT datasets. During inference, low-importance cache units are evicted along with a chunked prefill pattern, greatly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret; the results show that Locret outperforms recent competitive approaches, including InfLLM, quantization, SirLLM, and MInference, in memory efficiency and generation quality: Locret achieves KV cache compression ratios of over 20x and 8x relative to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct, respectively. In addition, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework able to deploy Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without sacrificing generation quality and with little additional system optimization.
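To make the eviction step concrete, here is a minimal sketch of budget-constrained eviction interleaved with chunked prefill (assumptions: NumPy arrays stand in for KV cache units, and a plain scoring function stands in for the retaining heads; this is not the released Locret code):

```python
import numpy as np

def chunked_prefill_with_eviction(chunks, score_fn, budget):
    """Keep at most `budget` KV units while prefilling chunk by chunk.

    chunks   : list of arrays, each of shape (chunk_len, d), standing in for KV units
    score_fn : maps an array of KV units to per-unit importance scores
               (Locret's retaining heads play this role; here it is a stand-in)
    budget   : fixed cache size in KV units
    """
    cache, scores = np.empty((0, chunks[0].shape[1])), np.empty(0)
    for chunk in chunks:
        cache = np.concatenate([cache, chunk])
        scores = np.concatenate([scores, score_fn(chunk)])
        if len(cache) > budget:                  # evict the lowest-importance units
            keep = np.argsort(scores)[-budget:]
            keep.sort()                          # preserve positional order
            cache, scores = cache[keep], scores[keep]
    return cache

# toy usage: random "KV units", L2 norm as a stand-in importance score
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(128, 16)) for _ in range(8)]
compressed = chunked_prefill_with_eviction(chunks, lambda kv: np.linalg.norm(kv, axis=1), budget=256)
print(compressed.shape)  # (256, 16)
```

The role of the retaining heads is to make the score predictive of each unit's causal importance; the loop itself never holds more than the budget plus one chunk, which is what caps peak memory.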
https://arxiv.org/abs/2410.01805
The advancements in artificial intelligence in recent years, such as Large Language Models (LLMs), have fueled expectations for breakthroughs in genomic foundation models (GFMs). The code of nature, hidden in diverse genomes since the very beginning of life's evolution, holds immense potential for impacting humans and ecosystems through genome modeling. Recent breakthroughs in GFMs, such as Evo, have attracted significant investment and attention to genomic modeling, as they address long-standing challenges and transform in-silico genomic studies into automated, reliable, and efficient paradigms. In the context of this flourishing era of consecutive technological revolutions in genomics, GFM studies face two major challenges: the lack of GFM benchmarking tools and the absence of open-source software for diverse genomics. These challenges hinder the rapid evolution of GFMs and their wide application in tasks such as understanding and synthesizing genomes, problems that have persisted for decades. To address these challenges, we introduce GFMBench, a framework dedicated to GFM-oriented benchmarking. GFMBench standardizes benchmark suites and automates benchmarking for a wide range of open-source GFMs. It integrates millions of genomic sequences across hundreds of genomic tasks from four large-scale benchmarks, democratizing GFMs for a wide range of in-silico genomic applications. Additionally, GFMBench is released as open-source software, offering user-friendly interfaces and diverse tutorials, applicable for AutoBench and complex tasks like RNA design and structure prediction. To facilitate further advancements in genome modeling, we have launched a public leaderboard showcasing the benchmark performance derived from AutoBench. GFMBench represents a step toward standardizing GFM benchmarking and democratizing GFM applications.
Advances in artificial intelligence in recent years, such as large language models (LLMs), have raised expectations for breakthroughs in genomic foundation models (GFMs). The code of nature, hidden in diverse genomes since the very beginning of life's evolution, holds immense potential for impacting humans and ecosystems through genome modeling. Recent GFM breakthroughs, such as Evo, have attracted significant investment and attention to genomic modeling, as they address long-standing challenges and transform in-silico genomic studies into automated, reliable, and efficient paradigms. Amid this flourishing era of successive technological revolutions in genomics, GFM research faces two major challenges: the lack of GFM benchmarking tools and the absence of open-source software for diverse genomics. These challenges hinder the rapid evolution of GFMs and their broad application to tasks such as understanding and synthesizing genomes, problems that have persisted for decades. To address them, we introduce GFMBench, a framework dedicated to GFM-oriented benchmarking. GFMBench standardizes benchmark suites and automates benchmarking for a wide range of open-source GFMs. It integrates millions of genomic sequences across hundreds of genomic tasks from four large-scale benchmarks, democratizing GFMs for a wide range of in-silico genomic applications. In addition, GFMBench is released as open-source software with user-friendly interfaces and diverse tutorials, applicable to AutoBench as well as complex tasks such as RNA design and structure prediction. To facilitate further advances in genome modeling, we have launched a public leaderboard showcasing the benchmark performance derived from AutoBench. GFMBench represents a step toward standardizing GFM benchmarking and democratizing GFM applications.
https://arxiv.org/abs/2410.01784
Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.
Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but these models often struggle to balance fine-grained precision with high-level control. Methods such as ControlNet and T2I-Adapter excel at following sketches by seasoned artists, but tend to be overly rigid and reproduce the unintentional flaws in sketches drawn by novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, handle input more forgivingly but lack the precise control needed for detailed, professional use. To overcome these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by smoothly adapting to different levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to match the user's specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists, maintaining control over the final output while preserving the natural appearance of the image, as demonstrated on the MultiGen-20M dataset and a newly collected sketch dataset.
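The knob mechanism can be pictured as a single interpolation weight between the two conditioning pathways. The sketch below only illustrates that idea; the function name and the linear blend are assumptions, not the released KnobGen implementation:

```python
import numpy as np

def knob_blend(coarse_feat, fine_feat, knob: float):
    """Illustrative knob mechanism: interpolate between coarse-grained semantic
    conditioning and fine-grained sketch conditioning before it is injected
    into the diffusion backbone.

    knob = 0.0 -> trust only the coarse controller (novice sketch)
    knob = 1.0 -> trust only the fine controller   (professional sketch)
    """
    knob = float(np.clip(knob, 0.0, 1.0))
    return (1.0 - knob) * coarse_feat + knob * fine_feat

# toy usage with random feature maps of matching shape
rng = np.random.default_rng(0)
cgc, fgc = rng.normal(size=(64, 64, 4)), rng.normal(size=(64, 64, 4))
cond = knob_blend(cgc, fgc, knob=0.3)   # lean toward the coarse pathway
print(cond.shape)
```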
https://arxiv.org/abs/2410.01595
Game development is a highly technical practice that traditionally requires programming skills. This serves as a barrier to entry for would-be developers or those hoping to use games as part of their creative expression. While there have been prior game development tools focused on accessibility, they generally still require programming, or have major limitations in terms of the kinds of games they can make. In this paper we introduce Mechanic Maker, a tool for creating a wide-range of game mechanics without programming. It instead relies on a backend symbolic learning system to synthesize game mechanics from examples. We conducted a user study to evaluate the benefits of the tool for participants with a variety of programming and game development experience. Our results demonstrated that participants' ability to use the tool was unrelated to programming ability. We conclude that tools like ours could help democratize game development, making the practice accessible regardless of programming skills.
Game development is a highly technical practice that traditionally requires programming skills. This poses a barrier to entry for would-be developers or those hoping to use games as part of their creative expression. Although earlier game development tools have focused on accessibility, they generally still require programming or have major limitations in the kinds of games they can create. In this paper, we introduce Mechanic Maker, a tool for creating a wide range of game mechanics without programming. It instead relies on a backend symbolic learning system to synthesize game mechanics from examples. A user study with participants of varying programming and game development experience showed that participants' ability to use the tool was unrelated to programming ability. We conclude that tools like ours could help democratize game development, making the practice accessible regardless of programming skills.
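As a toy illustration of synthesizing a mechanic from examples (this is not Mechanic Maker's actual symbolic learner; the state representation is an assumption), one can infer the consistent state change that a demonstrated action produces:

```python
# Each example is (state, action, next_state) over a small attribute dict; the
# "mechanic" learned here is the attribute delta that consistently explains every
# demonstration of the same action.

def synthesize_mechanic(examples):
    deltas = None
    for state, action, next_state in examples:
        d = {k: next_state[k] - state[k] for k in state}
        deltas = d if deltas is None else {k: v for k, v in deltas.items() if d.get(k) == v}
    return deltas

examples = [
    ({"x": 0, "hp": 10}, "move_right", {"x": 1, "hp": 10}),
    ({"x": 4, "hp": 7},  "move_right", {"x": 5, "hp": 7}),
]
print(synthesize_mechanic(examples))  # {'x': 1, 'hp': 0} -> "move_right adds 1 to x"
```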
https://arxiv.org/abs/2410.01096
Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context, particularly as the complexity of the tasks increases.
Humans possess multimodal literacy, which lets them actively integrate information from multiple modalities to form reasoning. Faced with challenges such as lexical ambiguity in text, we supplement it with other modalities, such as thumbnail images or textbook illustrations. Can machines achieve a similar multimodal understanding capability? To answer this question, we present Understanding Pun with Image Explanations (UNPIE), a new benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Because of their intrinsic ambiguity, puns are the ideal subject for this evaluation. Our dataset contains 1,000 puns, each accompanied by an image that explains both meanings. Using these annotations, we pose three multimodal challenges that assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results show that, when given visual context, various Socratic Models and Visual-Language Models improve over text-only models, especially as task complexity increases.
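A hypothetical sketch of how one benchmark item and the three task framings might be organized (the field names and prompts below are assumptions for illustration, not the released UNPIE schema):

```python
from dataclasses import dataclass

@dataclass
class PunItem:
    # Hypothetical field names; the released UNPIE annotations may differ.
    pun_text: str       # sentence containing the pun
    image_path: str     # image illustrating both readings
    meaning_a: str      # first sense of the ambiguous word
    meaning_b: str      # second sense

def task_prompts(item: PunItem) -> dict:
    """The three challenges, phrased as prompts over (text, image) inputs."""
    return {
        "pun_grounding": f"Which word or span in this sentence is the pun? {item.pun_text}",
        "disambiguation": (f"Given the image, which meaning of the pun does it depict: "
                           f"(a) {item.meaning_a} or (b) {item.meaning_b}?"),
        "reconstruction": "Looking at the image, rewrite the caption so it contains the original pun.",
    }

item = PunItem("I used to be a banker but I lost interest.", "puns/0001.png",
               "curiosity about the job", "money earned on deposits")
print(task_prompts(item)["disambiguation"])
```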
https://arxiv.org/abs/2410.01023
Disclaimer: Samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an essential branch of toxic language, this type of language exacerbates conflicts and confrontations among Internet communities and detrimentally impacts disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly in detecting PCL due to its implicit toxicity traits like hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxic detection. Group detection results and fine-grained detection from PclGPT and other models reveal significant variations in the degree of bias in PCL towards different vulnerable groups, necessitating increased societal attention to protect them.
Disclaimer: samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an important branch of toxic language, it exacerbates conflict and confrontation within Internet communities and harms disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly at detecting PCL because of its implicit toxicity traits, such as hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxicity detection. Group-level detection results and fine-grained detection from PclGPT and other models reveal significant variation in the degree of PCL bias toward different vulnerable groups, calling for greater societal attention to protect them.
https://arxiv.org/abs/2410.00361
Autonomous racing demands safe control of vehicles at their physical limits for extended periods of time, providing insights into advanced vehicle safety systems which increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries with it a high barrier to entry. Physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop soft autonomous systems without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages: ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing to bring more people into the field and improve reproducibility: ACI provides researchers with a computer vision-friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables generation of depth, normal and semantic segmentation data for training computer vision models to use in perception systems; and ACMPC gives newcomers to the field a modular full-stack autonomous control solution, capable of controlling vehicles to build from. AARK aims to unify and democratise research into a field critical to providing safer roads and trusted autonomous systems.
Autonomous racing requires safely controlling vehicles at their physical limits for extended periods, offering insights into advanced vehicle safety systems that increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries a high barrier to entry: physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop autonomous systems in software without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages: ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing, bringing more people into the field and improving reproducibility: ACI gives researchers a computer-vision-friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables the generation of depth, normal, and semantic segmentation data for training the computer vision models used in perception systems; and ACMPC gives newcomers a modular, full-stack autonomous control solution, capable of controlling vehicles, to build from. AARK aims to unify and democratise research in a field critical to providing safer roads and trusted autonomous systems.
https://arxiv.org/abs/2410.00358
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building on the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, covering both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). We also introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions behind our final designs, offering valuable guidance for future research on MLLM development.
https://arxiv.org/abs/2409.20566
Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous studies on synthetic VLM data generation. The conventional approach to VLM data construction uses a mixture of captioning and OCR specialists, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability. Our code is available at this https URL.
Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multimodal alignment data have inspired a large body of research on synthetic VLM data generation. The conventional approach to VLM data construction uses a mixture of captioning and OCR specialists, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a carefully curated multimodal data construction pipeline that organizes the final generation output into a Python code format. The pipeline uses the VLM itself to extract cross-modal information via different prompts and filters the generated outputs again with a consistency filtering strategy. Experiments demonstrate the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also shows that the new code parsing ability of VLMs exhibits better cross-modal equivalence than the commonly used detailed captioning ability. Our code is available at this https URL.
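A minimal illustration of the two ideas in the pipeline, rendering extracted cross-modal information in a Python-code-style record and keeping an image only when two prompt-based extractions agree (the field names and the `Object`/`Image` record style are hypothetical, not the released W2C schema):

```python
def to_code_format(image_id: str, objects: list[dict]) -> str:
    """Serialize extracted objects as Python-code-style statements."""
    lines = [f'image = Image(id="{image_id}")']
    for i, obj in enumerate(objects):
        lines.append(
            f'obj_{i} = Object(name="{obj["name"]}", bbox={obj["bbox"]}, '
            f'attributes={obj["attributes"]})'
        )
    return "\n".join(lines)

def consistent(extraction_a: list[dict], extraction_b: list[dict], min_overlap=0.5) -> bool:
    """Simple consistency filter: two extractions must agree on enough object names."""
    names_a = {o["name"] for o in extraction_a}
    names_b = {o["name"] for o in extraction_b}
    if not names_a or not names_b:
        return False
    return len(names_a & names_b) / len(names_a | names_b) >= min_overlap

sample = [{"name": "dog", "bbox": [10, 20, 80, 90], "attributes": ["brown"]}]
print(to_code_format("000042", sample))
print(consistent(sample, sample))  # True: identical extractions trivially agree
```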
https://arxiv.org/abs/2409.20424
In this paper, we create benchmarks and assess the effectiveness of error correction methods for Japanese vouchers in OCR (Optical Character Recognition) systems. Correctly recognizing scanned voucher text, such as the company name on invoices, is essential for automated processing. However, perfect recognition is difficult due to noise such as stamps. Therefore, it is crucial to correctly rectify erroneous OCR results. Yet no publicly available OCR error correction benchmarks for Japanese exist, and such methods have not been adequately researched. In this study, we measured the text recognition accuracy of existing services on Japanese vouchers and developed a post-OCR correction benchmark. Then, we proposed simple baselines for error correction using language models and verified whether the proposed method could effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
In this paper, we create benchmarks for, and assess the effectiveness of, error correction methods for Japanese vouchers in OCR (optical character recognition) systems. Correctly recognizing scanned voucher text, such as the company name on an invoice, is essential for automated processing. However, perfect recognition is difficult because of noise such as stamps. It is therefore crucial to correctly rectify erroneous OCR results. Yet no publicly available OCR error correction benchmarks exist for Japanese, and such methods have not been adequately researched. In this study, we measured the text recognition accuracy of existing services on Japanese vouchers and developed a post-OCR correction benchmark. We then proposed simple baselines for error correction using language models and verified whether the proposed method can effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
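Evaluation in this setting typically comes down to character-level accuracy before and after correction. For reference, here is the standard Levenshtein-based character error rate (a generic metric implementation, not code from the paper):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Standard Levenshtein-based CER: character edits divided by reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution / match
            prev = cur
    return dp[n] / max(m, 1)

print(character_error_rate("株式会社サンプル", "株式会社サンブル"))  # one substitution -> 0.125
```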
https://arxiv.org/abs/2409.19948
OCR errors are common in digitised historical archives significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, getting sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data created with an LM and a character-level Markov corruption process can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and word error rate by 32% over the base LM and outperform models trained on real data. Key findings include: training on under-corrupted data is better than on over-corrupted data; non-uniform character-level corruption is better than uniform corruption; and more tokens per observation outperforms more observations for a fixed token budget. The outputs of this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th-century newspaper articles, and scrambledtext, a Python library for creating synthetic corrupted data.
OCR errors are common in digitised historical archives and significantly affect their usability and value. Generative language models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, obtaining enough training data to fine-tune such models can be challenging. This paper shows that fine-tuning a language model on synthetic data created with an LM and a character-level Markov corruption process can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and the word error rate by 32% over the base LM, and outperform models trained on real data. Key findings include: training on under-corrupted data is better than on over-corrupted data; non-uniform character-level corruption is better than uniform corruption; and, for a fixed token budget, more tokens per observation outperforms more observations. The outputs of this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th-century newspaper articles, and scrambledtext, a Python library for creating synthetic corrupted data.
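To illustrate what a character-level Markov corruption process looks like, here is a small sketch (the transition probabilities and confusion table are assumptions for illustration, not the paper's fitted values): each character's corruption state depends on the previous one, so errors cluster in bursts the way real OCR noise does, and substitutions are non-uniform across characters.

```python
import random

CONFUSIONS = {"e": "c", "l": "1", "o": "0", "s": "5", "t": "f", "m": "rn"}

def corrupt(text: str, p_enter: float = 0.05, p_stay: float = 0.4, seed: int = 0) -> str:
    """Two-state Markov corruption: CLEAN -> CORRUPT with p_enter,
    CORRUPT -> CORRUPT with p_stay, so corrupted characters come in runs."""
    rng = random.Random(seed)
    out, corrupting = [], False
    for ch in text:
        corrupting = rng.random() < (p_stay if corrupting else p_enter)
        if not corrupting:
            out.append(ch)
        else:
            op = rng.random()
            if op < 0.6:    # non-uniform substitution: confusable pairs first
                out.append(CONFUSIONS.get(ch.lower(), rng.choice("abcdefghijklmnopqrstuvwxyz")))
            elif op < 0.8:  # deletion
                pass
            else:           # insertion of stray punctuation
                out.append(ch + rng.choice(".,'-"))
    return "".join(out)

print(corrupt("The quick brown fox jumps over the lazy dog."))
```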
https://arxiv.org/abs/2409.19735
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the <see> token allows the model to first see--observing the regions of the image related to the input question--and then tell--providing articulated textual responses. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets. Leveraging the advanced text processing prowess of GPT-4, we develop the TVG (TableQA with Vision Grounding) dataset, which not only provides text-based Question Answering (QA) pairs but also incorporates precise vision grounding for these pairs. Our approach demonstrates substantial advancements in KIE performance, achieving state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code will also be made publicly available.
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional key information extraction (KIE) methods rely mainly on optical character recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches that bypass OCR typically produce plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet uses a unique <see> token to observe the relevant image regions, aided by a decoder that interprets the physical coordinates linked to this token. Placed at the start of the answer text, the <see> token lets the model first see, observing the image regions related to the input question, and then tell, providing an articulated textual response. To strengthen the model's seeing capability, we collect extensive structured table recognition datasets. Leveraging the advanced text processing capabilities of GPT-4, we build the TVG (TableQA with Vision Grounding) dataset, which provides not only text-based question answering (QA) pairs but also precise vision grounding for those pairs. Our approach achieves substantial advances in KIE performance, reaching state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code will also be made publicly available.
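The "see then tell" decoding order can be pictured by how a training target might be laid out: the <see> token and a quantized box come first, the textual answer second. The coordinate encoding below is an assumption for illustration, not STNet's exact tokenization:

```python
def build_target(answer: str, box: tuple[float, float, float, float], bins: int = 1000) -> str:
    """Prefix the answer with a <see> token and a quantized box, so the decoder
    must first emit where to look, then what to say."""
    x0, y0, x1, y1 = (round(v * (bins - 1)) for v in box)   # normalized coords -> integer bins
    return f"<see> <loc_{x0}> <loc_{y0}> <loc_{x1}> <loc_{y1}> {answer}"

# usage: an answer grounded in a box near the upper-left quarter of the page
print(build_target("ACME Corporation", (0.05, 0.02, 0.48, 0.10)))
```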
https://arxiv.org/abs/2409.19573
Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at this https URL.
Document content analysis is an important research area in computer vision. Despite significant advances in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction because of the diversity of document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content effectively from diverse documents and applies finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results show that MinerU consistently achieves high performance across various document types, significantly improving the quality and consistency of content extraction. The MinerU open-source project is available at this https URL.
https://arxiv.org/abs/2409.18839
Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.
Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials poses a challenge because it is difficult to search for, and within, videos. To address the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, covering 24 programming languages, 25 fonts, and more than 90 distinct themes, along with diverse layout changes and realistic user interactions. In addition, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of integrated development environment (IDE) element detection, color-to-black-and-white conversion, and optical character recognition (OCR). We hope our contributions encourage more research on coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.
https://arxiv.org/abs/2409.18556
Companies claim to "democratise" artificial intelligence (AI) when they donate AI open source software (OSS) to non-profit foundations or release AI models, among others, but what does this term mean and why do they do it? As the impact of AI on society and the economy grows, understanding the commercial incentives behind AI democratisation efforts is crucial for ensuring these efforts serve broader interests beyond commercial agendas. Towards this end, this study employs a mixed-methods approach to investigate commercial incentives for 43 AI OSS donations to the Linux Foundation. It makes contributions to both research and practice. It contributes a taxonomy of both individual and organisational social, economic, and technological incentives for AI democratisation. In particular, it highlights the role of democratising the governance and control rights of an OSS project (i.e., from one company to open governance) as a structural enabler for downstream goals, such as attracting external contributors, reducing development costs, and influencing industry standards, among others. Furthermore, OSS donations are often championed by individual developers within companies, highlighting the importance of the bottom-up incentives for AI democratisation. The taxonomy provides a framework and toolkit for discerning incentives for other AI democratisation efforts, such as the release of AI models. The paper concludes with a discussion of future research directions.
Companies claim to "democratise" artificial intelligence (AI) when they donate AI open source software (OSS) to non-profit foundations or release AI models, among other actions, but what does this term mean and why do they do it? As the impact of AI on society and the economy grows, understanding the commercial incentives behind AI democratisation efforts is crucial for ensuring these efforts serve broader interests beyond commercial agendas. To this end, this study uses a mixed-methods approach to investigate the commercial incentives behind 43 AI OSS donations to the Linux Foundation, contributing to both research and practice. It offers a taxonomy of individual and organisational social, economic, and technological incentives for AI democratisation. In particular, it highlights democratising the governance and control rights of an OSS project (that is, moving from a single company to open governance) as a structural enabler for downstream goals such as attracting external contributors, reducing development costs, and influencing industry standards. Furthermore, OSS donations are often championed by individual developers within companies, underscoring the importance of bottom-up incentives for AI democratisation. The taxonomy provides a framework and toolkit for discerning the incentives behind other AI democratisation efforts, such as the release of AI models. The paper concludes with a discussion of future research directions.
https://arxiv.org/abs/2409.17876
Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model's ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced on this https URL.
Generating images with accurately rendered text, especially in non-Latin languages, is a significant challenge for diffusion models. Existing approaches, such as integrating hint condition diagrams via auxiliary networks (e.g., ControlNet), have made progress toward addressing this issue. However, diffusion models often fall short on tasks that require controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce JoyType, a novel approach for multilingual visual text creation, designed to maintain the font style of text during the image generation process. Our method begins by assembling a training dataset, JoyType-1M, comprising 1 million data pairs; each pair includes an image, its description, and glyph instructions corresponding to the font style in the image. We then develop a text control network, Font ControlNet, tasked with extracting font style information to steer image generation. To further strengthen the model's ability to preserve font style, notably when generating small-font text, we incorporate a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, show that JoyType significantly outperforms existing state-of-the-art methods. In addition, JoyType can serve as a plugin, facilitating the creation of varied image styles in conjunction with other stable diffusion models on HuggingFace and CivitAI. Our project is open-sourced at this https URL.
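A multi-layer OCR-aware loss of the kind described can be sketched as comparing intermediate features of a frozen OCR backbone between the predicted and reference text renderings, then adding that term to the usual diffusion loss. This is a generic sketch under assumed shapes and a stand-in backbone, not the JoyType training code:

```python
import torch
import torch.nn.functional as F

def ocr_aware_loss(render_pred, render_gt, ocr_backbone, layers=(1, 2, 3), weight=0.1):
    """Compare feature maps from several layers of a frozen OCR backbone between
    the predicted and reference renderings; the sum is a perceptual-style penalty."""
    loss = 0.0
    feat_p, feat_g = render_pred, render_gt
    for i, block in enumerate(ocr_backbone):
        feat_p, feat_g = block(feat_p), block(feat_g)
        if i in layers:
            loss = loss + F.l1_loss(feat_p, feat_g)
    return weight * loss

# toy usage with a stand-in "OCR backbone" made of conv blocks
ocr_backbone = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Conv2d(c_in, c_out, 3, padding=1), torch.nn.ReLU())
    for c_in, c_out in [(3, 16), (16, 32), (32, 64), (64, 64)]
).eval()
with torch.no_grad():
    pred, gt = torch.rand(1, 3, 64, 256), torch.rand(1, 3, 64, 256)
    print(ocr_aware_loss(pred, gt, ocr_backbone).item())
```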
https://arxiv.org/abs/2409.17524
Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a 'view from nowhere' but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.
Recent debates have raised concerns that language models may favor certain viewpoints. But what if the solution is not to pursue a "view from nowhere" but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) that deliberate within customizable Structures, with Moderators overseeing the deliberation. Plurals is a generator of simulated social ensembles. It integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and lets users customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show that the simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at this https URL and will be continually updated.
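Schematically, a deliberation Structure is a pattern for who speaks when and who sees what. The sketch below illustrates one such chain pattern with a moderator; it is not the Plurals API, and `ask` is a stand-in for a real LLM call so the structure is runnable as-is:

```python
def ask(persona: str, prompt: str) -> str:
    # Placeholder for an LLM call conditioned on a persona.
    return f"[{persona}] view on: {prompt[:40]}..."

def chain_deliberation(personas, question, moderator="neutral summarizer"):
    """Agents respond in sequence, each seeing the transcript so far; a moderator
    then condenses the exchange -- one simple information-sharing pattern."""
    transcript = []
    for persona in personas:
        context = question + "\n" + "\n".join(transcript)
        transcript.append(ask(persona, context))
    summary = ask(moderator, "Summarize the deliberation:\n" + "\n".join(transcript))
    return transcript, summary

personas = ["retired teacher from Ohio", "urban nurse in her 30s", "rural small-business owner"]
transcript, summary = chain_deliberation(personas, "Should the town adopt curbside composting?")
print(summary)
```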
https://arxiv.org/abs/2409.17213
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered characters. Detection-based approaches have until now been largely discarded for HTR because reading characters separately is often challenging, and character-level annotation is difficult and expensive. We overcome these challenges thanks to three main insights: (i) synthetic pre-training with sufficiently diverse data enables learning reasonable character localization for any script; (ii) modern transformer-based detectors can jointly detect a large number of instances, and, if trained with an adequate masking strategy, leverage consistency between the different detections; (iii) once a pre-trained detection model with approximate character localization is available, it is possible to fine-tune it with line-level annotation on real data, even with a different alphabet. Our approach, dubbed DTLR, builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding, predicting character values one by one, while we treat a complete line in parallel. Remarkably, we demonstrate good performance on a large range of scripts, usually tackled with specialized approaches. In particular, we improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets. Our code and models are available at this https URL.
We propose a general detection-based approach to text line recognition, whether printed (OCR) or handwritten (HTR), for Latin, Chinese, or ciphered characters. Detection-based approaches have until now largely been discarded for HTR because reading characters separately is often challenging, and character-level annotation is difficult and expensive. We overcome these challenges thanks to three main insights: (i) synthetic pre-training with sufficiently diverse data enables learning reasonable character localization for any script; (ii) modern transformer-based detectors can jointly detect a large number of instances and, if trained with an adequate masking strategy, leverage consistency between the different detections; (iii) once a pre-trained detection model with approximate character localization is available, it can be fine-tuned with line-level annotation on real data, even with a different alphabet. Our approach, dubbed DTLR, builds on a completely different paradigm from state-of-the-art HTR methods, which rely on autoregressive decoding and predict character values one by one, whereas we process a complete line in parallel. Remarkably, we achieve good performance on a wide range of scripts that are usually handled with specialized approaches. In particular, we improve the state of the art for Chinese script recognition on the CASIA v2 dataset and for cipher recognition on the Borg and Copiale datasets. Our code and models are available at this https URL.
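The reading step of a detection-based recognizer reduces to ordering confident character detections along the line. A minimal sketch of that step (not the DTLR codebase; the tuple layout is an assumption):

```python
def detections_to_text(detections, min_score=0.5):
    """detections: iterable of (label, score, (x0, y0, x1, y1)) tuples for one line.
    Keep confident detections and read them left to right."""
    kept = [d for d in detections if d[1] >= min_score]
    kept.sort(key=lambda d: (d[2][0] + d[2][2]) / 2)   # order by box center x
    return "".join(label for label, _, _ in kept)

sample = [("l", 0.90, (30, 4, 38, 20)), ("H", 0.95, (2, 2, 14, 22)),
          ("e", 0.88, (16, 8, 26, 20)), ("o", 0.91, (52, 8, 62, 20)),
          ("l", 0.87, (40, 4, 48, 20)), ("x", 0.20, (70, 10, 74, 18))]
print(detections_to_text(sample))  # "Hello" -- the low-confidence detection is dropped
```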
https://arxiv.org/abs/2409.17095
This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.
This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Experiments based on two open-access large language models (Llama2 and Mistral) demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve model performance on noisy text.
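The identify-then-neutralise idea can be sketched in a few lines: rank hidden units by how differently they activate on clean versus OCR-noised inputs, then silence the most sensitive ones with a forward hook. A toy linear layer stands in for a Transformer FFN here; this illustrates the general recipe, not the paper's exact procedure:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(32, 64)                       # stand-in for one FFN layer
clean, noisy = torch.rand(100, 32), torch.rand(100, 32) + 0.3

with torch.no_grad():
    gap = (layer(noisy).mean(0) - layer(clean).mean(0)).abs()
sensitive = torch.topk(gap, k=8).indices              # the 8 most "OCR-sensitive" units

def neutralise(module, inputs, output):
    output[:, sensitive] = 0.0                         # zero them at inference time
    return output

handle = layer.register_forward_hook(neutralise)
with torch.no_grad():
    print(layer(noisy)[:, sensitive].abs().sum())      # tensor(0.) -> neurons are off
handle.remove()
```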
https://arxiv.org/abs/2409.16934
We delve into the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. We begin by advocating metrics such as W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, to improve model evaluation, and then introduce RotAvat. This technique aims to enhance these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat's effectiveness in addressing the limitations of existing approaches.
We delve into the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. We begin by advocating metrics such as W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, to improve model evaluation, and then introduce RotAvat. This technique aims to enhance these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat's effectiveness in addressing the limitations of existing approaches.
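For reference, a world-frame MPJPE of the kind advocated here simply skips any realignment of the prediction before measuring joint error (a minimal sketch of the metric, not the authors' evaluation code):

```python
import numpy as np

def w_mpjpe(pred, gt):
    """World-frame MPJPE: mean Euclidean joint error computed in the camera/world
    frame, with no Procrustes (or root) realignment of the prediction."""
    pred, gt = np.asarray(pred), np.asarray(gt)        # shape (n_joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))
pred = gt + np.array([50.0, 0.0, 0.0])                 # every joint offset by 5 cm
print(w_mpjpe(pred, gt))                               # 50.0 -- the offset is not aligned away
```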
https://arxiv.org/abs/2409.16861