This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required to enable machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. By identifying MVD, the framework optimizes efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficient practices in IoT applications, such as sensor overprovisioning, overprecision, and oversampling of signals, and proposes scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that performance can be maintained at up to 95% with sample rates reduced by 75% and bit depths and clip lengths reduced by 50%, which translates into substantial cost and resource reductions. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including its potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing, improving access to and multiplying the benefits of data-driven insights.
This paper introduces an approach called the Pareto Data Framework for identifying and selecting the Minimum Viable Data (MVD) needed to enable machine learning applications on constrained platforms such as embedded systems, mobile devices, and IoT devices. We demonstrate that strategic data reduction can maintain high performance while significantly lowering bandwidth, energy, computation, and storage costs. The framework optimizes for Minimum Viable Data in constrained environments without sacrificing performance. It addresses common inefficient practices in IoT applications, such as sensor overprovisioning, overprecision, and oversampling of signals, and proposes scalable best practices for sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation; the results show that performance can be maintained at up to 95% with sample rates reduced by 75% and bit depths and clip lengths reduced by 50%, which translates into substantial cost and resource savings. These findings have implications for the design and development of constrained systems. The paper also discusses the broader implications of the framework, including its potential to promote the adoption of advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing, improving access to data-driven insights and multiplying their benefits.
https://arxiv.org/abs/2409.12112
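As a concrete illustration of the fidelity reductions described in the Pareto Data Framework abstract above, the sketch below downsamples, re-quantizes, and truncates an audio clip with NumPy. The factors mirror the reported experiment (75% sample-rate reduction, 50% bit-depth and clip-length reduction), but the function names and the naive decimation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def reduce_fidelity(audio: np.ndarray, sr: int,
                    rate_factor: int = 4,        # 75% sample-rate reduction
                    target_bits: int = 8,        # e.g. 16-bit -> 8-bit
                    keep_fraction: float = 0.5   # 50% clip-length reduction
                    ) -> tuple[np.ndarray, int]:
    """Simulate a lower-fidelity sensor: truncate, downsample, and quantize."""
    # Truncate the clip to its leading fraction.
    clipped = audio[: int(len(audio) * keep_fraction)]
    # Naive downsampling by decimation (a real pipeline would apply an
    # anti-aliasing filter first).
    downsampled = clipped[::rate_factor]
    # Re-quantize the waveform (assumed float in [-1, 1]) to fewer bits.
    levels = 2 ** target_bits
    quantized = np.round((downsampled + 1.0) / 2.0 * (levels - 1))
    quantized = quantized / (levels - 1) * 2.0 - 1.0
    return quantized, sr // rate_factor

# Example: a 1-second 44.1 kHz tone reduced before feature extraction.
sr = 44_100
t = np.linspace(0.0, 1.0, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
reduced, new_sr = reduce_fidelity(tone, sr)
print(reduced.shape, new_sr)  # roughly (5513,) at 11025 Hz
```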
In the fast-evolving field of Generative AI, platforms like MidJourney, DALL-E, and Stable Diffusion have transformed Text-to-Image (T2I) generation. However, despite their impressive ability to create high-quality images, they often struggle to generate accurate text within these images. Theoretically, if we could achieve accurate text generation in AI images in a "zero-shot" manner, it would not only make AI-generated images more meaningful but also democratize the graphic design industry. The first step towards this goal is to create a robust scoring matrix for evaluating text accuracy in AI-generated images. Although there are existing benchmarking methods like CLIP SCORE and T2I-CompBench++, there is still a gap in systematically evaluating text and typography in AI-generated images, especially with diffusion-based methods. In this paper, we introduce a novel evaluation matrix designed explicitly for quantifying the performance of text and typography generation within AI-generated images. We use a letter-by-letter matching strategy to compute exact matching scores between the reference text and the AI-generated text. Our scoring approach accounts for multiple redundancies, such as word repetition, case sensitivity, word mixing, and the irregular incorporation of letters. Moreover, we develop a novel method, termed brevity adjustment, to handle excess text. In addition, we conduct a quantitative analysis of the errors that arise for frequently used and less frequently used words. The project page is available at: this https URL.
In the fast-evolving field of generative AI, platforms such as MidJourney, DALL-E, and Stable Diffusion have transformed Text-to-Image (T2I) generation. However, despite their impressive ability to create high-quality images, they often struggle to generate accurate text within those images. In theory, if we could achieve accurate text generation in AI images in a "zero-shot" manner, it would not only make AI-generated images more meaningful but also democratize the graphic design industry. The first step toward this goal is to create a robust scoring matrix for evaluating the accuracy of text in AI-generated images. Although benchmarking methods such as CLIP SCORE and T2I-CompBench++ already exist, there is still a gap in systematically evaluating text and typography in AI-generated images, especially for diffusion-based methods. In this paper, we introduce an evaluation matrix designed specifically to quantify the performance of text and typography generation in AI-generated images. We use a letter-by-letter matching strategy to compute exact matching scores between the reference text and the AI-generated text. Our scoring approach accounts for multiple redundancies such as word repetition, case sensitivity, word mixing, and the irregular incorporation of letters. In addition, we develop a new method, called brevity adjustment, to handle excess text. We also provide a quantitative analysis of the errors that arise for frequently used and less frequently used words. The project page is available at: this https URL.
https://arxiv.org/abs/2409.11874
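A minimal sketch of a letter-by-letter matching score of the kind the abstract above describes, with a simple length penalty standing in for the paper's brevity adjustment; the scoring rules and the name `letter_match_score` are assumptions, not the authors' implementation.

```python
def letter_match_score(reference: str, generated: str) -> float:
    """Fraction of reference letters reproduced, penalizing excess text."""
    ref = reference.lower().replace(" ", "")
    gen = generated.lower().replace(" ", "")   # case-insensitive comparison
    # Position-wise letter matching against the reference.
    matches = sum(1 for r, g in zip(ref, gen) if r == g)
    exact = matches / len(ref) if ref else 0.0
    # Brevity-style adjustment: scale the score down when the generated
    # text is much longer than the reference (spurious extra text).
    if gen and len(gen) > len(ref):
        exact *= len(ref) / len(gen)
    return exact

print(letter_match_score("OPEN SOON", "OPEN SO0N"))   # one wrong letter -> 0.875
print(letter_match_score("SALE", "SALE SALE SALE"))   # repeated words penalized
```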
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: this https URL.
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Notably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we conduct a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that improves both training efficiency and multimodal reasoning ability. In addition, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we carefully curate and provide detailed information about our multimodal pretraining and supervised fine-tuning datasets. Our findings show that dataset quality and task diversity matter more than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel at vision-language tasks while maintaining, and even improving, text-only performance relative to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, together with a substantial amount of multimodal math and reasoning data, which enhances math and coding capabilities across modalities. To advance research in the field, we release the model weights and will open-source the code for the community: this https URL.
https://arxiv.org/abs/2409.11402
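The 1-D tile-tagging idea in the NVLM abstract above can be pictured as inserting a text tag before each tile's token sequence. The sketch below shows one plausible way to interleave such tags; the tag strings, token ids, and function names are assumptions for illustration, not NVLM's released code.

```python
from typing import List

def tag_tiles(tile_tokens: List[List[int]], tag_token_ids: List[int]) -> List[int]:
    """Flatten per-tile token sequences, prefixing each tile with a 1-D tag.

    tile_tokens: visual token ids for each tile (a global thumbnail first,
                 then the dynamically cropped high-resolution tiles in order).
    tag_token_ids: ids of text tags such as "<tile_global>", "<tile_1>", ...
    """
    assert len(tile_tokens) == len(tag_token_ids)
    sequence: List[int] = []
    for tag_id, tokens in zip(tag_token_ids, tile_tokens):
        sequence.append(tag_id)   # the 1-D tag tells the LLM which tile follows
        sequence.extend(tokens)   # then that tile's visual tokens
    return sequence

# Toy example: a global thumbnail plus two high-resolution tiles.
tiles = [[101, 102], [201, 202, 203], [301]]
tags = [9000, 9001, 9002]         # ids standing in for the tag tokens
print(tag_tiles(tiles, tags))
```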
As humanoid robots transition from labs to real-world environments, it is essential to democratize robot control for non-expert users. Recent human-robot imitation algorithms focus on following a reference human motion with high precision, but they are susceptible to the quality of the reference motion and require the human operator to simplify its movements to match the robot's capabilities. Instead, we consider that the robot should understand and adapt the reference motion to its own abilities, facilitating the operator's task. For that, we introduce a deep-learning model that anticipates the robot's performance when imitating a given reference. Then, our system can generate multiple references given a high-level task command, assign a score to each of them, and select the best reference to achieve the desired robot behavior. Our Self-AWare model (SAW) ranks potential robot behaviors based on various criteria, such as fall likelihood, adherence to the reference motion, and smoothness. We integrate advanced motion generation, robot control, and SAW in one unique system, ensuring optimal robot behavior for any task command. For instance, SAW can anticipate falls with 99.29% accuracy. For more information check our project page: this https URL
As humanoid robots move from labs into real-world environments, it is essential to democratize robot control for non-expert users. Recent human-robot imitation algorithms focus on following a reference human motion with high precision, but they are sensitive to the quality of that reference and require the human operator to simplify their movements to match the robot's capabilities. Instead, we argue that the robot should understand and adapt the reference motion to its own abilities, easing the operator's task. To that end, we introduce a deep-learning model that anticipates the robot's performance when imitating a given reference. Our system can then generate multiple references from a high-level task command, assign a score to each, and select the best reference to achieve the desired robot behavior. Our Self-AWare model (SAW) ranks potential robot behaviors according to various criteria, such as fall likelihood, adherence to the reference motion, and smoothness. We integrate motion generation, robot control, and SAW into a single system, ensuring optimal robot behavior for any task command. For example, SAW can anticipate falls with 99.29% accuracy. For more information, see our project page: this https URL
https://arxiv.org/abs/2409.10308
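The ranking step in the SAW abstract above can be sketched as a weighted score over the criteria it names (fall likelihood, adherence to the reference, smoothness). The weights, field names, and numbers below are placeholders for illustration, not values from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateReference:
    name: str
    fall_probability: float   # predicted by the self-aware model, in [0, 1]
    adherence: float          # similarity to the reference motion, in [0, 1]
    smoothness: float         # e.g. inverse jerk, normalized to [0, 1]

def saw_score(c: CandidateReference,
              w_fall: float = 2.0, w_adh: float = 1.0, w_smooth: float = 0.5) -> float:
    """Higher is better: penalize likely falls, reward adherence and smoothness."""
    return -w_fall * c.fall_probability + w_adh * c.adherence + w_smooth * c.smoothness

def select_best(candidates: List[CandidateReference]) -> CandidateReference:
    return max(candidates, key=saw_score)

refs = [
    CandidateReference("wave_fast", fall_probability=0.40, adherence=0.95, smoothness=0.6),
    CandidateReference("wave_slow", fall_probability=0.02, adherence=0.80, smoothness=0.9),
]
print(select_best(refs).name)   # "wave_slow": safer despite lower adherence
```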
The dawn of Foundation Models has, on the one hand, revolutionised a wide range of research problems and, on the other hand, democratised the access and use of AI-based tools by the general public. We even observe an incursion of these models into disciplines related to human psychology, such as the Affective Computing domain, suggesting their emerging affective capabilities. In this work, we aim to raise awareness of the power of Foundation Models in the field of Affective Computing by synthetically generating and analysing multimodal affective data, focusing on vision, linguistics, and speech (acoustics). We also discuss some fundamental problems, such as ethical issues and regulatory aspects, related to the use of Foundation Models in this research area.
The dawn of Foundation Models has, on the one hand, thoroughly transformed a wide range of research problems and, on the other hand, democratised public access to and use of AI-based tools. We even observe these models making inroads into disciplines related to human psychology, such as the Affective Computing domain, hinting at their emerging affective capabilities. In this work, we aim to raise awareness of the power of Foundation Models in Affective Computing by synthetically generating and analysing multimodal affective data, focusing on vision, linguistics, and speech (acoustics). We also discuss some fundamental issues, such as ethical and regulatory aspects, related to the use of Foundation Models in this research area.
https://arxiv.org/abs/2409.08907
Audio and speech coding lack unified evaluation and open-source testing. Many candidate systems have been evaluated on proprietary, non-reproducible, or small data, and machine learning-based codecs are often tested on datasets with distributions similar to their training data, which makes comparisons with digital signal processing-based codecs, which usually work well with unseen data, unfair. This paper presents a full-band audio and speech coding quality benchmark with more variable content types, including traditional open test vectors. An example use case of audio coding quality assessment is presented with open-source Opus, 3GPP's EVS, and the recent ETSI LC3 and LC3+ used in Bluetooth LE Audio profiles. In addition, quality variations of emotional speech encoding at 16 kbps are shown. The proposed open-source benchmark contributes to the democratization of audio and speech coding and is available at this https URL.
Audio and speech coding lack unified evaluation and open-source testing. Many candidate systems have been evaluated on proprietary, non-reproducible, or small datasets, and machine-learning-based codecs are often tested on data drawn from distributions similar to their training data, which makes comparisons with digital-signal-processing-based codecs, which usually handle unseen data well, unfair. This paper presents a full-band audio and speech coding quality benchmark with more varied content types, including traditional open test vectors. An example use case of audio coding quality assessment is given with the open-source Opus, 3GPP's EVS, and the recent ETSI LC3 and LC3+ used in Bluetooth LE Audio profiles. In addition, quality variations of emotional speech encoding at 16 kbps are shown. The proposed open-source benchmark contributes to the democratization of audio and speech coding and is available at this https URL.
https://arxiv.org/abs/2409.08374
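The benchmark use case above can be organized as a grid of codecs, bitrates, and content types. The loop below only shows that structure; `encode_decode_score` is a placeholder (a real run would invoke the Opus/EVS/LC3 encoders and an objective metric such as ViSQOL or POLQA), and the numbers it returns are dummies.

```python
from itertools import product
from statistics import mean

codecs = ["Opus", "EVS", "LC3", "LC3plus"]
bitrates_kbps = [16, 24, 32]
content = ["clean_speech", "emotional_speech", "music", "mixed"]

def encode_decode_score(codec: str, kbps: int, item: str) -> float:
    """Placeholder for: encode the item, decode it, score it against the reference."""
    return 3.0 + 0.01 * kbps - 0.1 * codecs.index(codec)   # dummy numbers only

results = {
    (c, b): mean(encode_decode_score(c, b, item) for item in content)
    for c, b in product(codecs, bitrates_kbps)
}
for (codec, kbps), score in sorted(results.items()):
    print(f"{codec:8s} @ {kbps:2d} kbps: {score:.2f}")
```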
Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the longstanding challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they resonate with the "Interests, Ideologies, and Identity" (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method employs large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to analyze transcripts from recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework provides individual citizens with an independent tool to evaluate presidential debate performances, which enhances democratic engagement and reduces reliance on potentially biased media interpretations and institutional influence, thereby strengthening the foundation of informed civic participation.
Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the long-standing challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they resonate with the "Interests, Ideologies, and Identity" (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method uses large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to transcripts of recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework gives individual citizens an independent tool for evaluating presidential debate performances, which enhances democratic engagement, reduces reliance on potentially biased media interpretations and institutional influence, and thereby strengthens the foundation of informed civic participation.
https://arxiv.org/abs/2409.08147
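One way to picture the 3P-by-3I scoring described above is as an alignment matrix filled in by an LLM judge and then aggregated per audience group. The weighting, the plain mean aggregation, and the example numbers below are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

# Rows: the candidate's Policies, Persona, Perspective (3P).
# Columns: one audience group's Interests, Ideologies, Identity (3I).
# Each entry is an LLM-judged alignment in [0, 1] for one debate transcript.
alignment_voters = np.array([
    [0.7, 0.5, 0.6],   # Policies    vs Interests / Ideologies / Identity
    [0.6, 0.4, 0.8],   # Persona
    [0.5, 0.6, 0.7],   # Perspective
])

def audience_score(alignment: np.ndarray) -> float:
    """Aggregate a 3x3 alignment matrix into one score for an audience group."""
    return float(alignment.mean())

groups = {"voters": alignment_voters}   # plus businesses, donors, politicians
llm_potus_score = np.mean([audience_score(m) for m in groups.values()])
print(round(float(llm_potus_score), 3))   # 0.6 for this toy matrix
```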
Online spaces allow people to discuss important issues and make joint decisions, regardless of their location or time zone. However, without proper support and thoughtful design, these discussions often lack structure and politeness during the exchanges of opinions. Artificial intelligence (AI) represents an opportunity to support both participants and organizers of large-scale online participation processes. In this paper, we present an extension of adhocracy+, a large-scale open source participation platform, that provides two additional debate modules that are supported by AI to enhance the discussion quality and participant interaction.
Online spaces allow people to discuss important issues and make joint decisions regardless of their location or time zone. However, without proper support and thoughtful design, these discussions often lack structure and politeness during the exchange of opinions. Artificial intelligence (AI) represents an opportunity to support both participants and organizers of large-scale online participation processes. In this paper, we present an extension of adhocracy+, a large-scale open-source participation platform, that adds two AI-supported debate modules to improve discussion quality and participant interaction.
https://arxiv.org/abs/2409.07780
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts performance to ensure the high-fidelity generation required by display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting an occlusion mask, and stereo video inpainting. We utilize pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale and high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
This paper proposes a new framework for converting 2D video to immersive stereoscopic 3D, meeting the growing demand for 3D content in immersive experiences. Using foundation models as priors, our method overcomes the limitations of traditional approaches and boosts performance to ensure the high-fidelity generation that display devices require. The proposed system consists of two main steps: depth-based video splatting for warping and extracting an occlusion mask, and stereo video inpainting. We use pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos of different lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, to support our training, we developed a sophisticated data processing pipeline to build a large-scale, high-quality dataset. Our framework achieves significant improvements in 2D-to-3D video conversion, providing a practical solution for creating immersive content for 3D devices such as Apple Vision Pro and 3D displays. In summary, this work contributes an effective method for generating high-quality stereoscopic video from monocular input, potentially changing how we experience digital media.
https://arxiv.org/abs/2409.07447
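The depth-based splatting step described above can be approximated by shifting each pixel horizontally by a disparity derived from depth and recording which target pixels never receive a source pixel — the occlusion mask handed to the inpainting stage. The sketch below is a simplified forward warp (no depth ordering of overlapping splats), not the paper's implementation.

```python
import numpy as np

def splat_right_view(image: np.ndarray, depth: np.ndarray, max_disparity: int = 16):
    """Forward-warp a left view to a right view using inverse-depth disparity.

    image: (H, W, 3) float array; depth: (H, W) positive depths.
    Returns the warped view and a boolean occlusion mask of unfilled pixels.
    """
    h, w, _ = image.shape
    disparity = (max_disparity / depth.clip(min=1e-6)).astype(int)  # nearer = larger shift
    warped = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):                  # a real system would vectorize this
            xr = x - disparity[y, x]
            if 0 <= xr < w:
                warped[y, xr] = image[y, x]
                filled[y, xr] = True
    occlusion_mask = ~filled                # holes for the stereo inpainting stage
    return warped, occlusion_mask

img = np.random.rand(4, 8, 3)
dep = np.full((4, 8), 4.0)
view, mask = splat_right_view(img, dep)
print(mask.sum(), "pixels left for stereo inpainting")
```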
Materials design often relies on human-generated hypotheses, a process inherently limited by cognitive constraints such as knowledge gaps and limited ability to integrate and extract knowledge implications, particularly when multidisciplinary expertise is required. This work demonstrates that large language models (LLMs), coupled with prompt engineering, can effectively generate non-trivial materials hypotheses by integrating scientific principles from diverse sources without explicit design guidance by human experts. These include design ideas for high-entropy alloys with superior cryogenic properties and halide solid electrolytes with enhanced ionic conductivity and formability. These design ideas have been experimentally validated in high-impact publications in 2023 not available in the LLM training data, demonstrating the LLM's ability to generate highly valuable and realizable innovative ideas not established in the literature. Our approach primarily leverages materials system charts encoding processing-structure-property relationships, enabling more effective data integration by condensing key information from numerous papers, and evaluation and categorization of numerous hypotheses for human cognition, both through the LLM. This LLM-driven approach opens the door to new avenues of artificial intelligence-driven materials discovery by accelerating design, democratizing innovation, and expanding capabilities beyond the designer's direct knowledge.
Materials design often relies on human-generated hypotheses, a process inherently limited by cognitive constraints such as knowledge gaps and a limited ability to integrate and extract the implications of knowledge, particularly when multidisciplinary expertise is required. This work demonstrates that large language models (LLMs), combined with prompt engineering, can effectively generate non-trivial materials hypotheses by integrating scientific principles from diverse sources, without explicit design guidance from human experts. Examples include design ideas for high-entropy alloys with superior cryogenic properties and halide solid electrolytes with enhanced ionic conductivity and formability. These design ideas have been experimentally validated in high-impact publications from 2023 that were not available in the LLM's training data, demonstrating the LLM's ability to generate highly valuable and realizable innovative ideas not yet established in the literature. Our approach primarily leverages materials system charts that encode processing-structure-property relationships, enabling more effective data integration by condensing key information from numerous papers, as well as LLM-based evaluation and categorization of numerous hypotheses for human review. This LLM-driven approach opens new avenues of AI-driven materials discovery by accelerating design, democratizing innovation, and extending capabilities beyond the designer's direct knowledge.
https://arxiv.org/abs/2409.06756
Open-source, multilingual medical large language models (LLMs) have the potential to serve linguistically diverse populations across different regions. Adapting generic LLMs for healthcare often requires continual pretraining, but this approach is computationally expensive and sometimes impractical. Instruction fine-tuning on a specific task may not always guarantee optimal performance due to the lack of broader domain knowledge that the model needs to understand and reason effectively in diverse scenarios. To address these challenges, we introduce two multilingual instruction fine-tuning datasets, MMed-IFT and MMed-IFT-MC, containing over 200k high-quality medical samples in six languages. We propose a two-stage training paradigm: the first stage injects general medical knowledge using MMed-IFT, while the second stage fine-tunes task-specific multiple-choice questions with MMed-IFT-MC. Our method achieves competitive results on both English and multilingual benchmarks, striking a balance between computational efficiency and performance. We plan to make our dataset and model weights public at \url{this https URL} in the future.
Open-source, multilingual medical large language models (LLMs) have the potential to serve linguistically diverse populations across different regions. Adapting generic LLMs to healthcare often requires continual pretraining, but this approach is computationally expensive and sometimes impractical. Instruction fine-tuning on a specific task may not always guarantee optimal performance, because the model lacks the broader domain knowledge it needs to understand and reason effectively across diverse scenarios. To address these challenges, we introduce two multilingual instruction fine-tuning datasets, MMed-IFT and MMed-IFT-MC, containing over 200k high-quality medical samples in six languages. We propose a two-stage training paradigm: the first stage injects general medical knowledge using MMed-IFT, while the second stage fine-tunes on task-specific multiple-choice questions with MMed-IFT-MC. Our method achieves competitive results on both English and multilingual benchmarks, striking a balance between computational efficiency and performance. We plan to make our dataset and model weights public at \url{this https URL} in the future.
https://arxiv.org/abs/2409.05732
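The two-stage paradigm above amounts to two sequential fine-tuning passes with different data and objectives. The sketch below captures only that ordering with placeholder helpers; the stage names, objectives, and the `run` function are illustrative, not the authors' training code.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    dataset: str
    objective: str

# Stage 1 injects broad medical knowledge; stage 2 specializes the model on
# multiple-choice question answering.
PIPELINE = [
    Stage("knowledge-injection", dataset="MMed-IFT", objective="instruction following"),
    Stage("task-finetune", dataset="MMed-IFT-MC", objective="multiple-choice answers"),
]

def run(pipeline, base_model: str = "generic-multilingual-llm") -> str:
    model = base_model
    for stage in pipeline:
        print(f"fine-tuning {model} on {stage.dataset} ({stage.objective})")
        model = f"{model}+{stage.name}"   # stand-in for the updated checkpoint
    return model

print(run(PIPELINE))
```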
Currently, a substantial volume of document data exists in unstructured formats, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, pdfplumber (Plumb a PDF), and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs and pictures. On the other hand, PP-StructureV2 can extract tables from image-based PDFs and pictures. Nevertheless, it lacks the ability to differentiate between diverse application scenarios, such as wired versus wireless tables and digital versus image-based PDFs. To address these issues, we have introduced the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable achieves adaptability across various application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired table dataset and on the open-source wireless table recognition dataset PubTabNet. The PdfTable code will be available on GitHub: this https URL.
Currently, a large volume of document data exists in unstructured formats, including Portable Document Format (PDF) files and images. Extracting information from these documents poses major challenges because of diverse table styles, complex forms, and the presence of different languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to help extract tables from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs and pictures. PP-StructureV2, on the other hand, can extract tables from image-based PDFs and pictures, but it cannot distinguish between different application scenarios, such as wired versus wireless tables and digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. It integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable adapts to a variety of application scenarios. We verify the effectiveness of the PdfTable toolkit on a self-labeled wired table dataset and on the open-source wireless table recognition dataset PubTabNet. The PdfTable code will be released on GitHub: this https URL.
https://arxiv.org/abs/2409.05125
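The scenario-aware routing that PdfTable layers on top of existing extractors can be sketched as a small dispatcher over the page properties named above (digital vs. image-based, wired vs. wireless tables). The predicate fields and extractor names below are placeholders, not PdfTable's actual API.

```python
from dataclasses import dataclass

@dataclass
class PageInfo:
    is_digital_pdf: bool     # True if the page carries a selectable text layer
    has_ruling_lines: bool   # "wired" table (visible borders) vs "wireless"

def choose_extractor(page: PageInfo) -> str:
    """Pick a table-extraction backend for one page.

    Digital PDFs with ruled tables suit lattice-style parsers (e.g. Camelot);
    image-based pages need OCR plus a table-structure recognition model.
    """
    if page.is_digital_pdf:
        return "lattice_parser" if page.has_ruling_lines else "stream_parser"
    return "wired_table_model" if page.has_ruling_lines else "wireless_table_model"

for page in [PageInfo(True, True), PageInfo(False, False)]:
    print(choose_extractor(page))
```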
Fuel cells using oxygen and glucose could power microscopic robots operating in blood vessels. Swarms of such robots can significantly reduce oxygen concentration, depending on the time between successive transits of the lung, hematocrit variation in vessels and tissue oxygen consumption. These factors differ among circulation paths through the body. This paper evaluates how these variations affect the minimum oxygen concentration due to robot consumption and where it occurs: mainly in moderate-sized veins toward the end of long paths prior to their merging with veins from shorter paths. This shows that tens of billions of robots can obtain hundreds of picowatts throughout the body with minor reduction in total oxygen. However, a trillion robots significantly deplete oxygen in some parts of the body. By storing oxygen or limiting their consumption in long circulation paths, robots can actively mitigate this depletion. The variation in behavior is illustrated in three cases: the portal system which involves passage through two capillary networks, the spleen whose slits significantly slow some of the flow, and large tissue consumption in coronary circulation.
Fuel cells running on oxygen and glucose could power microscopic robots operating in blood vessels. Swarms of such robots can significantly reduce oxygen concentration, depending on the time between successive transits of the lung, hematocrit variation across vessels, and tissue oxygen consumption. These factors differ among circulation paths through the body. This paper evaluates how these variations affect the minimum oxygen concentration caused by robot consumption and where it occurs: mainly in moderate-sized veins toward the end of long paths, before they merge with veins from shorter paths. This shows that tens of billions of robots can obtain hundreds of picowatts throughout the body with only a minor reduction in total oxygen. However, a trillion robots would significantly deplete oxygen in some parts of the body. By storing oxygen or limiting their consumption along long circulation paths, robots can actively mitigate this depletion. The variation in behavior is illustrated in three cases: the portal system, which involves passage through two capillary networks; the spleen, whose slits significantly slow some of the flow; and the large tissue consumption of the coronary circulation.
https://arxiv.org/abs/2409.04916
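A back-of-the-envelope check of the scale claims in the abstract above. The robot count and per-robot power follow the stated orders of magnitude; the resting oxygen consumption and the energy yield per liter of oxygen are rough textbook-level assumptions, not numbers from the paper.

```python
# Rough scaling of robot power draw against the body's oxygen budget.
robots = 1e10                  # "tens of billions" of robots
power_per_robot = 200e-12      # a few hundred picowatts each (assumed)
total_power = robots * power_per_robot
print(f"total robot power ~ {total_power:.1f} W")          # ~2 W

# Resting human oxygen consumption is on the order of 0.25 L/min (assumed),
# and aerobic metabolism yields roughly 20 kJ per liter of O2,
# i.e. about 20 kW per (L O2 / s).
o2_rate_l_per_s = 0.25 / 60
metabolic_power = 20_000 * o2_rate_l_per_s
print(f"body's aerobic budget ~ {metabolic_power:.0f} W")   # ~80 W

print(f"robot share of oxygen ~ {total_power / metabolic_power:.1%}")  # a few percent
```

At this scale the robots take only a few percent of the oxygen budget, consistent with the "minor reduction" claim; scaling to a trillion robots multiplies the draw by roughly 100, which is why localized depletion becomes an issue.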
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate its scalability properties. To assist auto-regressive models in predicting over a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project provides an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. We further explore its application in plain auto-regressive models and validate their scalability properties. To help auto-regressive models predict over a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and we further introduce "next sub-token prediction" to enhance the interaction between sub-tokens and improve generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
https://arxiv.org/abs/2409.04410
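Asymmetric token factorization, as described above, splits each code index from the $2^{18}$-entry codebook into two sub-tokens so the auto-regressive head never predicts over 262,144 classes at once. The 6/12-bit split below is an assumed example of an asymmetric split, chosen only for illustration.

```python
CODEBOOK_BITS = 18                 # 2**18 = 262,144 codes
BITS_A, BITS_B = 6, 12             # asymmetric split (assumed sizes)

def factorize(code: int) -> tuple[int, int]:
    """Split one codebook index into two sub-tokens with unequal vocabularies."""
    assert 0 <= code < 2 ** CODEBOOK_BITS
    sub_a = code >> BITS_B               # vocabulary of 2**6  = 64
    sub_b = code & ((1 << BITS_B) - 1)   # vocabulary of 2**12 = 4096
    return sub_a, sub_b

def defactorize(sub_a: int, sub_b: int) -> int:
    return (sub_a << BITS_B) | sub_b

code = 123_456
a, b = factorize(code)
assert defactorize(a, b) == code
print(a, b)   # the model predicts sub_a, then sub_b ("next sub-token prediction")
```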
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
Optical character recognition (OCR) continues to face accuracy challenges that affect downstream applications. To address these errors, we explore the utility of OCR confidence scores for improving post-OCR error detection. Our study analyzes the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capability. This work underscores the importance of OCR confidence scores for improving detection accuracy and reveals substantial performance disparities between commercial and open-source OCR technologies.
https://arxiv.org/abs/2409.04117
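Folding a per-token OCR confidence into the token embedding can be done with a small learned projection added to the usual embedding. The PyTorch sketch below is one interpretation of the description above, not ConfBERT's released code; the module and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceAwareEmbedding(nn.Module):
    """Token embeddings augmented with a projection of OCR confidence scores."""

    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.conf_proj = nn.Linear(1, hidden)   # maps a scalar confidence to hidden dim

    def forward(self, token_ids: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); confidences: (batch, seq) in [0, 1]
        return self.token_emb(token_ids) + self.conf_proj(confidences.unsqueeze(-1))

emb = ConfidenceAwareEmbedding(vocab_size=30_522, hidden=768)
ids = torch.randint(0, 30_522, (2, 5))
conf = torch.rand(2, 5)                 # per-token OCR confidence scores
print(emb(ids, conf).shape)             # torch.Size([2, 5, 768])
```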
Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a novel training framework aimed at UNifying Image and Text recognition within a single model. Starting with a vision encoder pre-trained with image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. The training process comprises two stages: intra-scale pretraining and inter-scale finetuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, where images and documents are at their commonly used resolution, to enable fundamental recognition capability. In the inter-scale finetuning stage, the model introduces scale-exchanged data, featuring images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining the performances on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.
Currently, vision encoder models such as Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition the way human vision does. To address this limitation, we propose UNIT, a novel training framework that unifies image and text recognition within a single model. Starting from a vision encoder pre-trained on image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. Training proceeds in two stages: intra-scale pretraining and inter-scale fine-tuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, with images and documents at their commonly used resolutions, to establish fundamental recognition capability. In the inter-scale fine-tuning stage, the model is exposed to scale-exchanged data, with images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, so it adds no cost at inference or deployment. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining performance on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.
https://arxiv.org/abs/2409.04095
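The two-decoder arrangement described above can be sketched as a shared encoder feeding both a text head and a feature-reconstruction head. The modules below are simplified stand-ins (a generic transformer encoder in place of the ViT, linear heads in place of the decoders), meant only to convey the wiring, not UNIT's implementation.

```python
import torch
import torch.nn as nn

class UNITSketch(nn.Module):
    """A shared encoder with a lightweight text head and a vision head.

    The language head learns text recognition; the vision head reconstructs
    the original image features to curb catastrophic forgetting.
    """

    def __init__(self, hidden: int = 256, vocab: int = 1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # stands in for the ViT
        self.text_head = nn.Linear(hidden, vocab)      # lightweight language decoder
        self.vision_head = nn.Linear(hidden, hidden)   # lightweight vision decoder

    def forward(self, patch_tokens: torch.Tensor):
        feats = self.encoder(patch_tokens)
        return self.text_head(feats), self.vision_head(feats)

model = UNITSketch()
tokens = torch.randn(2, 196, 256)        # (batch, patches, hidden)
text_logits, recon = model(tokens)
print(text_logits.shape, recon.shape)
```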
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory use and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at this https URL.
Multimodal large language models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory use and slower inference, particularly for multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, and to strengthen multi-page document comprehension while balancing token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: single-image pretraining, multi-image continue-pretraining, and multi-task fine-tuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Moreover, compared with single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with fewer than 20% of the visual tokens. Our code, models, and data are publicly available at this https URL.
https://arxiv.org/abs/2409.03420
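The compression step above can be read as cross-attention in which a fixed set of query tokens derived from the low-resolution global view attends over all high-resolution tokens, leaving exactly 324 visual tokens per page. The module below is a plausible reading of that description, not the released DocOwl2 code; names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DocCompressorSketch(nn.Module):
    """Compress high-resolution document tokens into a fixed 324-token summary."""

    def __init__(self, hidden: int = 512, num_out_tokens: int = 324):
        super().__init__()
        self.num_out_tokens = num_out_tokens
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.query_proj = nn.Linear(hidden, hidden)

    def forward(self, global_tokens: torch.Tensor, highres_tokens: torch.Tensor):
        # global_tokens: (B, 324, H) from the low-resolution view (the queries);
        # highres_tokens: (B, N, H) from all high-resolution crops (keys/values).
        queries = self.query_proj(global_tokens[:, : self.num_out_tokens])
        compressed, _ = self.attn(queries, highres_tokens, highres_tokens)
        return compressed                  # (B, 324, H) tokens passed to the LLM

comp = DocCompressorSketch()
out = comp(torch.randn(1, 324, 512), torch.randn(1, 4000, 512))
print(out.shape)                           # torch.Size([1, 324, 512])
```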
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, owing to the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-experts architectures, ranging from 1B to 8x22B parameters, trained with a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance the generalizability and performance of AI agents across varied environments. Our experimental results show that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing first place on the Berkeley Function-Calling Leaderboard and outperforming GPT-4, Claude-3, and many other models in tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL
https://arxiv.org/abs/2409.03215
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process built on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting the candidate options, and (3) introducing a vision-only input setting in which the questions are embedded within images. This setting challenges AI to truly "see" and "read" at the same time, testing the fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios, and offers valuable directions for future research in multimodal AI.
https://arxiv.org/abs/2409.02813
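The first of the three steps above, filtering out questions that text-only models can already answer, can be sketched as below. `text_only_model` is a stand-in callable, and the majority-vote rule is an assumption about how one might make the filter robust, not the benchmark's exact procedure.

```python
from typing import Callable, Dict, List

def filter_text_answerable(questions: List[Dict],
                           text_only_model: Callable[[str, List[str]], str],
                           votes: int = 3) -> List[Dict]:
    """Keep only questions a text-only model cannot reliably answer.

    Each question dict holds 'question', 'options', and 'answer' fields.
    """
    kept = []
    for q in questions:
        prompt = q["question"]                      # no image is provided
        guesses = [text_only_model(prompt, q["options"]) for _ in range(votes)]
        # Drop the question if the blind model gets it right by majority vote.
        if guesses.count(q["answer"]) <= votes // 2:
            kept.append(q)
    return kept

# Toy stand-in model that always answers "A".
always_a = lambda prompt, options: "A"
qs = [{"question": "q1", "options": ["A", "B"], "answer": "A"},
      {"question": "q2", "options": ["A", "B"], "answer": "B"}]
print(len(filter_text_answerable(qs, always_a)))    # 1: the first question is removed
```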
Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's needs, given the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain text, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all of the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both slice and whole-page formats. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. In addition, the model offers interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic-resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to demonstrate the superiority of our model.
Traditional OCR systems (OCR-1.0) are increasingly unable to keep up with the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (such as plain text, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory together with an excellent model, GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all of the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both slice and whole-page formats. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) through a simple prompt. The model also supports interactive OCR features, i.e., region-level recognition guided by coordinates or colors. We further adapt dynamic-resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to demonstrate the superiority of our model.
https://arxiv.org/abs/2409.01704
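The prompt-driven interface the GOT abstract above describes (plain vs. formatted output, region recognition by coordinates or color) could look roughly like the calls below. The function `ocr` and its keyword arguments are hypothetical, sketched only to make the interaction pattern concrete; they are not the released GOT API.

```python
from typing import Optional, Tuple

def ocr(image_path: str,
        output_format: str = "plain",                 # "plain", "markdown", "tikz", "kern", ...
        region: Optional[Tuple[int, int, int, int]] = None,   # (x1, y1, x2, y2)
        region_color: Optional[str] = None) -> str:
    """Hypothetical wrapper around an OCR-2.0-style model.

    Builds a prompt selecting the output format and, optionally, an
    interactive region given by coordinates or by a marked color.
    """
    prompt = f"OCR the image as {output_format}."
    if region is not None:
        prompt += f" Only read the box {region}."
    if region_color is not None:
        prompt += f" Only read the {region_color} region."
    # A real system would now run the encoder-decoder model on (image, prompt).
    return f"[model output for: {prompt}]"

print(ocr("sheet_music.png", output_format="kern"))
print(ocr("receipt.jpg", region=(40, 100, 380, 220)))
```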