This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
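The abstract does not spell out the decoding step, so the sketch below illustrates one plausible reading of "minimum Bayesian decoding" as minimum Bayes risk (MBR) style candidate selection: sample several translations and keep the one with the highest total pairwise utility. The token-overlap utility is a hypothetical stand-in for whatever metric HW-TSC actually used.

```python
# Minimal sketch of minimum Bayes risk (MBR) style selection over sampled
# candidate translations; overlap_utility is a hypothetical stand-in utility.
def overlap_utility(hyp: str, ref: str) -> float:
    """Crude F1-style token overlap between two translations."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    common = len(set(h) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates: list[str]) -> str:
    """Keep the candidate with the highest total utility against the others."""
    best_idx, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        score = sum(overlap_utility(hyp, ref)
                    for j, ref in enumerate(candidates) if j != i)
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx]

samples = ["the quarterly report shows steady growth",
           "the quarterly report shows a steady growth",
           "steady growth report the quarter shows"]
print(mbr_select(samples))  # picks the candidate closest to the others
```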
https://arxiv.org/abs/2504.17315
The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks pose challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded systems. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.
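As a rough illustration of the descriptor-distillation objective mentioned above, the following PyTorch sketch combines a closed-form Orthogonal Procrustes alignment with a pairwise-similarity term; the exact weighting and normalization used in EdgePoint2 may differ.

```python
# Sketch, not the paper's code: align frozen teacher descriptors with compact
# student descriptors via the best (semi-)orthogonal map, plus a similarity term.
import torch
import torch.nn.functional as F

def procrustes_distill_loss(teacher: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """teacher: (N, Dt) frozen descriptors, student: (N, Ds) compact descriptors."""
    t = F.normalize(teacher, dim=-1)
    s = F.normalize(student, dim=-1)
    with torch.no_grad():
        # Closed-form Procrustes solution: R = U V^T from the SVD of T^T S.
        u, _, vh = torch.linalg.svd(t.T @ s, full_matrices=False)
        rotation = u @ vh                      # shape (Dt, Ds), semi-orthogonal
    return ((t @ rotation - s) ** 2).sum(dim=-1).mean()

def similarity_loss(teacher: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """Match the pairwise cosine-similarity structure of the two embedding spaces."""
    t = F.normalize(teacher, dim=-1)
    s = F.normalize(student, dim=-1)
    return (t @ t.T - s @ s.T).abs().mean()

# Example: distil 256-d teacher descriptors into 64-d student descriptors.
teacher_desc = torch.randn(512, 256)
student_desc = torch.randn(512, 64, requires_grad=True)
loss = procrustes_distill_loss(teacher_desc, student_desc) + similarity_loss(teacher_desc, student_desc)
loss.backward()
print(loss.item())
```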
https://arxiv.org/abs/2504.17280
Despite significant interest and advancements in humanoid robotics, most existing commercially available hardware remains high-cost, closed-source, and non-transparent within the robotics community. This lack of accessibility and customization hinders the growth of the field and the broader development of humanoid technologies. To address these challenges and promote democratization in humanoid robotics, we demonstrate Berkeley Humanoid Lite, an open-source humanoid robot designed to be accessible, customizable, and beneficial for the entire community. The core of this design is a modular 3D-printed gearbox for the actuators and robot body. All components can be sourced from widely available e-commerce platforms and fabricated using standard desktop 3D printers, keeping the total hardware cost under $5,000 (based on U.S. market prices). The design emphasizes modularity and ease of fabrication. To address the inherent limitations of 3D-printed gearboxes, such as reduced strength and durability compared to metal alternatives, we adopted a cycloidal gear design, which provides an optimal form factor in this context. Extensive testing was conducted on the 3D-printed actuators to validate their durability and alleviate concerns about the reliability of plastic components. To demonstrate the capabilities of Berkeley Humanoid Lite, we conducted a series of experiments, including the development of a locomotion controller using reinforcement learning. These experiments successfully showcased zero-shot policy transfer from simulation to hardware, highlighting the platform's suitability for research validation. By fully open-sourcing the hardware design, embedded code, and training and deployment frameworks, we aim for Berkeley Humanoid Lite to serve as a pivotal step toward democratizing the development of humanoid robotics. All resources are available at this https URL.
https://arxiv.org/abs/2504.17249
Memes are widely used for humor and cultural commentary, but they are increasingly exploited to spread hateful content. Due to their multimodal nature, hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. To address these challenges, we propose a multimodal hate detection framework that integrates key components: OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization of hateful content, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. This enables the framework to uncover latent signals that simpler pipelines fail to detect. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
https://arxiv.org/abs/2504.16723
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detection, motion filtering, and fine bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: this https URL
https://arxiv.org/abs/2504.15182
The performance of OCR has improved with the evolution of AI technology. However, as OCR continues to broaden its range of applications, interference introduced by the various usage environments becomes more likely and can prevent OCR from achieving its inherent performance. This results in reduced recognition accuracy under certain conditions and makes the quality control of recognition devices more challenging. Therefore, to ensure that users can properly utilize OCR, we compiled the real-world external disturbance factors that cause performance degradation, along with the resulting image degradation phenomena, into an external disturbance factor table and, together with guidance on how to make use of it, organized them into guidelines.
https://arxiv.org/abs/2504.14913
Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduce Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.
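For reference, the Dice similarity coefficient (DSC) reported above can be computed for binary masks as follows; this is the standard definition, not code from the paper.

```python
# DSC = 2|P ∩ T| / (|P| + |T|) for boolean segmentation masks.
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Example: two 4x4 masks overlapping on 3 of 4 foreground pixels -> DSC = 0.75.
p = np.zeros((4, 4)); p[0, :4] = 1
t = np.zeros((4, 4)); t[0, 1:] = 1; t[1, 0] = 1
print(round(dice_coefficient(p, t), 2))
```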
https://arxiv.org/abs/2504.14715
Achieving human-level dexterity in robotic hands remains a fundamental challenge for enabling versatile manipulation across diverse applications. This extended abstract presents BiDexHand, a cable-driven biomimetic robotic hand that combines human-like dexterity with accessible and efficient mechanical design. The robotic hand features 16 independently actuated degrees of freedom and 5 mechanically coupled joints through novel phalange designs that replicate natural finger motion. Performance validation demonstrated success across all 33 grasp types in the GRASP Taxonomy, 9 of 11 positions in the Kapandji thumb opposition test, a measured fingertip force of 2.14 N, and the capability to lift a 10 lb weight. As an open-source platform supporting multiple control modes including vision-based teleoperation, BiDexHand aims to democratize access to advanced manipulation capabilities for the broader robotics research community.
https://arxiv.org/abs/2504.14712
Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize data science by enabling those with limited programming expertise to conduct data analyses, including in scientific research, business, and policymaking. However, analysts in many real-world settings must often exercise fine-grained control over specific analysis steps, verify intermediate results explicitly, and iteratively refine their analytical approaches. Such tasks present barriers to building robust and reproducible analyses using LLMs alone or even in conjunction with existing authoring tools (e.g., computational notebooks). This paper introduces Flowco, a new mixed-initiative system to address these challenges. Flowco leverages a visual dataflow programming model and integrates LLMs into every phase of the authoring process. A user study suggests that Flowco supports analysts, particularly those with less programming experience, in quickly authoring, debugging, and refining data analyses.
https://arxiv.org/abs/2504.14038
The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: this https URL
https://arxiv.org/abs/2504.13828
Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.
https://arxiv.org/abs/2504.13959
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify the strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at this https URL.
https://arxiv.org/abs/2504.13945
Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses during the early phases of survey design. While previous studies have examined whether models can reflect individual opinions or attitudes, we argue that a higher-order binding of virtual personas requires successfully approximating not only the opinions of a user as an identified member of a group, but also the nuanced ways in which that user perceives and evaluates those outside the group. In particular, faithfully simulating how humans perceive different social groups is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies. Altogether, our work extends the applicability of LLMs beyond estimating individual self-opinions, enabling their use in a broader range of human studies.
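The 87% figure refers to how closely simulated and human response distributions match under the Wasserstein distance; a minimal example of that comparison on a Likert-style item (with made-up numbers, not data from the paper) is shown below.

```python
# 1-D Wasserstein distance between human and simulated answer distributions.
from scipy.stats import wasserstein_distance

likert = [1, 2, 3, 4, 5]
human_probs   = [0.10, 0.20, 0.30, 0.25, 0.15]   # hypothetical survey responses
persona_probs = [0.12, 0.18, 0.28, 0.27, 0.15]   # hypothetical LLM persona responses

dist = wasserstein_distance(likert, likert, human_probs, persona_probs)
print(f"Wasserstein distance: {dist:.3f}")  # smaller = closer match to humans
```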
https://arxiv.org/abs/2504.11673
The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and providing high-quality data sources for LLM training data. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while maintaining plug-and-play functionality throughout.
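The convergence intuition behind Consensus Entropy can be illustrated with a small agreement-scoring sketch: outputs that agree with the other models are trusted, and low-agreement samples are flagged for review. The string-similarity score below is a stand-in for the paper's exact entropy formulation.

```python
# Score each VLM's OCR output by agreement with the other models' outputs.
from difflib import SequenceMatcher

def agreement(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def consensus_select(outputs: list[str], flag_threshold: float = 0.8):
    scores = [
        sum(agreement(o, other) for j, other in enumerate(outputs) if j != i)
        / (len(outputs) - 1)
        for i, o in enumerate(outputs)
    ]
    best = max(range(len(outputs)), key=scores.__getitem__)
    needs_review = scores[best] < flag_threshold   # errors diverge -> low agreement
    return outputs[best], scores[best], needs_review

texts = ["Total due: $42.10", "Total due: $42.10", "Total dve: $4Z.1O"]
print(consensus_select(texts))
```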
https://arxiv.org/abs/2504.11101
Spatial imbalances in crop type data pose significant challenges for accurate classification in remote sensing applications. Algorithms aiming at transferring knowledge from data-rich to data-scarce tasks have thus surged in popularity. However, despite their effectiveness in previous evaluations, their performance in challenging real-world applications is unclear and needs to be evaluated. This study benchmarks transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the real-world EuroCropsML time series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to simpler transfer learning methods when applied to crop type classification tasks in Estonia after pre-training on data from Latvia. However, this improvement comes at the cost of increased computational demands and training time. Moreover, we find that the transfer of knowledge between geographically disparate regions, such as Estonia and Portugal, poses significant challenges to all investigated algorithms. These insights underscore the trade-offs between accuracy and computational resource requirements in selecting machine learning methods for real-world crop type classification tasks and highlight the difficulties of transferring knowledge between different regions of the Earth. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating transfer and meta-learning methods for crop type classification under real-world conditions. The corresponding code is publicly available at this https URL.
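For readers unfamiliar with the meta-learning baselines, the sketch below shows a minimal first-order MAML (FO-MAML) meta-update of the kind benchmarked here; the tiny model, task sampler, and hyperparameters are placeholders rather than the EuroCropsML configuration.

```python
# Minimal first-order MAML (FO-MAML) sketch: adapt a copy per task on its
# support set, then update the meta-parameters with the query-set gradients.
import copy
import torch
import torch.nn as nn

def fomaml_step(meta_model: nn.Module, tasks: list, inner_lr: float = 0.01,
                inner_steps: int = 5, meta_lr: float = 0.001) -> None:
    """One meta-update. `tasks` is a list of (support_x, support_y, query_x, query_y)."""
    loss_fn = nn.CrossEntropyLoss()
    meta_grads = [torch.zeros_like(p) for p in meta_model.parameters()]
    for sx, sy, qx, qy in tasks:
        learner = copy.deepcopy(meta_model)                 # task-specific copy
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                        # inner loop: adapt on support set
            opt.zero_grad()
            loss_fn(learner(sx), sy).backward()
            opt.step()
        opt.zero_grad()
        loss_fn(learner(qx), qy).backward()                 # evaluate adaptation on query set
        for g, p in zip(meta_grads, learner.parameters()):
            g += p.grad                                     # first-order: reuse adapted grads
    with torch.no_grad():                                   # outer loop: update meta-parameters
        for p, g in zip(meta_model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)

# Toy classifier over 12-dimensional time-series features and 5 crop classes.
model = nn.Sequential(nn.Linear(12, 32), nn.ReLU(), nn.Linear(32, 5))
toy_tasks = [(torch.randn(20, 12), torch.randint(0, 5, (20,)),
              torch.randn(20, 12), torch.randint(0, 5, (20,))) for _ in range(4)]
fomaml_step(model, toy_tasks)
```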
https://arxiv.org/abs/2504.11022
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Moreover, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at this https URL.
https://arxiv.org/abs/2504.10659
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
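The retrieval step in such a framework boils down to dense nearest-neighbour search over page embeddings; the generic sketch below uses random vectors as stand-ins for VDocRAG's actual image and question encoders.

```python
# Generic dense-retrieval sketch: rank document pages by cosine similarity
# between the encoded question and encoded page images.
import numpy as np

rng = np.random.default_rng(0)
page_embeddings = rng.normal(size=(100, 768))     # stand-in for encoded page images
query_embedding = rng.normal(size=768)            # stand-in for the encoded question

def top_k_pages(query: np.ndarray, pages: np.ndarray, k: int = 5) -> np.ndarray:
    q = query / np.linalg.norm(query)
    p = pages / np.linalg.norm(pages, axis=1, keepdims=True)
    scores = p @ q                                # cosine similarity per page
    return np.argsort(-scores)[:k]                # indices of the k best pages

print(top_k_pages(query_embedding, page_embeddings))
```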
https://arxiv.org/abs/2504.09795
Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.
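Since the key idea above is turning a rendered chart into symbolic primitives plus an SVG view, here is a purely hypothetical illustration of that representation; the primitive schema and rendering are invented for this sketch, not Socratic Chart's actual agent outputs.

```python
# Hypothetical chart primitives (e.g., bar heights) rendered as a minimal SVG string.
primitives = {
    "bars": [("Q1", 120), ("Q2", 180), ("Q3", 150)],   # (label, value in data units)
    "y_max": 200,
}

def bars_to_svg(prims: dict, width: int = 300, height: int = 150) -> str:
    bar_w = width // (2 * len(prims["bars"]))
    parts = [f'<svg width="{width}" height="{height}">']
    for i, (label, value) in enumerate(prims["bars"]):
        h = int(height * value / prims["y_max"])
        x = (2 * i + 1) * bar_w
        parts.append(f'<rect x="{x}" y="{height - h}" width="{bar_w}" height="{h}">'
                     f'<title>{label}: {value}</title></rect>')
    parts.append("</svg>")
    return "\n".join(parts)

print(bars_to_svg(primitives))
```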
https://arxiv.org/abs/2504.09764
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
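Two of the metrics listed above, mean reciprocal rank for retrieval and IoU for bounding-box evidence, are reproduced below using their standard definitions (not code released with NoTeS-Bank).

```python
# Reference-style implementations of MRR (retrieval) and IoU (box evidence).
def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """ranked_ids[i] is the ranked retrieval list for query i; gold_ids[i] its answer doc."""
    rr = []
    for ranking, gold in zip(ranked_ids, gold_ids):
        rr.append(1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0)
    return sum(rr) / len(rr)

def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two (x1, y1, x2, y2) boxes, as used for bounding-box evidence."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

print(mean_reciprocal_rank([["d2", "d7", "d1"]], ["d7"]))   # 0.5
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))    # 0.333
```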
https://arxiv.org/abs/2504.09249
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
https://arxiv.org/abs/2504.08641