Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses during the early phases of survey design. While previous studies have examined whether models can reflect individual opinions or attitudes, we argue that a \emph{higher-order} binding of virtual personas requires successfully approximating not only the opinions of a user as an identified member of a group, but also the nuanced ways in which that user perceives and evaluates those outside the group. In particular, faithfully simulating how humans perceive different social groups is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user ``backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87\% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies. Altogether, our work extends the applicability of LLMs beyond estimating individual self-opinions, enabling their use in a broader range of human studies.
https://arxiv.org/abs/2504.11673
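The abstract above reports distributional fidelity via Wasserstein Distance. Below is a minimal sketch of how such a comparison could be computed for ordinal survey responses using scipy; the response arrays and the baseline persona are hypothetical, not data from the paper.

```python
# Sketch: comparing a simulated response distribution against human survey data
# with the 1-D Wasserstein (earth mover's) distance. The arrays are made-up
# Likert-scale responses, not results from the paper.
import numpy as np
from scipy.stats import wasserstein_distance

human_responses = np.array([1, 2, 2, 3, 4, 4, 4, 5, 5, 5])      # hypothetical
persona_responses = np.array([1, 2, 3, 3, 3, 4, 4, 5, 5, 5])    # hypothetical (backstory-conditioned)
baseline_responses = np.array([3, 3, 3, 3, 3, 3, 4, 4, 4, 4])   # hypothetical baseline persona

wd_persona = wasserstein_distance(human_responses, persona_responses)
wd_baseline = wasserstein_distance(human_responses, baseline_responses)

# Relative improvement over the baseline, mirroring how an "X% improvement in
# Wasserstein Distance" could be reported.
improvement = (wd_baseline - wd_persona) / wd_baseline
print(f"WD (persona) = {wd_persona:.3f}, WD (baseline) = {wd_baseline:.3f}, "
      f"improvement = {improvement:.1%}")
```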
The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and providing high-quality data sources for LLM training. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs, and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2\% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0\% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3\% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while maintaining plug-and-play functionality throughout.
https://arxiv.org/abs/2504.11101
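To illustrate the convergence/divergence intuition behind Consensus Entropy, here is a hedged sketch of a pairwise-agreement score over multiple VLM OCR outputs. This is not the paper's exact CE formula; it uses normalized edit similarity from the standard library, and the sample outputs and threshold are hypothetical.

```python
# Intuition sketch: correct predictions from different VLMs tend to agree,
# errors tend to diverge, so low pairwise agreement flags suspect samples.
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_agreement(outputs: list[str]) -> float:
    """Mean pairwise similarity in [0, 1]; low values signal disagreement."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def flag_low_quality(outputs: list[str], threshold: float = 0.8):
    """Return (consensus_score, needs_review) for one OCR sample."""
    score = pairwise_agreement(outputs)
    return score, score < threshold

# Hypothetical outputs from three VLMs for the same image region.
sample = ["Invoice total: $1,240.50", "Invoice total: $1,240.50", "Invoice total: $1,240.80"]
print(flag_low_quality(sample))  # high agreement -> likely fine
```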
Spatial imbalances in crop type data pose significant challenges for accurate classification in remote sensing applications. Algorithms aiming at transferring knowledge from data-rich to data-scarce tasks have thus surged in popularity. However, despite their effectiveness in previous evaluations, their performance in challenging real-world applications is unclear and needs to be evaluated. This study benchmarks transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the real-world EuroCropsML time series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to simpler transfer learning methods when applied to crop type classification tasks in Estonia after pre-training on data from Latvia. However, this improvement comes at the cost of increased computational demands and training time. Moreover, we find that the transfer of knowledge between geographically disparate regions, such as Estonia and Portugal, poses significant challenges to all investigated algorithms. These insights underscore the trade-offs between accuracy and computational resource requirements in selecting machine learning methods for real-world crop type classification tasks and highlight the difficulties of transferring knowledge between different regions of the Earth. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating transfer and meta-learning methods for crop type classification under real-world conditions. The corresponding code is publicly available at this https URL.
https://arxiv.org/abs/2504.11022
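For readers unfamiliar with the meta-learning algorithms benchmarked above, the sketch below shows one first-order MAML (FO-MAML) adaptation step in PyTorch. The toy classifier, feature dimension, and task tensors are placeholders, not the EuroCropsML time-series setup.

```python
# Minimal FO-MAML step: adapt a copy on the support set, then apply the
# adapted model's query-set gradients to the meta-parameters (first-order).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 10))  # toy classifier
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def fomaml_step(tasks):
    """tasks: list of (support_x, support_y, query_x, query_y) tensors."""
    meta_opt.zero_grad()
    for sx, sy, qx, qy in tasks:
        learner = copy.deepcopy(model)                  # task-specific copy
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        inner_opt.zero_grad()
        F.cross_entropy(learner(sx), sy).backward()     # inner loop on support set
        inner_opt.step()
        query_loss = F.cross_entropy(learner(qx), qy)   # evaluate adapted learner
        grads = torch.autograd.grad(query_loss, learner.parameters())
        for p, g in zip(model.parameters(), grads):     # first-order meta-gradient
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()

# Hypothetical task with 12-dimensional features and 10 classes.
task = (torch.randn(20, 12), torch.randint(0, 10, (20,)),
        torch.randn(20, 12), torch.randint(0, 10, (20,)))
fomaml_step([task])
```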
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Moreover, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between content and layout. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format that captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results demonstrate that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at this https URL.
https://arxiv.org/abs/2504.10659
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
https://arxiv.org/abs/2504.09795
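The retrieval step that VDocRAG-style systems rely on can be pictured as dense scoring over page-image embeddings. The sketch below is a generic illustration under stated assumptions: `embed_pages` and `embed_query` are hypothetical stand-ins for the paper's adapted vision-language encoder, and the page tensor is random.

```python
# Sketch: pages kept as images are embedded into dense vectors and scored
# against a query embedding; the top-k pages feed the generator.
import torch
import torch.nn.functional as F

def embed_pages(page_images: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real system would compress page tokens into one vector per page.
    return torch.randn(page_images.shape[0], 768)

def embed_query(query: str) -> torch.Tensor:
    return torch.randn(768)  # placeholder text embedding

def retrieve_top_k(query: str, page_images: torch.Tensor, k: int = 3):
    page_vecs = F.normalize(embed_pages(page_images), dim=-1)
    query_vec = F.normalize(embed_query(query), dim=0)
    scores = page_vecs @ query_vec                     # cosine similarity
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()

# Hypothetical corpus of 10 page images (3 x 224 x 224).
pages = torch.randn(10, 3, 224, 224)
print(retrieve_top_k("What was Q3 revenue?", pages, k=3))
```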
Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.
https://arxiv.org/abs/2504.09764
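A concrete way to picture the output of a Socratic-Chart-style pipeline is the final conversion of extracted primitives into SVG. The sketch below assumes bar heights have already been produced by the agent-generators (the labels and values are hypothetical) and simply re-expresses them as standard SVG markup.

```python
# Sketch: turn extracted bar-chart primitives into a symbolic SVG representation.
def bars_to_svg(labels, values, width=400, height=200, pad=20):
    """Render a minimal bar chart as an SVG string from extracted primitives."""
    vmax = max(values)
    bar_w = (width - 2 * pad) / len(values)
    rects = []
    for i, (label, v) in enumerate(zip(labels, values)):
        h = (height - 2 * pad) * v / vmax
        x = pad + i * bar_w
        y = height - pad - h
        rects.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_w * 0.8:.1f}" '
            f'height="{h:.1f}"><title>{label}: {v}</title></rect>'
        )
    body = "\n  ".join(rects)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
            f'  {body}\n</svg>')

# Hypothetical primitives produced by the agent-generators.
print(bars_to_svg(["2021", "2022", "2023"], [12.0, 18.5, 9.3]))
```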
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets that rely on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
https://arxiv.org/abs/2504.09249
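Two of the metrics listed above, ANLS for answer strings and IoU for bounding-box evidence, follow standard definitions and are easy to reproduce. The sketch below implements both; the threshold, example strings, and boxes are illustrative only.

```python
# Sketch of ANLS and IoU as commonly defined for document VQA evaluation.
def anls(prediction: str, ground_truth: str, tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity for a single answer pair."""
    a, b = prediction.lower().strip(), ground_truth.lower().strip()
    # Levenshtein distance via dynamic programming over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    nls = 1 - dp[len(b)] / max(len(a), len(b), 1)
    return nls if nls >= tau else 0.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(anls("Maxwell's equations", "maxwell equations"))
print(iou((10, 10, 110, 60), (30, 20, 130, 80)))
```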
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
https://arxiv.org/abs/2504.08641
We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. The DTC dataset is already released at this https URL and we will also make the baseline evaluations open-source.
https://arxiv.org/abs/2504.08541
Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD's Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.
https://arxiv.org/abs/2504.08451
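The "orthogonal parameter updates" mentioned above refer to Muon-style optimization, where the (momentum) gradient of a weight matrix is replaced by an approximately orthogonal matrix before being applied. The sketch below uses the classical cubic Newton-Schulz iteration as an illustration; the actual Muon optimizer uses a tuned polynomial and extra bookkeeping, so treat the function names, learning rate, and shapes as assumptions.

```python
# Sketch: orthogonalize a 2-D update matrix with Newton-Schulz, then apply it.
import torch

def orthogonalize(update: torch.Tensor, steps: int = 15) -> torch.Tensor:
    """Approximate the orthogonal (polar) factor of a 2-D update matrix."""
    x = update / (update.norm() + 1e-7)      # scale so singular values are < 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x      # Newton-Schulz: X <- 1.5X - 0.5 X X^T X
    return x

def muon_like_step(weight, grad, momentum, beta=0.9, lr=0.02):
    """One hedged, simplified update: momentum -> orthogonalize -> apply."""
    momentum.mul_(beta).add_(grad)
    weight.data.add_(orthogonalize(momentum), alpha=-lr)

w = torch.randn(64, 32)
g = torch.randn(64, 32)
m = torch.zeros_like(w)
muon_like_step(w, g, m)
# Rough check: columns of the orthogonalized update are close to orthonormal.
q = orthogonalize(g)
print(torch.linalg.norm(q.T @ q - torch.eye(32)))
```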
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
https://arxiv.org/abs/2504.07954
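The GRPO algorithm named above centers on a group-relative advantage: several responses are sampled per prompt and each response's reward is normalized against its own group. The sketch below shows that computation plus a PPO-style clipped surrogate; the reward values are hypothetical (e.g., IoU with a ground-truth box for a grounding task).

```python
# Sketch of GRPO-style group-relative advantages and the clipped policy loss.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> per-response advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate applied to group-relative advantages."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Two prompts, four sampled responses each, with hypothetical perception rewards.
rewards = torch.tensor([[0.9, 0.2, 0.4, 0.8],
                        [0.1, 0.1, 0.6, 0.3]])
print(grpo_advantages(rewards))
```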
Generative AI presents a profound challenge to traditional notions of human uniqueness, particularly in creativity. Fueled by neural-network-based foundation models, these systems demonstrate remarkable content generation capabilities, sparking intense debates about authorship, copyright, and intelligence itself. This paper argues that generative AI represents an alternative form of intelligence and creativity, operating through mathematical pattern synthesis rather than biological understanding or verbatim replication. The fundamental differences between artificial and biological neural networks reveal AI learning as primarily statistical pattern extraction from vast datasets: crystallized forms of collective human knowledge scraped from the internet. This perspective complicates copyright-theft narratives and highlights practical challenges in attributing AI outputs to individual sources. Rather than pursuing potentially futile legal restrictions, we advocate for human-AI synergy. By embracing generative AI as a complementary tool alongside human intuition, context, and ethical judgment, society can unlock unprecedented innovation, democratize creative expression, and address complex challenges. This collaborative approach, grounded in a realistic understanding of AI's capabilities and limitations, offers the most promising path forward. Additionally, recognizing these models as products of collective human knowledge raises ethical questions about accessibility: ensuring equitable access to these tools could prevent widening societal divides and leverage their full potential for collective benefit.
https://arxiv.org/abs/2504.07936
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at this https URL.
https://arxiv.org/abs/2504.07491
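The key architectural idea that lets a model like Kimi-VL keep total parameters large while activating only a few billion per token is sparse Mixture-of-Experts routing. The sketch below shows a generic top-k MoE layer; the expert count, hidden sizes, and top-k are illustrative and not Kimi-VL's actual configuration.

```python
# Sketch of top-k MoE routing: each token is processed by only k experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)                  # torch.Size([16, 512])
```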
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the \emph{static} inference paradigm, which inevitably introduces redundant computation in certain \emph{diffusion timesteps} and \emph{spatial regions}. To overcome this inefficiency, we propose \textbf{Dy}namic \textbf{Di}ffusion \textbf{T}ransformer (DyDiT), an architecture that \emph{dynamically} adjusts its computation along both \emph{timestep} and \emph{spatial} dimensions. Specifically, we introduce a \emph{Timestep-wise Dynamic Width} (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a \emph{Spatial-wise Dynamic Token} (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Second, we extend DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.
https://arxiv.org/abs/2504.06803
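To make the timestep-conditioned width idea concrete, here is a simplified sketch in which a small MLP maps the diffusion timestep embedding to per-head keep probabilities, so fewer attention heads are evaluated at "easy" timesteps. The module name, dimensions, and hard threshold are assumptions for illustration, not the paper's exact TDW/SDT design.

```python
# Sketch: gate attention heads as a function of the diffusion timestep embedding.
import torch
import torch.nn as nn

class TimestepHeadGate(nn.Module):
    def __init__(self, t_dim=128, num_heads=12):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, num_heads))

    def forward(self, t_emb, threshold=0.5):
        keep_prob = torch.sigmoid(self.mlp(t_emb))       # (batch, num_heads)
        # Hard gate at inference: heads below the threshold are skipped entirely.
        return (keep_prob > threshold).float(), keep_prob

gate = TimestepHeadGate()
t_emb = torch.randn(4, 128)                              # hypothetical timestep embeddings
hard_mask, soft_mask = gate(t_emb)
attn_out = torch.randn(4, 12, 256, 64)                   # (batch, heads, tokens, head_dim)
gated = attn_out * hard_mask[:, :, None, None]           # zero out skipped heads
print(gated.shape)
```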
Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
https://arxiv.org/abs/2504.12312
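The generate-then-refine pipeline described above can be illustrated with a single fallacy pattern. In the sketch below, a rule produces a logically invalid skeleton (affirming the consequent), and the LLM rewording step is replaced by a placeholder; the premise templates are hypothetical and the Prolog oracle is not reproduced here.

```python
# Sketch: rule-generated fallacious skeleton, later reworded by an LLM.
import random

PREMISES = [  # hypothetical (condition, consequence) pairs
    ("it rains", "the street gets wet"),
    ("the server is overloaded", "requests time out"),
]

def affirming_the_consequent(condition: str, consequence: str) -> str:
    """If P then Q; Q; therefore P -- an invalid inference by construction."""
    return (f"If {condition}, then {consequence}. "
            f"{consequence.capitalize()}. Therefore, {condition}.")

def refine_with_llm(statement: str) -> str:
    # Placeholder for the LLM rewording step described in the abstract.
    return statement

skeleton = affirming_the_consequent(*random.choice(PREMISES))
print(refine_with_llm(skeleton))
```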
Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, which are crucial for precisely identifying entity-pair relationships, help models focus on essential text segments and improve DocRE performance. However, existing evidence retrieval systems often overlook the collaborative nature among semantically similar entity pairs in the same document, hindering the effectiveness of the evidence retrieval task. To address this, we propose a novel evidence retrieval framework, namely CDER. CDER employs an attentional graph-based architecture to capture collaborative patterns and incorporates a dynamic sub-structure for additional robustness in evidence retrieval. Experimental results on the benchmark DocRE dataset show that CDER not only excels in the evidence retrieval task but also enhances the overall performance of existing DocRE systems.
https://arxiv.org/abs/2504.06529
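The core scoring step in attention-based evidence retrieval can be pictured as an entity-pair query attending over sentence representations. The sketch below is a generic illustration with placeholder encoders and dimensions; it does not reproduce CDER's graph architecture or collaborative modeling.

```python
# Sketch: score each sentence as evidence for a (head, tail) entity pair.
import torch
import torch.nn as nn

class EvidenceScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.query_proj = nn.Linear(2 * dim, dim)   # head + tail entity embeddings
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, head_emb, tail_emb, sentence_embs):
        q = self.query_proj(torch.cat([head_emb, tail_emb], dim=-1))  # (dim,)
        k = self.key_proj(sentence_embs)                              # (num_sents, dim)
        scores = k @ q / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1)                                 # evidence distribution

scorer = EvidenceScorer()
sent_embs = torch.randn(8, 256)          # hypothetical sentence encodings
probs = scorer(torch.randn(256), torch.randn(256), sent_embs)
print(probs.topk(2).indices)             # two most likely evidence sentences
```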
Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: this https URL
https://arxiv.org/abs/2504.06422
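Once landmarks are detected, diagnostic parameters like the Graf alpha angle reduce to simple geometry: the angle between the iliac baseline and the bony-roof line. The sketch below is a generic geometry helper with made-up landmark coordinates, not Retuve's actual API.

```python
# Sketch: derive an alpha-angle-style measurement from two landmark-defined lines.
import numpy as np

def angle_between(line_a, line_b) -> float:
    """Angle in degrees between two lines, each given as ((x1, y1), (x2, y2))."""
    va = np.array(line_a[1]) - np.array(line_a[0])
    vb = np.array(line_b[1]) - np.array(line_b[0])
    cos = abs(np.dot(va, vb)) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

baseline = ((120, 40), (120, 260))       # hypothetical iliac baseline landmarks
bony_roof = ((120, 180), (40, 120))      # hypothetical bony-roof landmarks
alpha = angle_between(baseline, bony_roof)
print(f"alpha angle ~ {alpha:.1f} degrees")
```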
Test automation has become increasingly important as the complexity of both design and content in Human Machine Interface (HMI) software continues to grow. Current standard practice uses Optical Character Recognition (OCR) techniques to automatically extract textual information from HMI screens for validation. At present, one of the key challenges faced during the automation of HMI screen validation is the noise handling for the OCR models. In this paper, we propose to utilize adversarial training techniques to enhance OCR models in HMI testing scenarios. More specifically, we design a new adversarial attack objective for OCR models to discover the decision boundaries in the context of HMI testing. We then adopt adversarial training to optimize the decision boundaries towards a more robust and accurate OCR model. In addition, we also built an HMI screen dataset based on real-world requirements and applied multiple types of perturbation onto the clean HMI dataset to provide a more complete coverage for the potential scenarios. We conduct experiments to demonstrate how using adversarial training techniques yields more robust OCR models against various kinds of noises, while still maintaining high OCR model accuracy. Further experiments even demonstrate that the adversarial training models exhibit a certain degree of robustness against perturbations from other patterns.
https://arxiv.org/abs/2504.06358
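For readers unfamiliar with adversarial training, the sketch below shows a standard PGD-style loop of the general family applied in the paper; the toy single-character classifier, perturbation budget, and data are placeholders rather than the paper's HMI setup or its custom attack objective.

```python
# Sketch: PGD adversarial example generation plus a mixed clean/adversarial
# training step for a toy character classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_perturb(model, images, labels, eps=8 / 255, alpha=2 / 255, steps=5):
    """Projected gradient ascent on the loss within an L-infinity ball of radius eps."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(images + delta), labels)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (images + delta).detach().clamp(0, 1)

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 62))   # toy single-character classifier
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def adversarial_training_step(images, labels):
    adv_images = pgd_perturb(model, images, labels)
    opt.zero_grad()
    # Train on a mix of clean and adversarial screens to keep clean accuracy.
    loss = F.cross_entropy(model(images), labels) + F.cross_entropy(model(adv_images), labels)
    loss.backward()
    opt.step()
    return loss.item()

print(adversarial_training_step(torch.rand(16, 1, 32, 32), torch.randint(0, 62, (16,))))
```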
Considerable progress has been made in the recent literature to tackle the Algorithm Selection and Parametrization (ASP) problem, which spans multiple meta-learning setups. Yet there is a lack of surveys and comparative evaluations that critically analyze, summarize, and assess the performance of existing methods. In this paper, we provide an overview of the state of the art in this continuously evolving field. The survey sheds light on the motivational reasons for pursuing classifier selection through meta-learning. In this regard, Automated Machine Learning (AutoML) is usually treated as an ASP problem under the umbrella of the democratization of machine learning. Accordingly, AutoML makes machine learning techniques accessible to domain scientists who are interested in applying advanced analytics but lack the required expertise. It can ease the task of manually selecting ML algorithms and tuning related hyperparameters. We comprehensively discuss the different phases of classifier selection based on a generic framework formed as an outcome of reviewing prior works. Subsequently, we propose a benchmark knowledge base of 4 million previously learned models and present extensive comparative evaluations of the prominent methods for classifier selection based on 8 classification algorithms and 400 benchmark datasets. The comparative study quantitatively assesses the performance of algorithm selection methods while emphasizing the strengths and limitations of existing studies.
https://arxiv.org/abs/2504.06207
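The classic meta-learning recipe for algorithm selection surveyed above can be sketched in a few lines: describe each dataset with meta-features, then recommend the algorithm that performed best on the most similar previously seen datasets. The meta-features and performance table below are made up for illustration.

```python
# Sketch: nearest-neighbor algorithm recommendation over dataset meta-features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows: previously seen datasets; columns: simple meta-features
# (log #instances, log #features, class entropy) -- all hypothetical.
meta_features = np.array([[8.5, 3.2, 0.9],
                          [6.1, 5.0, 0.4],
                          [9.9, 2.1, 1.0],
                          [7.3, 4.4, 0.7]])
best_algorithm = np.array(["random_forest", "svm", "gradient_boosting", "random_forest"])

knn = NearestNeighbors(n_neighbors=2).fit(meta_features)

def recommend(new_meta_features):
    _, idx = knn.kneighbors([new_meta_features])
    votes = best_algorithm[idx[0]]
    # Majority vote among the nearest datasets.
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

print(recommend([8.0, 3.0, 0.8]))
```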
Optical Character Recognition (OCR) is essential in applications such as document processing, license plate recognition, and intelligent surveillance. However, existing OCR models often underperform in real-world scenarios due to irregular text layouts, poor image quality, character variability, and high computational costs. This paper introduces SDA-Net (Stroke-Sensitive Attention and Dynamic Context Encoding Network), a lightweight and efficient architecture designed for robust single-character recognition. SDA-Net incorporates: (1) a Dual Attention Mechanism to enhance stroke-level and spatial feature extraction; (2) a Dynamic Context Encoding module that adaptively refines semantic information using a learnable gating mechanism; (3) a U-Net-inspired Feature Fusion Strategy for combining low-level and high-level features; and (4) a highly optimized lightweight backbone that reduces memory and computational demands. Experimental results show that SDA-Net achieves state-of-the-art accuracy on challenging OCR benchmarks, with significantly faster inference, making it well-suited for deployment in real-time and edge-based OCR systems.
https://arxiv.org/abs/2504.05770
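The learnable gating idea behind the dynamic context encoding module can be illustrated with a sigmoid gate that decides, per channel, how much contextual information to mix into local stroke-level features. The module below is a simplified sketch with illustrative dimensions, not the exact SDA-Net design.

```python
# Sketch: gated fusion of local (stroke-level) and contextual features.
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, local_feat, context_feat):        # both: (batch, channels)
        g = self.gate(torch.cat([local_feat, context_feat], dim=-1))
        return g * context_feat + (1 - g) * local_feat  # adaptive mixture

fusion = GatedContextFusion()
local = torch.randn(8, 64)      # hypothetical stroke-level features
context = torch.randn(8, 64)    # hypothetical global context features
print(fusion(local, context).shape)
```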