Visual generation models have made remarkable progress in creating realistic images from text prompts, yet they struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state of the art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
https://arxiv.org/abs/2505.17022
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
https://arxiv.org/abs/2505.17019
Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN(R) are designed to automate the mineralogical mapping process, but also suffer from limitations such as high monetary cost and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images using the corresponding QEMSCAN maps as targets, an approach not widely explored. The model was instructed to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores, and the remaining mineral phases as a single class named "Others", and was validated on rock facies both seen and unseen during training, in order to assess its generalization capability. Since the images and maps are provided at different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation depends strongly on these resolution differences and on the variety of learnable rock textures. Nevertheless, it shows promising results, especially with regard to the proper delineation of mineral boundaries on solid textures and the precise estimation of mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with a coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen facies.
https://arxiv.org/abs/2505.17008
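As a concrete illustration of the reported R^2 evaluation, the sketch below (an assumption-laden Python example, not code from the paper) computes per-class area fractions from a co-registered QEMSCAN target map and a predicted segmentation map, then measures how linearly the two distributions agree; the class ids and the random placeholder maps are purely illustrative.

```python
import numpy as np

# Hypothetical class ids for the six segmentation targets described above.
CLASSES = {0: "Calcite", 1: "Dolomite", 2: "Mg-Clay", 3: "Quartz", 4: "Pores", 5: "Others"}

def mineral_fractions(label_map: np.ndarray) -> np.ndarray:
    """Area fraction of each class in an (H, W) integer label map."""
    counts = np.bincount(label_map.ravel(), minlength=len(CLASSES))
    return counts / counts.sum()

def r2_score(expected: np.ndarray, predicted: np.ndarray) -> float:
    """Coefficient of determination between expected and predicted fractions."""
    ss_res = np.sum((expected - predicted) ** 2)
    ss_tot = np.sum((expected - expected.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# qemscan_map / unet_map stand in for co-registered (H, W) label maps of one thin section.
rng = np.random.default_rng(0)
qemscan_map = rng.integers(0, 6, size=(512, 512))   # placeholder for the QEMSCAN target
unet_map = rng.integers(0, 6, size=(512, 512))      # placeholder for the U-Net prediction
print(r2_score(mineral_fractions(qemscan_map), mineral_fractions(unet_map)))
```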
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause and effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge such as knowledge graphs (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
https://arxiv.org/abs/2505.16973
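To make the joint extract-and-verify setup more tangible, here is a hedged sketch of how a response-level factuality score could be derived from the model's output and compared with the slower pipeline at the example level; the tab-separated output format, the label names ("supported"/"unsupported"/"unverifiable"), and the placeholder scores are assumptions, not the released model's actual interface.

```python
from statistics import mean
from scipy.stats import pearsonr

def parse_claims(model_output: str):
    """Parse 'claim<TAB>label' lines emitted by the extract-and-verify model (assumed format)."""
    pairs = []
    for line in model_output.strip().splitlines():
        claim, label = line.rsplit("\t", 1)
        pairs.append((claim, label.strip().lower()))
    return pairs

def factuality_score(pairs):
    """Fraction of verifiable claims judged 'supported' against the retrieved evidence."""
    verifiable = [label for _, label in pairs if label != "unverifiable"]
    if not verifiable:
        return None
    return mean(1.0 if label == "supported" else 0.0 for label in verifiable)

example = "Paris is the capital of France.\tsupported\nThe Eiffel Tower was built in 1820.\tunsupported"
print(factuality_score(parse_claims(example)))   # -> 0.5

# Example-level agreement with the original pipeline (the paper reports r = 0.80).
fast_scores = [0.90, 0.60, 0.75]   # VeriFastScore per response (placeholder values)
slow_scores = [0.85, 0.55, 0.80]   # VeriScore per response (placeholder values)
r, _ = pearsonr(fast_scores, slow_scores)
print(r)
```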
Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is slower in practice and therefore prior methods typically process images at reduced resolutions, missing fine-grained details. To address this, we propose a more efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 90% while maintaining low memory usage, and performs on par with the default implementation with up to 95% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 50% savings for the total end-to-end model inference in memory-constrained environments. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an additional inference-time modification of the recent SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and efficiency.
https://arxiv.org/abs/2505.16942
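For readers unfamiliar with the operator being optimized, the following PyTorch sketch reproduces the vanilla RAFT-style all-pairs correlation volume and the local-window lookup that the paper re-implements more efficiently; it shows the mathematical operator only and says nothing about the paper's memory-efficient kernel.

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """All-pairs correlation between two feature maps (B, D, H, W) -> (B, H*W, H, W)."""
    B, D, H, W = f1.shape
    corr = torch.einsum("bdij,bdkl->bijkl", f1, f2) / D ** 0.5
    return corr.reshape(B, H * W, H, W)

def lookup_local_window(corr: torch.Tensor, coords: torch.Tensor, radius: int = 4) -> torch.Tensor:
    """Bilinearly sample a (2r+1)^2 cost window around per-pixel target coords (B, H, W, 2)."""
    B, N, H, W = corr.shape
    corr = corr.reshape(B * N, 1, H, W)
    dx = torch.arange(-radius, radius + 1, dtype=coords.dtype)
    delta = torch.stack(torch.meshgrid(dx, dx, indexing="ij"), dim=-1).reshape(1, -1, 1, 2)
    centers = coords.reshape(B * N, 1, 1, 2) + delta                 # window positions per pixel
    scale = torch.tensor([W - 1, H - 1], dtype=coords.dtype)
    grid = 2 * centers / scale - 1                                   # normalize to [-1, 1] for grid_sample
    window = F.grid_sample(corr, grid, align_corners=True)
    return window.reshape(B, H, W, -1)

f1, f2 = torch.randn(1, 256, 46, 62), torch.randn(1, 256, 46, 62)
corr = all_pairs_correlation(f1, f2)                                 # quadratic in the number of pixels
coords = torch.stack(torch.meshgrid(torch.arange(62.), torch.arange(46.), indexing="xy"), dim=-1)[None]
cost = lookup_local_window(corr, coords)                             # (1, 46, 62, 81)
```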
While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with the long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic ability to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation of 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, and all models show progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.
https://arxiv.org/abs/2505.16915
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
https://arxiv.org/abs/2505.16894
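The covert-dynamics metrics named above can be illustrated with a short sketch: given a hidden-state vector and an attention distribution from a baseline round and from a later injection round, it computes the cosine, entropy, JS, and Spearman drifts. The layer choice, pooling, and placeholder inputs are assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import entropy, spearmanr
from scipy.spatial.distance import jensenshannon

def drift_metrics(h_base, h_round, attn_base, attn_round):
    """Drift statistics between a baseline round and a later context-injection round.
    h_*: hidden-state vectors; attn_*: attention weights normalized to a distribution."""
    cosine_drift = 1.0 - np.dot(h_base, h_round) / (np.linalg.norm(h_base) * np.linalg.norm(h_round))
    entropy_drift = entropy(attn_round) - entropy(attn_base)   # change in attention entropy
    js_drift = jensenshannon(attn_base, attn_round) ** 2       # JS divergence (scipy returns the distance)
    spearman_drift, _ = spearmanr(attn_base, attn_round)       # rank agreement of attention weights
    return cosine_drift, entropy_drift, js_drift, spearman_drift

rng = np.random.default_rng(0)
h_base, h_round = rng.normal(size=4096), rng.normal(size=4096)                      # placeholder hidden states
attn_base, attn_round = rng.dirichlet(np.ones(128)), rng.dirichlet(np.ones(128))    # placeholder attention rows
print(drift_metrics(h_base, h_round, attn_base, attn_round))
```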
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256x256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28x reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL .
https://arxiv.org/abs/2505.16792
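The two-phase schedule amounts to a single conditional inside the training step: alignment terms are added to the denoising loss until a fixed iteration, then dropped in one shot. The sketch below assumes `dit`, `teacher`, and `proj` are callables returning the shapes implied by the comments; the loss forms, weights, and termination step are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

TERMINATION_STEP = 50_000             # the "simple trigger": a fixed iteration (assumed value)
LAMBDA_FEAT, LAMBDA_ATTN = 0.5, 0.5   # assumed weights for the alignment terms

def train_step(dit, teacher, proj, batch, step, optimizer):
    noisy, target, cond = batch
    pred, feats, attn = dit(noisy, cond)                 # prediction plus mid-level features / attention maps
    loss = F.mse_loss(pred, target)                      # standard denoising objective

    if step < TERMINATION_STEP:                          # Phase I: holistic alignment
        with torch.no_grad():
            t_feats, t_attn = teacher(noisy)             # non-generative teacher (e.g. DINO)
        feat_loss = 1 - F.cosine_similarity(proj(feats), t_feats, dim=-1).mean()   # semantic anchors
        attn_loss = F.mse_loss(attn, t_attn)                                       # relational priors
        loss = loss + LAMBDA_FEAT * feat_loss + LAMBDA_ATTN * attn_loss
    # Phase II (step >= TERMINATION_STEP): alignment deactivated, pure denoising.

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```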
As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at this https URL.
https://arxiv.org/abs/2505.16789
In group decision-making (GDM) scenarios, uncertainty, dynamic social structures, and vague information present major challenges for traditional opinion dynamics models. To address these issues, this study proposes a novel social network group decision-making (SNGDM) framework that integrates three-way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation. First, the 3WD mechanism is introduced to explicitly model hesitation and ambiguity in agent judgments, thereby preventing irrational decisions. Second, a connection adjustment rule based on opinion similarity is developed, enabling agents to adaptively update their communication links and better reflect the evolving nature of social relationships. Third, linguistic terms are used to describe agent opinions, allowing the model to handle subjective, vague, or incomplete information more effectively. Finally, an integrated multi-agent decision-making framework is constructed, which simultaneously considers individual uncertainty, opinion evolution, and network dynamics. The proposed model is applied to a multi-UAV cooperative decision-making scenario, where simulation results and consensus analysis demonstrate its effectiveness. Experimental comparisons further verify the advantages of the algorithm in enhancing system stability and representing realistic decision-making behaviors.
https://arxiv.org/abs/2505.16781
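A toy sketch of the opinion-similarity-based connection adjustment rule is given below: linguistic terms are mapped to a numeric scale, links between strongly disagreeing agents are severed, and links between very similar agents are created. All thresholds, the term scale, and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
LINGUISTIC_TERMS = np.linspace(0.0, 1.0, 7)      # e.g. "very poor" ... "very good" mapped onto [0, 1]

n_agents = 20
opinions = rng.choice(LINGUISTIC_TERMS, size=n_agents)
adjacency = rng.random((n_agents, n_agents)) < 0.2
np.fill_diagonal(adjacency, False)

KEEP_THRESHOLD, LINK_THRESHOLD = 0.35, 0.1        # assumed similarity cut-offs

def adjust_connections(adjacency, opinions):
    """Drop links between dissimilar agents, add links between very similar ones."""
    diff = np.abs(opinions[:, None] - opinions[None, :])
    adjacency = adjacency & (diff <= KEEP_THRESHOLD)    # sever links across large disagreement
    adjacency = adjacency | ((diff <= LINK_THRESHOLD) & ~np.eye(len(opinions), dtype=bool))
    return adjacency

adjacency = adjust_connections(adjacency, opinions)
print(adjacency.sum(), "directed links after adjustment")
```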
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching (URM). Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain and the newly introduced domain generalization settings.
https://arxiv.org/abs/2505.16778
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
https://arxiv.org/abs/2505.16756
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges from the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model with the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing both the potential and the challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
https://arxiv.org/abs/2505.16663
Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and an initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and a Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
https://arxiv.org/abs/2505.16602
Stock price prediction is a critical area of financial forecasting, traditionally approached by training models using the historical price data of individual stocks. While these models effectively capture single-stock patterns, they fail to leverage potential correlations among stock trends, which could improve predictive performance. Current single-stock learning methods are thus limited in their ability to provide a broader understanding of price dynamics across multiple stocks. To address this, we propose a novel method that merges local patterns into a global understanding through cross-stock pattern integration. Our strategy is inspired by Federated Learning (FL), a paradigm designed for decentralized model training. FL enables collaborative learning across distributed datasets without sharing raw data, facilitating the aggregation of global insights while preserving data privacy. In our adaptation, we train models on individual stock data and iteratively merge them to create a unified global model. This global model is subsequently fine-tuned on specific stock data to retain local relevance. The proposed strategy enables parallel training of individual stock models, facilitating efficient utilization of computational resources and reducing overall training time. We conducted extensive experiments to evaluate the proposed method, demonstrating that it outperforms benchmark models and enhances the predictive capabilities of state-of-the-art approaches. Our results highlight the efficacy of Cross-Stock Trend Integration (CSTI) in advancing stock price prediction, offering a robust alternative to traditional single-stock learning methodologies.
https://arxiv.org/abs/2505.16573
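The cross-stock integration strategy is essentially FedAvg-style parameter merging followed by per-stock fine-tuning. The sketch below illustrates that flow with a placeholder GRU forecaster; the architecture, feature dimensions, and number of stocks are assumptions, and the per-stock training and fine-tuning loops are omitted.

```python
import copy
import torch
import torch.nn as nn

class PriceModel(nn.Module):
    """Placeholder per-stock forecaster (the paper's architecture is not specified here)."""
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1])

def merge_models(local_models):
    """Average the parameters of per-stock models into one global model (FedAvg-style)."""
    global_model = copy.deepcopy(local_models[0])
    with torch.no_grad():
        for name, param in global_model.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in local_models])
            param.copy_(stacked.mean(dim=0))
    return global_model

# Train one PriceModel per stock in parallel (omitted), merge, then fine-tune a copy on each stock:
local_models = [PriceModel() for _ in range(5)]
global_model = merge_models(local_models)
fine_tuned = {f"stock_{i}": copy.deepcopy(global_model) for i in range(5)}  # each copy fine-tuned on its own series
```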
While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it is not derived through actual reasoning. When this guess conflicts with its reasoning result, the model tends to engage in reflection, leading to wasted computational resources. Through further interpretability experiments, we find that this behavior is largely driven by the model's excessive attention to the input section, which amplifies the influence of internal bias on its decision-making process. Additionally, by masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53% across different complex reasoning tasks. Notably, in most cases, this approach also leads to improvements in accuracy. These findings demonstrate a causal relationship between internal bias and overthinking.
https://arxiv.org/abs/2505.16448
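A hedged sketch of the input-masking intervention with Hugging Face transformers is shown below: the attention mask is zeroed over the original problem tokens before the continuation is generated, so the reasoning trace can no longer attend to the input section. The model name, the prompt split, and the idea that a plain attention-mask edit reproduces the paper's intervention are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"   # placeholder reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

problem = "If 3x + 5 = 20, what is x?"
prefix = "Let me think step by step."
ids = tok(problem + "\n" + prefix, return_tensors="pt")

# Zero the attention mask over the original problem statement, so subsequent
# reasoning tokens cannot keep re-reading (and over-weighting) the input section.
problem_len = len(tok(problem + "\n")["input_ids"])
attention_mask = torch.ones_like(ids["input_ids"])
attention_mask[:, :problem_len] = 0

out = model.generate(input_ids=ids["input_ids"], attention_mask=attention_mask, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```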
Small object detection in intricate environments has consistently represented a major challenge in the field of object detection. In this paper, we identify that this difficulty stems from the detectors' inability to effectively learn discriminative features for objects of small size, compounded by the complexity of selecting high-quality small object samples during training, which motivates the proposal of the Multi-Clue Assignment and Feature Enhancement R-CNN (MAFE R-CNN). MAFE R-CNN integrates two pivotal strategies. The first is the Multi-Clue Sample Selection (MCSS) strategy, in which the Intersection over Union (IoU) distance, predicted category confidence, and ground-truth region sizes are leveraged as informative clues in the sample selection process. This methodology facilitates the selection of diverse positive samples and ensures a balanced distribution of object sizes during training, thereby promoting effective model learning. The second is the Category-aware Feature Enhancement Mechanism (CFEM), where we propose a simple yet effective category-aware memory module to explore the relationships among object features. Subsequently, we enhance the object feature representation by facilitating the interaction between category-aware features and candidate box features. Experiments conducted on the large-scale small object dataset SODA validate the effectiveness of the proposed method. The code will be made publicly available.
https://arxiv.org/abs/2505.16442
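As an illustration of how the three clues could be fused during sample selection, the sketch below scores candidate boxes by a weighted sum of IoU with the ground truth, predicted category confidence, and a size prior that favors small ground-truth objects; the fusion rule, weights, and top-k selection are assumptions, not the paper's exact assignment procedure.

```python
import numpy as np

def box_iou(boxes, gt):
    """IoU between candidate boxes (N, 4) and one ground-truth box (4,), xyxy format."""
    x1 = np.maximum(boxes[:, 0], gt[0]); y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2]); y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter)

def multi_clue_scores(boxes, cls_conf, gt_box, image_area, w=(1.0, 1.0, 0.5)):
    """Fuse IoU, predicted confidence, and a small-object size prior into one ranking score."""
    iou = box_iou(boxes, gt_box)
    size_prior = 1.0 - (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1]) / image_area  # favor small GT boxes
    return w[0] * iou + w[1] * cls_conf + w[2] * size_prior

boxes = np.array([[10, 10, 30, 30], [12, 12, 28, 28], [50, 50, 80, 80]], dtype=float)
cls_conf = np.array([0.6, 0.8, 0.3])
scores = multi_clue_scores(boxes, cls_conf, gt_box=np.array([11, 11, 29, 29], dtype=float), image_area=640 * 640)
positives = boxes[np.argsort(scores)[::-1][:2]]   # keep the top-k candidates as positive samples
```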
Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration significantly depends on efficient exploitation of temporal correlations among successive video frames. Numerous techniques make use of temporal information via flow-based strategies or recurrent architectures. However, these methods often encounter difficulties in preserving temporal consistency because they rely on degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). The proposed JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network's capability to eliminate flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB compared to state-of-the-art approaches.
https://arxiv.org/abs/2505.16434
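The occlusion-aware temporal loss can be sketched as follows: the previously restored frame is backward-warped with the estimated flow and compared to the current restored frame only where an occlusion mask marks pixels as visible. The L1 form, the binary-mask convention, and the warping details are assumptions about a generic formulation, not JFFRA's exact loss.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp frame (B, C, H, W) with flow (B, 2, H, W); flow channel 0 is dx, channel 1 is dy."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)            # pixel coordinates (2, H, W)
    coords = base.unsqueeze(0) + flow                                       # sampling positions per pixel
    grid = torch.stack((2 * coords[:, 0] / (W - 1) - 1,
                        2 * coords[:, 1] / (H - 1) - 1), dim=-1)            # (B, H, W, 2) in [-1, 1]
    return F.grid_sample(frame, grid, align_corners=True)

def occlusion_aware_temporal_loss(prev_restored, curr_restored, flow, occlusion_mask):
    """L1 difference between the warped previous frame and the current frame,
    counted only on pixels the mask (B, 1, H, W) in {0, 1} marks as non-occluded."""
    warped = warp(prev_restored, flow)
    diff = (warped - curr_restored).abs() * occlusion_mask
    return diff.sum() / occlusion_mask.sum().clamp(min=1.0)
```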
The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.
https://arxiv.org/abs/2505.16425