The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
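A minimal PyTorch sketch of the two cross-attention paths described above; the module names, hidden size, pooling stride, and how the outputs are passed to the LLM are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCrossAttention(nn.Module):
    """Sketch of pooled-visual-to-visual and text-to-visual cross-attention (dims assumed)."""
    def __init__(self, dim=1024, heads=8, pool_stride=4):
        super().__init__()
        self.pool_stride = pool_stride
        # visual-to-visual: pooled visual tokens query the original visual tokens
        self.v2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # text-to-visual: text tokens query the original visual tokens
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_v, D); text_tokens: (B, N_t, D)
        # 1) aggressive token reduction via pooling over the token axis
        pooled = F.avg_pool1d(visual_tokens.transpose(1, 2),
                              kernel_size=self.pool_stride).transpose(1, 2)
        # 2) pooled tokens act as queries against the full-resolution visual set
        pooled = pooled + self.v2v(query=pooled, key=visual_tokens, value=visual_tokens)[0]
        # 3) text tokens are enriched by attending to the original visual tokens
        text_tokens = text_tokens + self.t2v(query=text_tokens, key=visual_tokens,
                                             value=visual_tokens)[0]
        # the LLM then consumes the short pooled visual sequence plus the enriched text
        return torch.cat([pooled, text_tokens], dim=1)

out = DualCrossAttention()(torch.randn(1, 256, 1024), torch.randn(1, 32, 1024))
```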
https://arxiv.org/abs/2505.17020
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
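A small sketch of the group position encoding idea: each source or target segment keeps its own relative position ordering rather than one global absolute order. The grouping rule and the use of a separate group index are assumptions, not the paper's exact scheme:

```python
def group_position_ids(segments):
    """Assign position IDs per group so each source/target segment keeps its own
    0..len-1 relative ordering instead of one global absolute order.
    `segments` is a list of (role, length) pairs, e.g. [("src", 7), ("tgt", 3)].
    Illustrative sketch only; the paper's exact grouping rule may differ."""
    position_ids, group_ids = [], []
    for group_id, (_, length) in enumerate(segments):
        position_ids.extend(range(length))      # relative positions restart per segment
        group_ids.extend([group_id] * length)   # group index disambiguates segments
    return position_ids, group_ids

# Streaming arrival of a new source chunk does not shift positions already assigned,
# so previously generated outputs need no re-encoding.
print(group_position_ids([("src", 4), ("tgt", 2), ("src", 3)]))
```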
https://arxiv.org/abs/2505.16983
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), which leverages diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporally regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM over the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
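A hedged sketch of the hierarchical idea only: pose-aware spatial attention within each frame followed by temporal attention along each spatial location's trajectory. Tensor shapes, module composition, and the fused pose representation are assumptions and do not reflect the authors' code:

```python
import torch
import torch.nn as nn

class HierarchicalPoseAttention(nn.Module):
    """Intra-frame human-garment pose attention, then cross-frame temporal attention."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat, pose):
        # feat, pose: (B, T, N, D) video features and fused human+garment pose features
        B, T, N, D = feat.shape
        # spatial: within each frame, features attend to that frame's pose tokens
        f = feat.reshape(B * T, N, D)
        p = pose.reshape(B * T, N, D)
        f = f + self.spatial(query=f, key=p, value=p)[0]
        # temporal: each spatial location attends over its own trajectory across frames
        f = f.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        f = f + self.temporal(query=f, key=f, value=f)[0]
        return f.reshape(B, N, T, D).permute(0, 2, 1, 3)

out = HierarchicalPoseAttention()(torch.randn(1, 4, 64, 320), torch.randn(1, 4, 64, 320))
```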
https://arxiv.org/abs/2505.16980
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is recognized as a global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, leading to a regional attention prior that alleviates the object repetition issue. The estimated regional prompts containing rich descriptive details further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
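Two hedged sketches of the priors described above: a low-pass filter of the low-resolution image as the global structure prior, and a masked cross-attention that screens which global-prompt tokens each region may attend to. The cutoff, masking rule, and tensor layouts are assumptions:

```python
import torch

def low_frequency_prior(image, cutoff=0.25):
    """Keep only low spatial frequencies of the low-res image as a global structure prior."""
    spec = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    H, W = image.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = (((yy - H / 2) / H) ** 2 + ((xx - W / 2) / W) ** 2).sqrt()
    mask = (radius < cutoff).float()
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

def screened_cross_attention(region_q, prompt_kv, keep):
    """Cross-attention from one region's latents to the global prompt, with prompt
    tokens judged irrelevant to this region masked out (`keep`: boolean token mask)."""
    scores = region_q @ prompt_kv.transpose(-2, -1) / region_q.shape[-1] ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return scores.softmax(dim=-1) @ prompt_kv
```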
https://arxiv.org/abs/2505.16976
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual rationale for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.
https://arxiv.org/abs/2505.16974
Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
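A hedged sketch of the three-stage filtering logic: per-sample attention entropy, a "sensitive" layer chosen by how strongly entropies split into two modes, and unsupervised clustering that flags the low-entropy (attention-collapsed) cluster. Shapes, the averaging of attention maps, and the use of scikit-learn KMeans are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def attention_entropy(attn_maps):
    """attn_maps: (num_samples, num_layers, num_tokens) attention over visual tokens,
    e.g. averaged over heads and text queries. Low entropy = concentrated attention."""
    p = attn_maps / attn_maps.sum(-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(-1)            # (num_samples, num_layers)

def filter_backdoor_samples(attn_maps):
    ent = attention_entropy(attn_maps)
    # profile layers: pick the one whose entropies separate into two modes most strongly
    gaps = []
    for l in range(ent.shape[1]):
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(ent[:, l:l + 1])
        gaps.append(abs(ent[labels == 0, l].mean() - ent[labels == 1, l].mean()))
    layer = int(np.argmax(gaps))
    # unsupervised clustering on the sensitive layer; the low-entropy cluster is flagged
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(ent[:, layer:layer + 1])
    low_cluster = int(ent[labels == 0, layer].mean() > ent[labels == 1, layer].mean())
    return labels == low_cluster                        # boolean mask of suspicious samples
```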
https://arxiv.org/abs/2505.16916
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
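A small sketch of the covert drift metrics named above (cosine, entropy, JS, Spearman) between a baseline round and a later titration round, using SciPy; how the hidden states and attention maps are pooled per round is an assumption:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr, entropy

def drift_metrics(hidden_0, hidden_t, attn_0, attn_t):
    """hidden_*: (d,) pooled hidden states; attn_*: (n,) attention distributions
    for the same question at round 0 and round t."""
    cos_drift = 1 - np.dot(hidden_0, hidden_t) / (
        np.linalg.norm(hidden_0) * np.linalg.norm(hidden_t) + 1e-12)
    entropy_drift = abs(entropy(attn_t) - entropy(attn_0))
    js_drift = jensenshannon(attn_0, attn_t) ** 2       # squared distance = JS divergence
    spearman_drift, _ = spearmanr(attn_0, attn_t)       # rank agreement of attention
    return cos_drift, entropy_drift, js_drift, spearman_drift
```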
https://arxiv.org/abs/2505.16894
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL
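A hedged sketch of the two ingredients: ordering 3D latent tokens along a space-filling curve (a Morton/Z-order curve here; Jenga's actual curve may differ) so nearby tokens fall into the same block, and keeping only the top-k key blocks per query block. Block size, the selection score, and the progressive-resolution schedule are assumptions:

```python
import torch

def morton_order(T, H, W, bits=6):
    """Token indices of a (T, H, W) latent grid listed along a 3D Z-order curve."""
    t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W),
                             indexing="ij")
    code = torch.zeros(T, H, W, dtype=torch.long)
    for b in range(bits):                        # interleave the bits of t, h, w
        code |= ((t >> b) & 1) << (3 * b + 2)
        code |= ((h >> b) & 1) << (3 * b + 1)
        code |= ((w >> b) & 1) << (3 * b)
    return code.flatten().argsort()              # token indices in curve order

def select_key_blocks(q, k, block=64, keep=4):
    """For each query block, keep the `keep` key blocks with the highest mean similarity;
    q, k: (num_tokens, dim) already permuted into curve order."""
    nq, nk = q.shape[0] // block, k.shape[0] // block
    qb = q[: nq * block].reshape(nq, block, -1).mean(1)
    kb = k[: nk * block].reshape(nk, block, -1).mean(1)
    return (qb @ kb.T).topk(keep, dim=-1).indices    # (num_query_blocks, keep)
```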
https://arxiv.org/abs/2505.16864
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs a one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration count, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
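A minimal sketch of the two-phase loss schedule: holistic alignment (attention-map plus feature distillation terms) is added to the denoising loss until a fixed-iteration trigger, then switched off for good. The loss weights and stop step are assumptions:

```python
def haste_loss(step, denoise_loss, attn_align_loss, feat_align_loss,
               stop_step=200_000, lam_attn=1.0, lam_feat=0.5):
    """Phase I: denoising + holistic alignment distilled from the teacher into
    mid-level DiT layers.  Phase II (one-shot termination): alignment dropped,
    the DiT trains on denoising alone.  Weights and stop_step are assumed values."""
    if step < stop_step:                          # Phase I
        return denoise_loss + lam_attn * attn_align_loss + lam_feat * feat_align_loss
    return denoise_loss                           # Phase II
```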
https://arxiv.org/abs/2505.16792
In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method, which includes several additional processes. In addition, an extra stage is introduced that facilitates the cooperation and ensembling of periodic results, achieving impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field.
https://arxiv.org/abs/2505.16784
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features via a Differential Attention (DA) mechanism, while TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
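A hedged sketch of how the DTCL combination could be wired up: cross-modal retrieval, classification, and EMA-consistency terms under an adaptive weighting. The uncertainty-style learnable weights and the cosine EMA consistency are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTaskConsistencyLoss(nn.Module):
    """Adaptively weighted sum of cross-modal, classification and EMA-consistency terms."""
    def __init__(self):
        super().__init__()
        self.log_w = nn.Parameter(torch.zeros(3))    # one learnable weight per term

    def forward(self, retrieval_loss, cls_loss, img_feat, txt_feat,
                ema_img_feat, ema_txt_feat):
        # consistency with exponential-moving-average features of both modalities
        ema_loss = (1 - F.cosine_similarity(img_feat, ema_img_feat, dim=-1)).mean() \
                 + (1 - F.cosine_similarity(txt_feat, ema_txt_feat, dim=-1)).mean()
        losses = torch.stack([retrieval_loss, cls_loss, ema_loss])
        w = torch.exp(-self.log_w)
        return (w * losses).sum() + self.log_w.sum()  # uncertainty-style weighting
```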
https://arxiv.org/abs/2505.16756
We introduce the Dual-Flow Generative Ranking Network (DFGR), a two-stream architecture designed for recommendation systems. DFGR integrates innovative interaction patterns between real and fake flows within the QKV modules of the self-attention mechanism, enhancing both training and inference efficiency. This approach effectively addresses a key limitation observed in Meta's proposed HSTU generative recommendation approach, where heterogeneous information volumes are mapped into identical vector spaces, leading to training instability. Unlike traditional recommendation models, DFGR relies only on user history behavior sequences and minimal attribute information, eliminating the need for extensive manual feature engineering. Comprehensive evaluations on open-source and industrial datasets reveal DFGR's superior performance compared to established baselines such as DIN, DCN, DIEN, and DeepFM. We also investigate optimal parameter allocation strategies under computational constraints, establishing DFGR as an efficient and effective next-generation generative ranking paradigm.
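One possible reading of the real/fake-flow interaction, sketched with heavy caveats: real (observed history) tokens attend among themselves, while fake (candidate) tokens only query into the real flow and are never attended to, so candidates can be scored in parallel without contaminating the history encoding. This interpretation, the shared attention module, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DualFlowAttention(nn.Module):
    """Real flow: user history tokens (self-attention among themselves).
    Fake flow: candidate tokens that query the real flow but are never attended to."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, real_tokens, fake_tokens):
        # real_tokens: (B, L, D) history behaviors; fake_tokens: (B, C, D) candidates
        real_ctx, _ = self.attn(real_tokens, real_tokens, real_tokens)
        fake_ctx, _ = self.attn(fake_tokens, real_tokens, real_tokens)
        return real_ctx, fake_ctx

r, f = DualFlowAttention()(torch.randn(2, 50, 128), torch.randn(2, 10, 128))
```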
https://arxiv.org/abs/2505.16752
Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes a RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside original capacity, voltage and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.
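A hedged NumPy/pandas sketch of the preprocessing stage only: an illustrative derived-capacity signal from the current and capacity curves, rolling-statistic denoising, and delta features against the previous cycle. Column names, the exact derivation, and the statistics chosen are assumptions:

```python
import numpy as np
import pandas as pd

def build_cycle_features(cycles):
    """cycles: list of DataFrames, one per charge-discharge cycle, with
    'current', 'voltage', 'capacity' columns sampled over time (names assumed)."""
    rows = []
    for df in cycles:
        # illustrative derived-capacity signal computed from current and capacity
        derived = df["capacity"].to_numpy() / (np.abs(df["current"].to_numpy()) + 1e-6)
        rows.append({
            "capacity_end": df["capacity"].rolling(5, min_periods=1).mean().iloc[-1],
            "voltage_std": df["voltage"].std(),
            "current_mean": df["current"].mean(),
            "derived_capacity_mean": float(np.mean(derived)),
        })
    out = pd.DataFrame(rows)
    # delta features: difference between the current and the previous cycle
    deltas = out.diff().add_prefix("delta_").fillna(0.0)
    return pd.concat([out, deltas], axis=1)   # fed to the CNN / A-LSTM / ODE-LSTM model
```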
https://arxiv.org/abs/2505.16664
Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on this https URL.
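A minimal sketch of the hysteresis-like switching of the spatial loss during per-band fine-tuning: the spatial term is switched off while the spectral loss is above an upper threshold and only switched back on once it falls below a lower one. The thresholds and weighting are assumptions:

```python
def hysteresis_spatial_weight(spectral_loss, spatial_on, low=0.02, high=0.04):
    """Turn the spatial loss off while the spectral loss is too high, and only turn
    it back on once the spectral loss has dropped below the lower threshold, so the
    per-band fine-tuning converges quickly to the desired spectral fidelity."""
    if spatial_on and spectral_loss > high:
        spatial_on = False
    elif not spatial_on and spectral_loss < low:
        spatial_on = True
    return spatial_on

# inside the per-band fine-tuning loop (sketch):
# spatial_on = hysteresis_spatial_weight(spec_loss.item(), spatial_on)
# loss = spec_loss + (lambda_spatial * spat_loss if spatial_on else 0.0)
```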
https://arxiv.org/abs/2505.16658
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
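A loose, heavily hedged sketch of what a modified causal mask in this spirit could look like: a few "register" positions stay attendable inside the otherwise-blocked upper triangle to soak up attention that would otherwise land on outlier tokens, and past positions receive a soft distance penalty whose rate diminishes with distance so far-preceding tokens (e.g. earlier video frames) remain reachable. The allocation rule, penalty shape, and constants are assumptions, not FarSight's actual design:

```python
import torch

def farsight_style_mask(seq_len, num_registers=4, decay=0.05):
    """Additive attention mask (0 = attend, -inf = blocked); sketch only."""
    i = torch.arange(seq_len).unsqueeze(1)           # query index
    j = torch.arange(seq_len).unsqueeze(0)           # key index
    mask = torch.zeros(seq_len, seq_len)
    mask[j > i] = float("-inf")                      # standard causal blocking
    mask[:, :num_registers] = 0.0                    # register columns stay visible
    dist = (i - j).clamp(min=0).float()
    mask = mask - decay * torch.log1p(dist)          # sub-linear (diminishing) penalty
    return mask                                      # added to the attention logits
```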
https://arxiv.org/abs/2505.16652
Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.
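A small sketch of the rule-based, template-style caption generation from object tracks: lane position and relative motion are extracted from a track and slotted into a template. The thresholds, coordinate convention, and wording are assumptions:

```python
def caption_from_track(track, ego_speed):
    """track: list of dicts with 'x' (longitudinal, m), 'y' (lateral, m) and 'speed'
    (m/s) per frame for one tracked object, in the ego vehicle's coordinate frame."""
    first, last = track[0], track[-1]
    lane = "in the ego lane" if abs(last["y"]) < 1.8 else \
           ("in the left lane" if last["y"] > 0 else "in the right lane")
    rel_speed = last["speed"] - ego_speed
    if rel_speed > 1.0:
        motion = "pulling away from the ego vehicle"
    elif rel_speed < -1.0:
        motion = "closing in on the ego vehicle"
    else:
        motion = "keeping a roughly constant distance"
    direction = "moves closer" if last["x"] < first["x"] else "falls back"
    return f"A vehicle {lane} {direction} over the clip and is {motion}."

print(caption_from_track([{"x": 30, "y": 0.4, "speed": 12},
                          {"x": 22, "y": 0.3, "speed": 10}], ego_speed=13))
```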
https://arxiv.org/abs/2505.16594
We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
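A simplistic sketch of how the warped right view and disocclusion mask can be produced from the left view and a disparity map via per-pixel forward warping; this is not the authors' reprojection pipeline, and the row-wise loop is for clarity rather than speed:

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """left: (H, W, 3) image; disparity: (H, W) pixel shifts derived from depth.
    Returns the forward-warped right view and a disocclusion mask marking pixels
    that no source pixel landed on (these must be inpainted by the model)."""
    H, W, _ = left.shape
    warped = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        # for a right-eye view, pixels shift left by their disparity
        target_x = np.round(xs - disparity[y]).astype(int)
        valid = (target_x >= 0) & (target_x < W)
        # process far-to-near so nearer pixels overwrite farther ones
        order = np.argsort(disparity[y])
        for x in order[valid[order]]:
            warped[y, target_x[x]] = left[y, x]
            filled[y, target_x[x]] = True
    return warped, ~filled
```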
https://arxiv.org/abs/2505.16565
We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.
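A hedged sketch of the tri-plane lookup and the explicit deformation step: points at time t are warped into canonical space by an offset field and then queried against the three orthogonal feature planes. Plane resolutions, feature aggregation by summation, and the `offset_field` callable are assumptions:

```python
import torch
import torch.nn.functional as F

def canonicalize(pts_t, offset_field):
    """Explicit deformation: warp time-t points into canonical space, no motion MLP.
    `offset_field` is an assumed callable, e.g. a grid-based offset lookup."""
    return pts_t + offset_field(pts_t)

def sample_triplane(planes, pts):
    """planes: dict of (1, C, R, R) feature maps for 'xy', 'xz', 'yz'.
    pts: (N, 3) canonical-space points in [-1, 1]^3.
    Returns per-point features summed over the three planes."""
    feats = 0
    for name, idx in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        coords = pts[:, idx].reshape(1, -1, 1, 2)                 # (1, N, 1, 2)
        sampled = F.grid_sample(planes[name], coords,
                                mode="bilinear", align_corners=True)
        feats = feats + sampled.squeeze(-1).squeeze(0).t()        # (N, C)
    return feats                                                  # decoded by the SH head
```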
https://arxiv.org/abs/2505.16535
While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences over the trade-off between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.
https://arxiv.org/abs/2505.16495
Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, self-attention incurs quadratic complexity, $\mathcal{O}(n^2)$, and the time cost is high when the input image is split at a fine granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, so some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity is reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is the number of anchors and $m < n$. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these distributions and approximate global self-attention through a Markov process. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to 9.0% higher accuracy or a 46.7% FLOPs reduction on ImageNet classification, and 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
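A compact sketch of the anchor-based attention idea: $m$ learnable anchors gather from all $n$ tokens, and the tokens read back from the anchors, so the composed token-to-token mixing is a two-step (Markov-like) approximation of global attention in $\mathcal{O}(mn)$. Dimensions, the single-head formulation, and the missing Q/K/V projections are simplifications:

```python
import torch
import torch.nn as nn

class AnchorAttention(nn.Module):
    """Global self-attention approximated through m learnable anchors (m << n)."""
    def __init__(self, dim=384, num_anchors=32):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, n, D)
        a = self.anchors.unsqueeze(0).expand(x.shape[0], -1, -1)   # (B, m, D)
        # step 1: anchors attend over all tokens  -> O(m * n)
        anchor_ctx = torch.softmax(a @ x.transpose(1, 2) * self.scale, -1) @ x
        # step 2: tokens attend over the anchors  -> O(n * m)
        return torch.softmax(x @ anchor_ctx.transpose(1, 2) * self.scale, -1) @ anchor_ctx

y = AnchorAttention()(torch.randn(2, 196, 384))
```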
https://arxiv.org/abs/2505.16463