In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant that directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the standard DiT performs comparably to these specialized models while demonstrating superior parameter efficiency, especially when scaled up. Leveraging a layer-wise parameter-sharing strategy, we further reduce model size by 66% relative to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.
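A minimal PyTorch sketch of the layer-wise parameter-sharing idea mentioned above: one transformer block is instantiated once and reused at every depth, shrinking the weight count roughly by the depth factor. The block design, dimensions, and the way text/noise conditioning enters are illustrative assumptions, not the DiT-Air-Lite architecture.

```python
import torch
import torch.nn as nn

class SharedBlockTransformer(nn.Module):
    """Toy DiT-style stack in which all layers reuse one block's weights.

    Illustrative sketch of layer-wise parameter sharing only; the real
    DiT-Air-Lite block design and conditioning are not reproduced here.
    """

    def __init__(self, dim=256, heads=4, depth=12):
        super().__init__()
        # A single transformer block instantiated once...
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.depth = depth  # ...and applied `depth` times.

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)  # identical weights at every layer
        return x

tokens = torch.randn(2, 64, 256)      # (batch, concatenated text+noise tokens, dim)
shared = SharedBlockTransformer()
unshared_params = 12 * sum(p.numel() for p in shared.block.parameters())
shared_params = sum(p.numel() for p in shared.parameters())
print(shared(tokens).shape, shared_params / unshared_params)  # ~1/12 of the weights
```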
https://arxiv.org/abs/2503.10618
Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.
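The CS-ReFT description lends itself to a small sketch: each skill owns a low-rank orthonormal subspace that edits the hidden state, and a lightweight router composes the per-skill edits. The rank, router form, and editing formula below are assumptions in the spirit of ReFT-style interventions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompositionalSubspaceEdit(nn.Module):
    """Hedged sketch of CS-ReFT-style hidden-state editing.

    Each skill k owns a low-rank subspace R_k; a router mixes the per-skill
    edits h + R_k^T (W_k R_k h + b_k - R_k h). Ranks, router design, and the
    exact editing formula are assumptions, not the released code.
    """

    def __init__(self, hidden=512, rank=8, num_skills=3):
        super().__init__()
        self.R = nn.Parameter(torch.randn(num_skills, rank, hidden) * 0.02)
        self.W = nn.Parameter(torch.randn(num_skills, rank, rank) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_skills, rank))
        self.router = nn.Linear(hidden, num_skills)   # lightweight skill router

    def forward(self, h):                              # h: (batch, hidden)
        # Orthonormalize each skill's subspace basis.
        Q, _ = torch.linalg.qr(self.R.transpose(1, 2))  # (skills, hidden, rank)
        gates = torch.sigmoid(self.router(h))           # (batch, skills)
        out = h
        for k in range(self.R.shape[0]):
            proj = h @ Q[k]                              # coordinates R_k h
            edit = (proj @ self.W[k] + self.b[k] - proj) @ Q[k].T
            out = out + gates[:, k:k + 1] * edit         # compose per-skill edits
        return out

h = torch.randn(4, 512)
print(CompositionalSubspaceEdit()(h).shape)   # torch.Size([4, 512])
```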
https://arxiv.org/abs/2503.10617
Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at this https URL.
https://arxiv.org/abs/2503.10616
Large Language Models have demonstrated remarkable reasoning capability on complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
https://arxiv.org/abs/2503.10615
Style transfer involves transferring the style from a reference image to the content of a target image. Recent advancements in LoRA-based (Low-Rank Adaptation) methods have shown promise in effectively capturing the style of a single image. However, these approaches still face significant challenges such as content inconsistency, style misalignment, and content leakage. In this paper, we comprehensively analyze the limitations of the standard diffusion parameterization, which learns to predict noise, in the context of style transfer. To address these issues, we introduce ConsisLoRA, a LoRA-based method that enhances both content and style consistency by optimizing the LoRA weights to predict the original image rather than noise. We also propose a two-step training strategy that decouples the learning of content and style from the reference image. To effectively capture both the global structure and local details of the content image, we introduce a stepwise loss transition strategy. Additionally, we present an inference guidance method that enables continuous control over content and style strengths during inference. Through both qualitative and quantitative evaluations, our method demonstrates significant improvements in content and style consistency while effectively reducing content leakage.
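The central parameterization change, optimizing the LoRA weights to predict the original image rather than the noise, can be shown in a few lines. The scheduler and loss weighting below are assumptions; only the choice of regression target reflects the abstract.

```python
import torch

def diffusion_loss(model, x0, t, alphas_cumprod, predict_x0=True):
    """Hedged sketch of x0- vs. epsilon-prediction for a diffusion model.

    The noising schedule and uniform loss weighting are assumptions; the
    point illustrated is only the switch of regression target.
    """
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise    # forward diffusion
    pred = model(x_t, t)
    target = x0 if predict_x0 else noise            # x0- vs. epsilon-prediction
    return torch.mean((pred - target) ** 2)

# Toy usage with a stand-in "model" (identity) and a linear schedule.
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
x0 = torch.randn(2, 3, 64, 64)
t = torch.randint(0, 1000, (2,))
print(diffusion_loss(lambda x, t: x, x0, t, alphas_cumprod).item())
```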
https://arxiv.org/abs/2503.10614
Text-to-image models like Stable Diffusion and DALL-E 3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use, in which a sequence of subtasks is addressed by AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimates of tool capabilities and costs, which are needed to determine which tool to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), where a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in terms of both cost and quality, and supports versatile trade-offs based on user preference.
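The A*-over-a-pruned-tool-graph idea can be sketched compactly. The edge score blending tool cost and quality is an assumption consistent with the abstract; the LLM-built subtask tree and the VLM feedback that updates costs after failures are omitted.

```python
import heapq

def costa_star(subgraph, start, goal, cost, quality, alpha=0.5, heuristic=None):
    """Hedged sketch of an A*-style search over a pruned tool subgraph.

    `subgraph[node]` lists candidate (tool, next_node) edges; each edge is
    scored by a blend of tool cost and (1 - quality), the kind of trade-off
    the abstract describes. Scoring details are assumptions.
    """
    heuristic = heuristic or (lambda n: 0.0)
    frontier = [(heuristic(start), 0.0, start, [])]
    best = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for tool, nxt in subgraph.get(node, []):
            edge = alpha * cost[tool] + (1 - alpha) * (1.0 - quality[tool])
            g2 = g + edge
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(frontier, (g2 + heuristic(nxt), g2, nxt, path + [tool]))
    return None, float("inf")

# Toy tool graph: two ways to get from the input image to the edited result.
subgraph = {"input": [("inpaint", "masked"), ("caption+edit", "edited")],
            "masked": [("recolor", "edited")]}
cost = {"inpaint": 0.6, "recolor": 0.2, "caption+edit": 0.9}
quality = {"inpaint": 0.9, "recolor": 0.8, "caption+edit": 0.7}
print(costa_star(subgraph, "input", "edited", cost, quality))
```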
https://arxiv.org/abs/2503.10613
Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at this https URL .
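As a point of reference for the quantity such methods calibrate against, here is a generic epistemic-uncertainty computation from multiple stochastic forward passes (the mutual-information decomposition used by MC-Dropout and deep ensembles); the paper's efficient adaptation for 3D occupancy is not reproduced.

```python
import torch

def epistemic_uncertainty(prob_samples, eps=1e-8):
    """Per-voxel epistemic uncertainty as mutual information.

    prob_samples: (samples, voxels, classes) class probabilities from
    stochastic passes (e.g. MC-Dropout or an ensemble). This is a generic
    reference computation, not the paper's specific estimator.
    """
    mean_p = prob_samples.mean(dim=0)                                  # (voxels, classes)
    total = -(mean_p * (mean_p + eps).log()).sum(-1)                   # predictive entropy
    aleatoric = -(prob_samples * (prob_samples + eps).log()).sum(-1).mean(0)
    return total - aleatoric                                           # epistemic part

# Toy example: 8 stochastic passes over 1000 voxels with 4 semantic classes.
samples = torch.softmax(torch.randn(8, 1000, 4), dim=-1)
u = epistemic_uncertainty(samples)
print(u.shape, u.mean().item())   # higher values flag out-of-distribution voxels
```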
https://arxiv.org/abs/2503.10605
Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, providing comprehensive supervision signals that refine 3DGS representations and enhance rendering robustness under extreme viewpoint changes. Experiments on the Open Waymo Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
https://arxiv.org/abs/2503.10604
Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods, namely insufficient exploitation of modal synergistic effects, sensitivity to noise, and limited fine-grained alignment capability, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTMs to capture, respectively, the macro-evolution patterns of facial expressions and the local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%. Ablation studies further validate the effectiveness of the dual-stage training strategy and the dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.
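The quality-guided fusion idea can be sketched as a softmax over per-modality quality scores that allocates fusion weights differentiably, so a corrupted stream (occlusion, noise) is down-weighted. The quality heads and feature sizes below are assumptions, and the TCN / BiLSTM temporal branches are omitted.

```python
import torch
import torch.nn as nn

class QualityGuidedFusion(nn.Module):
    """Hedged sketch of quality-guided modality fusion.

    Each modality stream predicts a scalar quality score; softmax over the
    scores allocates differentiable fusion weights. This is illustrative,
    not the paper's exact module.
    """

    def __init__(self, dim=256, num_modalities=3):
        super().__init__()
        self.quality_heads = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, feats):                      # list of (batch, dim) features
        scores = torch.cat([h(f) for h, f in zip(self.quality_heads, feats)], dim=-1)
        w = torch.softmax(scores, dim=-1)          # differentiable weight allocation
        fused = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
        return fused, w

vision, audio, text = (torch.randn(4, 256) for _ in range(3))
fused, weights = QualityGuidedFusion()([vision, audio, text])
print(fused.shape, weights[0])                     # per-sample modality weights
```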
https://arxiv.org/abs/2503.10603
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at this https URL.
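A toy version of "learn a truthful direction, then intervene during decoding" looks like the following; the mean-difference direction estimate and the fixed steering strength are simplifications of what the paper actually learns per token.

```python
import torch

def truthful_direction(truthful_h, hallucinated_h):
    """Hedged sketch: estimate a 'truthful direction' in hidden-state space
    from truthful vs. hallucinated activations. The paper's per-token
    detector and intervention schedule are richer than this.
    """
    d = truthful_h.mean(0) - hallucinated_h.mean(0)
    return d / d.norm()

def intervene(hidden, direction, strength=2.0):
    # Shift every token's hidden state toward the truthful direction.
    return hidden + strength * direction

# Toy activations: (examples, hidden_dim) collected from an LVLM layer.
truthful = torch.randn(100, 768) + 0.5
hallucinated = torch.randn(100, 768) - 0.5
d = truthful_direction(truthful, hallucinated)
decoding_states = torch.randn(1, 32, 768)          # (batch, tokens, hidden)
print(intervene(decoding_states, d).shape)
```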
https://arxiv.org/abs/2503.10602
We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance of our method.
https://arxiv.org/abs/2503.10597
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is markedly more efficient than the current leading data annotation method, running 4.5x faster than GLaMM.
https://arxiv.org/abs/2503.10596
Structural analogies between ResNets and multigrid (MG) methods, such as common building blocks like convolutions and poolings, were already pointed out by He et al. in 2016. Multigrid methods are used in scientific computing for solving large sparse linear systems arising from partial differential equations. MG methods rely in particular on two main concepts: smoothing and residual restriction / coarsening. Exploiting these analogies, He and Xu developed the MgNet framework, which integrates MG schemes into the design of ResNets. In this work, we introduce a novel neural network building block inspired by polynomial smoothers from MG theory. From an MG perspective, our polynomial block naturally extends the MgNet framework to Poly-MgNet and at the same time reduces the number of weights in MgNet. We present a comprehensive study of our polynomial block, analyzing the choice of initial coefficients, the polynomial degree, and the placement of activation functions and batch normalizations. Our results demonstrate that constructing (quadratic) polynomial building blocks based on real and imaginary polynomial roots enhances Poly-MgNet's capacity in terms of accuracy. Furthermore, our approach achieves an improved trade-off between model accuracy and number of weights compared to ResNet as well as to specific configurations of MgNet.
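One way to read "polynomial block" concretely: a degree-2 polynomial of the block's operator, built from a (possibly complex-conjugate) pair of roots, applied to the residual as a smoothing correction. The sketch below follows that reading; the normalization and the placement of activations and batch norms, which the paper studies, are left out.

```python
import torch
import torch.nn as nn

class PolySmootherBlock(nn.Module):
    """Hedged sketch of a quadratic polynomial smoother block.

    Given an operator A (a convolution), one smoothing step applies
    u <- u + p(A) r with r = f - A u and p a degree-2 polynomial whose
    numerator roots are a +/- b i. Channel counts, the p(0) = 1
    normalization, and missing activations/batch norms are assumptions.
    """

    def __init__(self, channels=16, root=(1.0, 0.5)):
        super().__init__()
        self.A = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        a, b = root
        # (x - (a+bi))(x - (a-bi)) = x^2 - 2 a x + (a^2 + b^2), scaled so p(0) = 1.
        s = a * a + b * b
        self.register_buffer("coeffs", torch.tensor([1.0, -2 * a / s, 1.0 / s]))

    def forward(self, u, f):
        r = f - self.A(u)                       # residual
        Ar = self.A(r)
        p_r = self.coeffs[0] * r + self.coeffs[1] * Ar + self.coeffs[2] * self.A(Ar)
        return u + p_r                          # polynomial smoothing update

u = torch.zeros(1, 16, 32, 32)
f = torch.randn(1, 16, 32, 32)
print(PolySmootherBlock()(u, f).shape)
```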
https://arxiv.org/abs/2503.10594
This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes: first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a training dataset featuring a large degree of dynamics with camera parameter annotations, while designing a lightweight camera injection module and training scheme to preserve the dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
https://arxiv.org/abs/2503.10592
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See this https URL for more details.
https://arxiv.org/abs/2503.10589
Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts": flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source code and models are publicly available at this https URL.
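A compact sketch of the "complexity experts" routing bias: experts of different capacity plus a fixed preference for cheaper ones, with hard top-1 routing so irrelevant experts are skipped at inference. The expert widths and the bias scale are illustrative assumptions; the real MoCE-IR blocks also vary receptive fields.

```python
import torch
import torch.nn as nn

class ComplexityBiasedMoE(nn.Module):
    """Hedged sketch of complexity-biased expert routing.

    Experts differ in capacity (hidden width here); a learned gate plus a
    fixed bias toward lower-complexity experts picks one expert per input,
    so the others are bypassed. Widths and bias scale are assumptions.
    """

    def __init__(self, dim=64, widths=(32, 64, 128), bias_scale=0.5):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, w), nn.ReLU(), nn.Linear(w, dim)) for w in widths]
        )
        self.gate = nn.Linear(dim, len(widths))
        # Fixed preference for cheaper experts (more negative for wider ones).
        self.register_buffer("bias", -bias_scale * torch.arange(len(widths)).float())

    def forward(self, x):                                   # (batch, dim)
        logits = self.gate(x) + self.bias
        choice = logits.argmax(dim=-1)                      # hard top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])                 # irrelevant experts skipped
        return out

x = torch.randn(8, 64)
print(ComplexityBiasedMoE()(x).shape)
```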
https://arxiv.org/abs/2411.18466
Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible solutions that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. Altogether, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and their relation to the implicit bias, offering potential pathways for designing more efficient and robust models.
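To make the "spectral seminorm" claim concrete, here is a schematic statement in generic notation; the activation-dependent spectral weight g_sigma and the normalization are placeholders, not the paper's explicit formula. In the kernel regime, among all functions fitting the training data, the learned interpolant minimizes a seminorm whose Fourier-space form penalizes high frequencies.

```latex
% Schematic only: g_\sigma is an unspecified activation-dependent spectral
% weight; the paper derives its explicit form via the Radon transform.
\min_{f \,:\, f(x_i) = y_i \ \forall i} \ \|f\|_\sigma^2,
\qquad
\|f\|_\sigma^2 \;=\; \int_{\mathbb{R}^d}
\frac{|\hat{f}(\xi)|^2}{g_\sigma(\|\xi\|)}\, d\xi .
```

Here g_sigma decays as the frequency grows, so high-frequency components of f incur a larger penalty; the decay rate, and hence the strength of the smoothness bias, is set by the activation function.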
https://arxiv.org/abs/2503.10587
Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress is heavily dependent on large-scale, high-quality annotated data, which is costly and labor-intensive to obtain. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions and corresponding pseudo-answers for the unlabeled data using a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving-scene question answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
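The pseudo-annotation plus self-consistency step can be sketched as majority voting over repeated answers with an agreement threshold; the paper's Self-Consistency Refinement is more elaborate, and the model and question templates below are stand-ins.

```python
from collections import Counter

def pseudo_annotate(model, scene, question, num_samples=5, min_agreement=0.6):
    """Hedged sketch of the semi-supervised loop described in the abstract:
    a model trained on limited labeled data answers template-based questions
    about unlabeled scenes several times, and a consistency check keeps only
    answers that enough samples agree on. `model` is a stand-in callable.
    """
    answers = [model(scene, question) for _ in range(num_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / num_samples >= min_agreement:
        return {"question": question, "answer": answer}   # keep as pseudo-label
    return None                                           # discard inconsistent ones

# Toy usage with a deterministic stand-in model and hypothetical templates.
templates = ["What is the ego vehicle doing?", "Is there a pedestrian ahead?"]
fake_model = lambda scene, q: "yes" if "pedestrian" in q else "driving straight"
labels = [pseudo_annotate(fake_model, "scene_0001", q) for q in templates]
print([l for l in labels if l is not None])
```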
https://arxiv.org/abs/2503.10586
Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks seems to be limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image Search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, and (2) training from MAmmoTH-VL shows a 5% absolute gain. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These remarkable results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.
https://arxiv.org/abs/2503.10582
LiDAR-based 3D object detection presents significant challenges due to the inherent sparsity of LiDAR points. A common solution involves long-term temporal LiDAR data to densify the inputs. However, efficiently leveraging spatial-temporal information remains an open problem. In this paper, we propose a novel Semantic-Supervised Spatial-Temporal Fusion (ST-Fusion) method, which introduces a novel fusion module to relieve the spatial misalignment caused by object motion over time, together with feature-level semantic supervision to fully unlock the capacity of the proposed fusion module. Specifically, ST-Fusion consists of a Spatial Aggregation (SA) module and a Temporal Merging (TM) module. The SA module employs convolutional layers with progressively expanding receptive fields to aggregate object features from local regions and alleviate spatial misalignment, while the TM module dynamically extracts object features from preceding frames based on an attention mechanism, yielding a comprehensive sequential representation. For the semantic supervision, we propose a Semantic Injection method that enriches the sparse LiDAR data by injecting point-wise semantic labels; the enriched data is used to train a teacher model that provides a feature-level reconstruction target, supervised by the proposed object-aware loss. Extensive experiments on various LiDAR-based detectors demonstrate the effectiveness and universality of our proposal, yielding an improvement of approximately +2.8% NDS on the nuScenes benchmark.
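A rough sketch of the two named modules: spatial aggregation with progressively expanding receptive fields (realized here with dilated convolutions, an assumption) and temporal merging via attention over preceding frames. Feature shapes and layer choices are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class STFusionSketch(nn.Module):
    """Hedged sketch of SA + TM style spatial-temporal fusion.

    SA: stacked dilated convolutions give a growing receptive field over the
    current frame's features. TM: attention from the current frame to the
    preceding frames' features. Shapes and layers are assumptions.
    """

    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        self.sa = nn.Sequential(                       # growing receptive field
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),
        )
        self.tm = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, current, history):               # (B,C,H,W), (B,T,C,H,W)
        cur = self.sa(current)
        B, C, H, W = cur.shape
        q = cur.flatten(2).transpose(1, 2)                      # (B, HW, C)
        kv = history.permute(0, 1, 3, 4, 2).reshape(B, -1, C)   # (B, T*HW, C)
        merged, _ = self.tm(q, kv, kv)                          # attend to past frames
        return cur + merged.transpose(1, 2).reshape(B, C, H, W)

cur = torch.randn(1, 64, 32, 32)
hist = torch.randn(1, 3, 64, 32, 32)
print(STFusionSketch()(cur, hist).shape)
```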
https://arxiv.org/abs/2503.10579