Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.
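As a concrete illustration of the per-modality guidance control described above, here is a minimal PyTorch sketch (our own, not the paper's released architecture; the module name, channel counts, and the fusion-by-weighted-sum design are assumptions) that fuses an arbitrary dictionary of spatial conditions with an independent strength per modality:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse an arbitrary set of spatial conditions (depth, segmentation,
    edges, ...) into one conditioning map, with an independently
    controllable guidance strength per modality."""
    def __init__(self, in_channels: dict[str, int], out_channels: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Conv2d(c, out_channels, kernel_size=1)
             for name, c in in_channels.items()})

    def forward(self, conds, weights=None):
        weights = weights or {}
        fused = None
        for name, feat in conds.items():
            # Per-modality guidance strength scales its contribution.
            contrib = weights.get(name, 1.0) * self.proj[name](feat)
            fused = contrib if fused is None else fused + contrib
        return fused  # injected into the diffusion backbone as conditioning

fusion = MultimodalFusion({"depth": 1, "segmentation": 20, "edges": 1})
conds = {"depth": torch.randn(1, 1, 64, 64),
         "segmentation": torch.randn(1, 20, 64, 64),
         "edges": torch.randn(1, 1, 64, 64)}
out = fusion(conds, weights={"depth": 1.5})  # e.g. emphasize depth for bokeh
```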
https://arxiv.org/abs/2503.14503
Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at: this https URL.
https://arxiv.org/abs/2503.14501
We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task -- GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and semi-supervised (GCD) settings. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by finetuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, ImageNet-200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).
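A short sketch of the neighbor sampling-and-cleaning idea as we read it from the abstract (not the released implementation): positives are kept only if they are mutual nearest neighbors in feature space, and negatives are sampled from outside the neighborhood. In the GCD setting, labeled same-class images would simply be appended to the positive lists as ground-truth neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sample_neighbors(feats: np.ndarray, k: int = 20):
    """feats: (n, d) backbone features. Returns cleaned positive
    neighbor lists and randomly sampled negatives per point."""
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nn_index.kneighbors(feats)      # idx[:, 0] is the point itself
    knn = idx[:, 1:]
    n = len(feats)
    positives, negatives = [], []
    for i in range(n):
        # Cleaning: keep only mutual k-NN as "accurate" positives.
        positives.append([j for j in knn[i] if i in knn[j]])
        # Negatives: random points outside i's neighborhood.
        far = np.setdiff1d(np.arange(n), np.append(knn[i], i))
        negatives.append(np.random.choice(far, size=k, replace=False))
    return positives, negatives

feats = np.random.randn(200, 64)             # placeholder features
pos, neg = sample_neighbors(feats)           # feed into contrastive losses
```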
https://arxiv.org/abs/2503.14500
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
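A back-of-the-envelope sketch of one way such a 50% time horizon can be computed, consistent with the abstract's definition (this is our reading, not necessarily the paper's exact estimator, and the data below are made up): fit a logistic curve of model success against log human task duration, then solve for the duration at which predicted success is 50%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human_minutes, model_succeeded) pairs -- illustrative data only.
minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

X = np.log2(minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)
# p = sigmoid(w * log2(t) + b) = 0.5  =>  log2(t) = -b / w
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"50% time horizon ~ {horizon:.0f} human-minutes")
```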
https://arxiv.org/abs/2503.14499
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at this https URL.
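A minimal sketch of what a track encoder could look like (our own illustration; the class name, state layout, and GRU choice are assumptions, not the paper's design): each 3D track is a short sequence of object states that gets embedded into a single token and appended to the LMM's visual queries.

```python
import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    """Embed one token per tracked object from its (x, y, z, t) history."""
    def __init__(self, state_dim: int = 4, d_model: int = 256):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, tracks: torch.Tensor) -> torch.Tensor:
        # tracks: (num_tracks, T, state_dim) -> (num_tracks, d_model)
        h, _ = self.temporal(self.embed(tracks))
        return h[:, -1]                      # last hidden state per track

enc = TrackEncoder()
track_tokens = enc(torch.randn(12, 10, 4))  # 12 objects, 10 timesteps
# visual_queries = torch.cat([visual_queries, track_tokens], dim=0)
```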
https://arxiv.org/abs/2503.14498
DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSMs) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point cloud and SSM: The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNet V2 and SUN RGB-D datasets.
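To make the "queries as system states, scene points as system inputs" idea concrete, here is a deliberately schematic sketch (ours; it omits serialization ordering, inter-state attention, and the gated FFN, and the exact parameterization is an assumption): the query states are updated once per serialized scene point with a state-dependent decay, giving cost linear in the number of points.

```python
import torch
import torch.nn as nn

class InteractiveSSM(nn.Module):
    """Toy interactive SSM: query states absorb scene points one at a time."""
    def __init__(self, d: int = 128, num_queries: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d))
        self.A = nn.Linear(d, d)   # state-dependent decay
        self.B = nn.Linear(d, d)   # input write-in

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, d) serialized scene point features
        h = self.queries.clone()
        for x in points:                       # one recurrent step per point
            decay = torch.sigmoid(self.A(h))   # parameters depend on states
            h = decay * h + self.B(x)          # inject the scene point
        return h                               # updated query states

ssm = InteractiveSSM()
queries = ssm(torch.randn(1024, 128))          # linear in the 1024 points
```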
https://arxiv.org/abs/2503.14493
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
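The spatially adaptive weighting scheme can be illustrated with a few lines (an illustration of the idea as stated, not NVIDIA's implementation; names and shapes are assumptions): each control branch's features are blended with a per-pixel weight map, so different modalities can dominate at different locations.

```python
import torch

def blend_controls(branch_feats: dict[str, torch.Tensor],
                   weight_maps: dict[str, torch.Tensor]) -> torch.Tensor:
    """branch_feats[name]: (B, C, H, W); weight_maps[name]: (B, 1, H, W)."""
    total = sum(weight_maps.values())
    fused = sum(weight_maps[n] / total.clamp(min=1e-6) * f
                for n, f in branch_feats.items())
    return fused

feats = {"seg":   torch.randn(1, 64, 32, 32),
         "depth": torch.randn(1, 64, 32, 32)}
w = {"seg":   torch.ones(1, 1, 32, 32) * 0.3,
     "depth": torch.ones(1, 1, 32, 32) * 0.7}   # depth dominates everywhere
out = blend_controls(feats, w)
```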
https://arxiv.org/abs/2503.14492
We are interested in the construction of software that can act as scientific assistants to domain specialists. It is expected that such assistants will be needed to accelerate the identification of ways to address complex problems requiring urgent solutions. In this paper, our focus is not on a specific scientific problem, but on the software-engineering of such 'science accelerators'. Recent developments in 'No Code' techniques would seem to suggest that scientists can hypothesise solutions simply by conversing with a large language model (LLM). However, for complex scientific problems, this seems unlikely given the current state of LLM technology. What does appear feasible is that a software engineer can use LLMs to rapidly construct programs for use by a domain specialist, incorporating the specialist's requirements expressed in natural language. We propose the design of an interactive form of 'structured' inductive programming in which a software engineer and an LLM collaboratively construct an 'assistant' for a scientific data analysis. The paper describes a simple implementation called iStrucInd that adapts a '2-way Intelligibility' protocol to implement the interaction between the software engineer and the LLM. We test the tool on two different non-trivial scientific data analysis tasks. Specifically, we compare the system constructed by iStrucInd against systems constructed manually and by Low Code/No Code methods along dimensions of: (a) program performance; (b) program quality; and (c) programming effort. The results show iStrucInd allows a software engineer to develop better programs faster, suggesting that interactive structured induction can play a useful role in the rapid construction of scientific assistants.
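A condensed rendering of what such an interactive structured-induction loop might look like (our sketch, not iStrucInd itself; `llm` is a hypothetical placeholder for a chat-model call): the engineer supplies a task decomposition, the LLM drafts each component, and every draft is accepted or revised against a check before being assembled.

```python
def llm(prompt: str) -> str:
    """Placeholder for an LLM API call (hypothetical stub)."""
    return "def step(x):\n    return x"

def structured_induction(spec_steps, test):
    """spec_steps: engineer-authored decomposition in natural language;
    test(spec, draft) -> bool: engineer-supplied acceptance check."""
    program = []
    for step_spec in spec_steps:
        draft = llm(f"Write a function for: {step_spec}")
        for _ in range(3):                    # bounded revision loop
            if test(step_spec, draft):
                break
            draft = llm(f"Revise for: {step_spec}\nPrevious draft:\n{draft}")
        program.append(draft)                 # accepted component
    return "\n\n".join(program)

steps = ["parse the CSV of assay results", "fit a baseline classifier"]
always_ok = lambda spec, draft: True          # stand-in acceptance test
assistant_code = structured_induction(steps, always_ok)
```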
https://arxiv.org/abs/2503.14488
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: this https URL
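A rough sketch of batch-level routing with a capacity predictor, as we read the abstract (names, shapes, and the scalar timestep reduction are our assumptions): tokens from the whole batch form one pool, and a small predictor scales per-expert capacity with the diffusion noise level so harder (noisier) steps receive more compute.

```python
import torch
import torch.nn as nn

class GlobalPoolRouter(nn.Module):
    def __init__(self, d: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d, num_experts)
        self.capacity = nn.Sequential(nn.Linear(1, 16), nn.SiLU(),
                                      nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, t: torch.Tensor):
        # tokens: (B, N, d) -> batch-level global pool (B*N, d); t: (B,)
        pool = tokens.flatten(0, 1)
        scores = self.gate(pool).softmax(-1)             # (B*N, E)
        # Crude reduction: one capacity fraction for the whole batch.
        cap_frac = self.capacity(t.float().mean()[None, None])
        cap = int(cap_frac.item() * pool.size(0)) + 1    # tokens per expert
        topk = scores.topk(cap, dim=0).indices           # (cap, E)
        return topk, scores                              # expert assignments

router = GlobalPoolRouter(d=64, num_experts=8)
assign, scores = router(torch.randn(4, 256, 64), torch.randint(0, 1000, (4,)))
```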
https://arxiv.org/abs/2503.14487
Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. On the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. This way we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data in different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results both in terms of photorealism and temporal consistency.
https://arxiv.org/abs/2503.14485
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions: those that are ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
https://arxiv.org/abs/2503.14484
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, and the estimated depth has been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling its direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at this https URL.
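For intuition, here is a small worked example of one standard way SfM information can constrain monocular depth (illustrative only; the paper integrates the prior into the estimation network itself rather than as a post-hoc fit): solve a least-squares scale and shift that aligns the relative depth map to sparse triangulated SfM depths, resolving the monocular scale ambiguity.

```python
import numpy as np

def align_to_sfm(mono_depth: np.ndarray, sfm_uv: np.ndarray,
                 sfm_depth: np.ndarray) -> np.ndarray:
    """mono_depth: (H, W) relative depth; sfm_uv: (N, 2) pixel (x, y);
    sfm_depth: (N,) metric depths from SfM triangulation."""
    d = mono_depth[sfm_uv[:, 1], sfm_uv[:, 0]]     # depth at SfM pixels
    A = np.stack([d, np.ones_like(d)], axis=1)     # solve s*d + b = sfm_depth
    (s, b), *_ = np.linalg.lstsq(A, sfm_depth, rcond=None)
    return s * mono_depth + b

depth = np.random.rand(480, 640) + 0.5             # placeholder relative depth
uv = np.random.randint(0, [640, 480], size=(100, 2))
aligned = align_to_sfm(depth, uv, 2.0 * depth[uv[:, 1], uv[:, 0]] + 0.1)
```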
https://arxiv.org/abs/2503.14483
Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized by the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images. We further decompose these into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
https://arxiv.org/abs/2503.14482
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: \emph{collaborative self-play}. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that \emph{transfer} to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
https://arxiv.org/abs/2503.14481
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
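A sketch of the two namesake components as we read them from the paper's public description (the epsilon values and function signatures here are illustrative, not the released verl code): a PPO-style objective with decoupled lower/upper clip ranges, and dynamic sampling that drops prompts whose rollout group carries no reward signal.

```python
import torch

def dapo_loss(logp: torch.Tensor, logp_old: torch.Tensor, adv: torch.Tensor,
              eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """Token-level policy loss with decoupled (asymmetric) clipping."""
    ratio = torch.exp(logp - logp_old)
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)  # "clip-higher"
    return -torch.minimum(ratio * adv, clipped * adv).mean()

def keep_prompt(group_rewards: torch.Tensor) -> bool:
    """Dynamic sampling: resample prompts whose rollouts are all-correct
    or all-wrong (zero within-group reward variance, zero advantage)."""
    return group_rewards.std() > 0
```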
https://arxiv.org/abs/2503.14476
The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements, which limit their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
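An illustrative sketch of frequency-modulated coarse-to-fine supervision as we understand it from the abstract (the annealing schedule and parameters are our assumptions): early iterations train against low-pass filtered images, and the blur is annealed away so higher frequencies appear only once the coarse representation is in place.

```python
import torch
import torchvision.transforms.functional as TF

def modulated_target(image: torch.Tensor, step: int, total: int,
                     sigma_max: float = 8.0) -> torch.Tensor:
    """Return the training target for this step: a progressively
    sharper version of the ground-truth image."""
    sigma = sigma_max * (1.0 - step / total)     # anneal blur to zero
    if sigma <= 0.1:
        return image
    k = int(2 * round(3 * sigma) + 1)            # odd kernel covering +-3 sigma
    return TF.gaussian_blur(image, kernel_size=k, sigma=sigma)

img = torch.rand(3, 256, 256)
coarse = modulated_target(img, step=0, total=30_000)      # heavy blur
fine = modulated_target(img, step=30_000, total=30_000)   # original image
```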
https://arxiv.org/abs/2503.14475
Different attribution scores have been proposed to quantify the relevance of database tuples for a query answer from a database. Among them, we find Causal Responsibility, the Shapley Value, the Banzhaf Power-Index, and the Causal Effect. They have been analyzed in isolation, mainly in terms of computational properties. In this work, we start an investigation into the alignment of these scores on the basis of the queries at hand; that is, on whether they induce compatible rankings of tuples. We are able to identify vast classes of queries for which some pairs of scores are always aligned, and others for which they are not. It turns out that the presence of exogenous tuples makes a crucial difference in this regard.
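A tiny worked example (ours, for intuition; it is not taken from the paper) of one such score, the Shapley value of a tuple: for a Boolean query "does some r-tuple join with s1?", each tuple's score is its average marginal contribution over all orderings of the database.

```python
from itertools import permutations

database = {"r1", "r2", "s1"}                 # two r-tuples and one s-tuple

def query(subset) -> int:                     # 1 iff s1 present with some r
    return int("s1" in subset and ({"r1", "r2"} & subset != set()))

def shapley(t: str) -> float:
    perms = list(permutations(database))
    total = 0.0
    for order in perms:
        before = set(order[:order.index(t)])  # tuples added earlier
        total += query(before | {t}) - query(before)
    return total / len(perms)

for t in sorted(database):
    print(t, shapley(t))  # s1 is pivotal more often: s1 = 2/3, r1 = r2 = 1/6
```

The scores sum to query(database) - query(empty set) = 1, as Shapley efficiency requires; comparing such values across scores and queries is exactly the ranking-compatibility question the abstract studies.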
https://arxiv.org/abs/2503.14469
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
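A schematic of the evolutionary program-search loop described above (our condensation; `llm_propose` and `score` are hypothetical stubs standing in for the LLM call and the model-fitting evaluation): the LLM mutates the best-scoring feature programs, and validation feedback drives selection.

```python
import heapq
import random

def llm_propose(program: str, exemplars) -> str:
    """Placeholder for an LLM call that rewrites a feature-engineering
    program given top-scoring exemplars (hypothetical stub)."""
    return program + f"\ndf['f{random.randint(0, 99)}'] = df.iloc[:, 0] ** 2"

def score(program: str, train_df, val_df) -> float:
    """Placeholder: apply the program, fit a model, return val score."""
    return random.random()

def llm_fe(train_df, val_df, rounds: int = 20, pop: int = 5) -> str:
    population = [("# identity program", 0.0)]     # (program, fitness)
    for _ in range(rounds):
        parents = heapq.nlargest(pop, population, key=lambda x: x[1])
        for prog, _ in parents:
            child = llm_propose(prog, parents)     # domain-aware mutation
            population.append((child, score(child, train_df, val_df)))
    return max(population, key=lambda x: x[1])[0]  # best feature program
```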
https://arxiv.org/abs/2503.14434
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
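A sketch of the "play" loop as we understand it from the abstract (`llm` and `call_tool` are hypothetical stubs, and the prompt wording is ours): probe the tool with generated inputs, record input-output behavior including errors, and use the observations to rewrite the documentation and mint usage examples.

```python
def llm(prompt: str) -> str:
    """Placeholder for a chat-model call (hypothetical stub)."""
    return '{"query": "test"}'

def call_tool(name: str, args: str) -> str:
    """Placeholder for the external tool invocation (hypothetical stub)."""
    return "ok"

def play_with_tool(tool_name: str, draft_doc: str, trials: int = 10):
    observations = []
    for _ in range(trials):
        args = llm(f"Docs so far:\n{draft_doc}\n"
                   f"Propose trial arguments for {tool_name}.")
        try:
            observations.append((args, call_tool(tool_name, args)))
        except Exception as err:                 # failures refine docs too
            observations.append((args, f"ERROR: {err}"))
        draft_doc = llm(f"Rewrite the docs given observations:\n{observations}")
    examples = [o for o in observations if not str(o[1]).startswith("ERROR")]
    return draft_doc, examples                   # refined docs + usage examples
```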
https://arxiv.org/abs/2503.14432