We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at this https URL.
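The abstract does not spell out the adapter design, but the two ingredients it names, a music-video cross-attention layer and a low-rank adapter on a frozen backbone, can be sketched roughly as below. All module names, dimensions, and the residual placement are illustrative assumptions, not MusicInfuser's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + scale * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class MusicVideoCrossAttention(nn.Module):
    """Video tokens attend to music features; only the lightweight parts are trainable."""
    def __init__(self, dim: int = 512, music_dim: int = 128, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_kv = nn.Linear(music_dim, 2 * dim)   # project music features to keys/values
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_out = LoRALinear(nn.Linear(dim, dim))

    def forward(self, video_tokens, music_feats):
        # video_tokens: (B, N, dim) from the diffusion backbone; music_feats: (B, T, music_dim)
        k, v = self.to_kv(music_feats).chunk(2, dim=-1)
        out, _ = self.attn(self.norm(video_tokens), k, v)
        return video_tokens + self.to_out(out)       # residual injection into the backbone

layer = MusicVideoCrossAttention()
video = torch.randn(2, 64, 512)    # 2 clips, 64 spatio-temporal tokens each
music = torch.randn(2, 32, 128)    # 32 frames of precomputed music features
print(layer(video, music).shape)   # torch.Size([2, 64, 512])
```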
https://arxiv.org/abs/2503.14505
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preferences remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding; multi-image, video, and audio; and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at this https URL.
https://arxiv.org/abs/2503.14504
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See the project page at this https URL.
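The independent per-modality guidance strengths described above can be pictured with a classifier-free-guidance-style combination rule. The following is a hypothetical sketch of that sampling-time weighting, with all names and tensor shapes assumed rather than taken from the paper.

```python
import torch

def multimodal_guidance(eps_uncond, eps_per_modality, weights):
    """
    Combine denoiser predictions with an independent guidance weight per modality,
    in the spirit of classifier-free guidance extended to several conditions.

    eps_uncond:        (B, C, H, W) prediction with all conditions dropped
    eps_per_modality:  dict name -> (B, C, H, W) prediction with only that condition
    weights:           dict name -> float guidance strength (0 disables the modality)
    """
    guided = eps_uncond.clone()
    for name, eps_cond in eps_per_modality.items():
        guided = guided + weights.get(name, 0.0) * (eps_cond - eps_uncond)
    return guided

# toy usage: push the output toward the depth cue more strongly than the text prompt
B, C, H, W = 1, 4, 64, 64
eps_u = torch.randn(B, C, H, W)
eps_m = {"depth": torch.randn(B, C, H, W), "text": torch.randn(B, C, H, W)}
out = multimodal_guidance(eps_u, eps_m, {"depth": 3.0, "text": 1.5})
print(out.shape)
```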
https://arxiv.org/abs/2503.14503
Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at this https URL.
https://arxiv.org/abs/2503.14501
We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task -- GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) both in the unsupervised (clustering) and semi-supervised (GCD) setting. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by fine-tuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, ImageNet-200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).
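As a rough illustration of the neighbor-based training signal, and of how labeled images can stand in as ground-truth neighbors in the GCD setting, the toy loss below pulls the cluster assignments of sampled positive neighbors together and pushes negatives apart. The function names and the exact loss form are assumptions, not UNIC's implementation.

```python
import torch
import torch.nn.functional as F

def neighbor_losses(probs, pos_idx, neg_idx):
    """
    Toy neighbor-based clustering objective: encourage agreement of cluster
    distributions with positive neighbors and disagreement with negatives.

    probs:   (N, K) soft cluster assignments from the backbone + clustering head
    pos_idx: (N,) index of a sampled positive neighbor for each sample
    neg_idx: (N,) index of a sampled negative neighbor for each sample
    """
    pos_sim = (probs * probs[pos_idx]).sum(dim=1)   # agreement with positives
    neg_sim = (probs * probs[neg_idx]).sum(dim=1)   # agreement with negatives
    return (-torch.log(pos_sim + 1e-8).mean()
            - torch.log(1.0 - neg_sim + 1e-8).mean())

def labeled_positive_indices(labels):
    """In the GCD setting, pick another sample with the same label as each sample's positive."""
    pos = torch.arange(len(labels))
    for i, y in enumerate(labels):
        same = (labels == y).nonzero(as_tuple=True)[0]
        same = same[same != i]
        if len(same) > 0:
            pos[i] = same[torch.randint(len(same), (1,)).item()]
    return pos

probs = F.softmax(torch.randn(8, 5), dim=1)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
pos = labeled_positive_indices(labels)
neg = torch.roll(pos, shifts=3)   # crude stand-in for mined negative neighbors
print(neighbor_losses(probs, pos, neg).item())
```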
https://arxiv.org/abs/2503.14500
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
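The closing extrapolation follows from simple exponential-growth arithmetic; the sketch below reproduces it using the 50-minute horizon and seven-month doubling time stated above (the work-week conversion is an illustrative assumption).

```python
def extrapolate_time_horizon(current_minutes=50.0, doubling_months=7.0, years_ahead=5):
    """Project the 50%-task-completion time horizon forward under steady exponential growth."""
    doublings = 12 * years_ahead / doubling_months
    return current_minutes * 2 ** doublings

horizon_min = extrapolate_time_horizon()
work_weeks = horizon_min / 60 / 40          # convert minutes to 40-hour work weeks
print(f"{horizon_min:,.0f} minutes ≈ {work_weeks:.0f} work weeks")
# roughly 19,000 minutes ≈ 8 work weeks, i.e. on the order of a working month or two,
# which matches the scale of "tasks that currently take humans a month"
```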
https://arxiv.org/abs/2503.14499
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final-score improvement on DriveLM-CARLA. Our code is available at this https URL.
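A minimal sketch of a track encoder enriching visual query tokens might look as follows. The tracklet parameterization (box center, size, and yaw over T timesteps), the GRU summarizer, and the cross-attention fusion are all illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    """Encode per-object 3D tracklets (T timesteps of box center + size + yaw) into tokens."""
    def __init__(self, track_dim: int = 7, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(track_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, tracks):
        # tracks: (B, num_objects, T, track_dim)
        B, N, T, D = tracks.shape
        x = self.point_mlp(tracks.reshape(B * N, T, D))
        _, h = self.temporal(x)                       # summarize each tracklet over time
        return self.proj(h[-1]).reshape(B, N, -1)     # (B, num_objects, out_dim)

class TrackConditionedQueries(nn.Module):
    """Enrich LMM visual query tokens with track tokens via cross-attention."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_queries, track_tokens):
        fused, _ = self.attn(self.norm(visual_queries), track_tokens, track_tokens)
        return visual_queries + fused

enc, fuse = TrackEncoder(), TrackConditionedQueries()
tracks = torch.randn(2, 12, 10, 7)        # 2 scenes, 12 tracked objects, 10 timesteps
queries = torch.randn(2, 32, 512)         # 32 visual query tokens fed to the LMM
print(fuse(queries, enc(tracks)).shape)   # torch.Size([2, 32, 512])
```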
https://arxiv.org/abs/2503.14498
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at this https URL.
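The temporal-consistency idea can be sketched as a loop that re-queries the verifier with its previous assessment in context and stops once the verdict has stayed unchanged for several consecutive rounds. `call_verifier`, the stopping rule, and the round counts below are placeholders, not the paper's exact procedure.

```python
def temporally_consistent_verdict(problem, solution, call_verifier,
                                  max_rounds=8, patience=3):
    """
    Iteratively re-verify a solution, keeping the previous assessment in context,
    and stop once the verdict has been stable for `patience` consecutive rounds.

    call_verifier(problem, solution, previous) stands in for an LLM call that
    returns a verdict, e.g. the index of the first erroneous step (or None).
    """
    previous = None
    stable = 0
    for _ in range(max_rounds):
        verdict = call_verifier(problem, solution, previous)
        stable = stable + 1 if verdict == previous else 1
        previous = verdict
        if stable >= patience:          # temporal consistency reached
            break
    return previous

# toy stand-in: a "verifier" that wobbles once, then settles on step 3 as the error
script = iter([2, 3, 3, 3, 3, 3, 3, 3])
fake_verifier = lambda problem, solution, previous: next(script)
print(temporally_consistent_verdict("prob", "sol", fake_verifier))   # 3
```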
https://arxiv.org/abs/2503.14495
Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final-layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
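A rough sketch of the branch-plus-refiner structure: transformer layers are split into branches, each branch emits an intermediate velocity prediction for deep supervision, and a lightweight refiner sits between branches (standing in for the VeRA block, without its acceleration term). Dimensions and module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VelocityRefiner(nn.Module):
    """Lightweight block between branches that refines intermediate velocity features."""
    def __init__(self, dim: int):
        super().__init__()
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
                                   nn.Linear(dim, dim))

    def forward(self, h):
        return h + self.block(h)

class DeeplySupervisedFlow(nn.Module):
    """Transformer layers split into branches, each emitting a velocity prediction."""
    def __init__(self, dim=256, depth=8, branches=2, heads=4):
        super().__init__()
        per_branch = depth // branches
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                                   batch_first=True, norm_first=True)
        self.branches = nn.ModuleList(
            nn.ModuleList(layer() for _ in range(per_branch)) for _ in range(branches))
        self.refiners = nn.ModuleList(VelocityRefiner(dim) for _ in range(branches - 1))
        self.velocity_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(branches))

    def forward(self, tokens):
        preds, h = [], tokens
        for i, branch in enumerate(self.branches):
            for blk in branch:
                h = blk(h)
            preds.append(self.velocity_heads[i](h))   # intermediate velocity prediction
            if i < len(self.refiners):
                h = self.refiners[i](h)               # align features before the next branch
        return preds

model = DeeplySupervisedFlow()
x_t = torch.randn(2, 16, 256)                         # noisy latent tokens
target_v = torch.randn(2, 16, 256)                    # linear-interpolant velocity target
loss = sum(F.mse_loss(p, target_v) for p in model(x_t))   # deep supervision on every branch
print(loss.item())
```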
https://arxiv.org/abs/2503.14494
DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSM) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs: the serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM; the inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new state of the art on the ScanNet V2 and SUN RGB-D datasets.
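Loosely in the spirit of "queries as system states, scene points as inputs", the toy recurrence below scans serialized points in linear time, updating a fixed set of query states with input- and state-dependent gates and writing information back to the point features. It is a conceptual sketch only; the actual state-dependent SSM parameterization, bidirectional scanning, inter-state attention, and gated feed-forward network are not reproduced here.

```python
import torch
import torch.nn as nn

class InteractiveSSM(nn.Module):
    """Toy recurrence: query vectors act as recurrent states driven by serialized scene points."""
    def __init__(self, point_dim=64, state_dim=64, num_queries=8):
        super().__init__()
        self.states0 = nn.Parameter(torch.randn(num_queries, state_dim) * 0.02)
        self.gate = nn.Linear(point_dim + state_dim, state_dim)   # input/state-dependent gate
        self.inp = nn.Linear(point_dim, state_dim)                # maps the input into state space
        self.point_update = nn.Linear(state_dim, point_dim)       # writes state info back to points

    def forward(self, points):
        # points: (B, N, point_dim), assumed already serialized into a 1D order
        B, N, _ = points.shape
        states = self.states0.unsqueeze(0).expand(B, -1, -1).contiguous()
        new_points = []
        for t in range(N):                                        # linear-time scan over points
            x = points[:, t]                                      # (B, point_dim)
            x_exp = x.unsqueeze(1).expand(-1, states.shape[1], -1)
            a = torch.sigmoid(self.gate(torch.cat([x_exp, states], dim=-1)))
            states = a * states + (1 - a) * self.inp(x).unsqueeze(1)        # update query states
            new_points.append(x + self.point_update(states.mean(dim=1)))    # update point features
        return torch.stack(new_points, dim=1), states             # updated points and query states

ssm = InteractiveSSM()
pts = torch.randn(2, 128, 64)
upd_pts, queries = ssm(pts)
print(upd_pts.shape, queries.shape)    # torch.Size([2, 128, 64]) torch.Size([2, 8, 64])
```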
https://arxiv.org/abs/2503.14493
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. By design, the spatial conditioning scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
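The adaptive spatial conditioning can be pictured as blending per-modality control residuals into the backbone with per-location weight maps; the function below is a hypothetical sketch of that weighting, not Cosmos-Transfer's implementation.

```python
import torch

def spatially_weighted_control(base_features, control_residuals, weight_maps):
    """
    Blend control-branch residuals into backbone features with per-modality,
    per-location weights (a toy version of adaptive spatial conditioning).

    base_features:     (B, C, H, W) backbone activations
    control_residuals: dict modality -> (B, C, H, W) residual from that control branch
    weight_maps:       dict modality -> (B, 1, H, W) spatial weights in [0, 1]
    """
    out = base_features
    for name, residual in control_residuals.items():
        out = out + weight_maps[name] * residual
    return out

# toy usage: trust the depth control on the left half of the frame, edges on the right
B, C, H, W = 1, 8, 16, 16
feats = torch.randn(B, C, H, W)
residuals = {"depth": torch.randn(B, C, H, W), "edge": torch.randn(B, C, H, W)}
w_depth = torch.zeros(B, 1, H, W); w_depth[..., : W // 2] = 1.0
w_edge = 1.0 - w_depth
print(spatially_weighted_control(feats, residuals, {"depth": w_depth, "edge": w_edge}).shape)
```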
https://arxiv.org/abs/2503.14492
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.
https://arxiv.org/abs/2503.14489
We are interested in the construction of software that can act as scientific assistants to domain specialists. It is expected that such assistants will be needed to accelerate the identification of ways to address complex problems requiring urgent solutions. In this paper, our focus is not on a specific scientific problem, but on the software-engineering of such 'science accelerators'. Recent developments in 'No Code' techniques would seem to suggest that scientists can hypothesise solutions simply by conversing with a large language model (LLM). However, for complex scientific problems, this seems unlikely given the current state of LLM technology. What does appear feasible is that a software engineer can use LLMs to rapidly construct programs for use by a domain-specialist, including the specialist's requirements expressed in natural language. We propose the design of an interactive form of 'structured' inductive programming in which a software engineer and an LLM collaboratively construct an 'assistant' for a scientific data analysis. The paper describes a simple implementation called iStrucInd that adapts a '2-way Intelligibility' protocol to implement the interaction between the software engineer and the LLM. We test the tool on two different non-trivial scientific data analysis tasks. Specifically, we compare the system constructed by iStrucInd against systems constructed manually and by Low Code/No Code methods along the dimensions of: (a) program performance; (b) program quality; and (c) programming effort. The results show that iStrucInd allows a software engineer to develop better programs faster, suggesting that interactive structured induction can play a useful role in the rapid construction of scientific assistants.
https://arxiv.org/abs/2503.14488
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: this https URL
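A much-simplified sketch of the two mechanisms named above, a batch-level global token pool and a capacity predictor tied to the noise level, is given below. The routing rule, the capacity formula, and all dimensions are assumptions for illustration, not DiffMoE's actual layer.

```python
import torch
import torch.nn as nn

class BatchPoolMoE(nn.Module):
    """
    Simplified MoE layer: tokens from the whole batch are flattened into one global
    pool, a router scores them against experts, and a capacity predictor scales how
    many tokens each expert actually processes (e.g. more at high noise levels).
    """
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.capacity_predictor = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                                                nn.Linear(16, 1), nn.Sigmoid())
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))

    def forward(self, tokens, noise_level):
        # tokens: (B, N, dim); noise_level: (B, 1) diffusion timestep scaled to [0, 1]
        B, N, D = tokens.shape
        pool = tokens.reshape(B * N, D)                          # batch-level global token pool
        scores = self.router(pool).softmax(dim=-1)               # (B*N, num_experts)
        cap_frac = self.capacity_predictor(noise_level).mean()   # fraction of the pool to process
        capacity = max(1, int(cap_frac * pool.shape[0] / len(self.experts)))
        out = torch.zeros_like(pool)
        for e, expert in enumerate(self.experts):
            topk = scores[:, e].topk(capacity).indices           # each expert takes its top tokens
            out[topk] += scores[topk, e, None] * expert(pool[topk])
        return out.reshape(B, N, D)

moe = BatchPoolMoE()
x = torch.randn(2, 32, 64)
t = torch.rand(2, 1)
print(moe(x, t).shape)    # torch.Size([2, 32, 64])
```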
https://arxiv.org/abs/2503.14487
Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. On the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. In this way, we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data under different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results both in terms of photorealism and temporal consistency.
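One common way to realize a "lighting injection mechanism" on top of a frozen, pretrained video diffusion backbone is a zero-initialized feature modulation driven by a lighting embedding (e.g. an encoded HDR environment map); the block below sketches that pattern. It is an assumed stand-in for illustration, not the Lux Post Facto mechanism.

```python
import torch
import torch.nn as nn

class LightingInjection(nn.Module):
    """
    Generic lighting-conditioning block: a lighting embedding produces a per-channel
    scale and shift that modulates backbone features, starting as an identity map.
    """
    def __init__(self, feat_dim=320, light_dim=512):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(light_dim, feat_dim), nn.SiLU(),
                                   nn.Linear(feat_dim, 2 * feat_dim))
        nn.init.zeros_(self.embed[-1].weight)   # zero init: pretrained behavior preserved at start
        nn.init.zeros_(self.embed[-1].bias)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, features, light_code):
        # features: (B, T, N, feat_dim) video tokens; light_code: (B, light_dim)
        scale, shift = self.embed(light_code).chunk(2, dim=-1)
        scale, shift = scale[:, None, None, :], shift[:, None, None, :]
        return features + scale * self.norm(features) + shift

inject = LightingInjection()
feats = torch.randn(2, 8, 64, 320)       # 8 frames, 64 spatial tokens per frame
light = torch.randn(2, 512)              # e.g. an encoded HDR environment map
print(torch.allclose(inject(feats, light), feats))   # True: zero-initialized injection
```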
https://arxiv.org/abs/2503.14485
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms for interpreting unclear instructions, i.e., instructions that are ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
https://arxiv.org/abs/2503.14484
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, and the resulting estimates have been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling its direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes, including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at this https URL.
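The simplest way to see why an SfM prior helps is that sparse triangulated points can anchor the scale and shift that monocular depth leaves ambiguous; the snippet below shows that post-hoc alignment as a stand-in. The paper presumably injects the SfM information into the depth network itself, so treat this only as a minimal illustration of the prior.

```python
import torch

def align_depth_to_sfm(pred_depth, sfm_depth, sfm_mask):
    """
    Fit a per-image scale and shift so the monocular prediction agrees with sparse
    SfM depths at the pixels where SfM triangulated a point (least squares).

    pred_depth: (H, W) relative depth from a monocular network
    sfm_depth:  (H, W) metric depth at sparse SfM points, 0 elsewhere
    sfm_mask:   (H, W) bool, True where an SfM point projects
    """
    x = pred_depth[sfm_mask]                              # (M,)
    y = sfm_depth[sfm_mask]                               # (M,)
    A = torch.stack([x, torch.ones_like(x)], dim=1)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution  # [scale, shift]
    scale, shift = sol[0, 0], sol[1, 0]
    return scale * pred_depth + shift

# toy usage with a synthetic example where the answer is known
H, W = 8, 8
true_depth = torch.rand(H, W) * 5 + 1
pred = (true_depth - 2.0) / 3.0                    # prediction off by a scale and shift
mask = torch.rand(H, W) < 0.2                      # ~20% of pixels have SfM points
sparse = torch.where(mask, true_depth, torch.zeros_like(true_depth))
aligned = align_depth_to_sfm(pred, sparse, mask)
print(torch.allclose(aligned, true_depth, atol=1e-4))   # True
```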
https://arxiv.org/abs/2503.14483
Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. Eleven metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
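The six evaluation dimensions listed above suggest a simple per-case record plus aggregation; the schema below is purely illustrative (field names, score ranges, and the skipping rule are assumptions, not ICE-Bench's actual format).

```python
from dataclasses import dataclass, field
from statistics import mean

# The six dimensions named in the abstract; the schema itself is illustrative only.
DIMENSIONS = ["aesthetic_quality", "imaging_quality", "prompt_following",
              "source_consistency", "reference_consistency", "controllability"]

@dataclass
class CaseResult:
    task: str                                   # one of the 31 fine-grained tasks
    scores: dict = field(default_factory=dict)  # dimension -> score in [0, 1]

def aggregate(results):
    """Average each dimension over all test cases, skipping dimensions a task does not use
    (e.g. reference consistency is not meaningful for no-reference creation)."""
    report = {}
    for dim in DIMENSIONS:
        vals = [r.scores[dim] for r in results if dim in r.scores]
        report[dim] = mean(vals) if vals else None
    return report

results = [CaseResult("no_ref_creating", {"aesthetic_quality": 0.8, "prompt_following": 0.7}),
           CaseResult("ref_editing", {"reference_consistency": 0.6, "controllability": 0.9})]
print(aggregate(results))
```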
https://arxiv.org/abs/2503.14482
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: collaborative self-play. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that transfer to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
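The incentive structure can be made concrete with a toy group-level reward: every agent in an episode receives the same scalar, which rewards a collectively correct answer and charges for tool calls and communication. The reward shape and cost values below are illustrative assumptions, not the paper's training setup.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    agent: str
    used_tool: bool       # e.g. a corpus-specific retrieval call
    message_len: int      # proxy for communication effort

def group_reward(turns, final_answer, gold_answer,
                 correct_bonus=1.0, tool_cost=0.05, token_cost=0.0005):
    """
    Group-level reward for a collaborative episode: every agent shares the same
    scalar, so the incentive is to be collectively correct with minimal effort.
    """
    reward = correct_bonus if final_answer == gold_answer else 0.0
    reward -= tool_cost * sum(t.used_tool for t in turns)
    reward -= token_cost * sum(t.message_len for t in turns)
    return {t.agent: reward for t in turns}     # identical credit to each participant

episode = [Turn("agent_a", used_tool=True, message_len=120),
           Turn("agent_b", used_tool=False, message_len=40),
           Turn("agent_a", used_tool=False, message_len=30)]
print(group_reward(episode, final_answer="42", gold_answer="42"))
```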
https://arxiv.org/abs/2503.14481
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at this https URL.
https://arxiv.org/abs/2503.14478