Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. We address this gap by leveraging sketches, a popular form of human expression, as a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. Extensive experiments demonstrate successful few-shot convergence across novel keypoints and classes.
https://arxiv.org/abs/2507.07994
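To make the prototypical idea concrete, here is a minimal NumPy sketch (our illustration, not the paper's code): keypoint prototypes are averaged support embeddings, and each keypoint is located at the best-matching cell of a query feature grid, loosely mirroring the grid-based locator.

```python
import numpy as np

def build_prototypes(support_feats: np.ndarray) -> np.ndarray:
    # support_feats: (n_shots, n_keypoints, dim) embeddings sampled at
    # the annotated keypoints of the support sketches
    return support_feats.mean(axis=0)                  # (n_keypoints, dim)

def locate_keypoints(query_grid: np.ndarray, prototypes: np.ndarray):
    # query_grid: (H, W, dim) feature map of the query image
    h, w, d = query_grid.shape
    cells = query_grid.reshape(-1, d)
    cells = cells / (np.linalg.norm(cells, axis=1, keepdims=True) + 1e-8)
    protos = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    best = (cells @ protos.T).argmax(axis=0)           # best cell per keypoint
    return [(int(i) // w, int(i) % w) for i in best]

rng = np.random.default_rng(0)
protos = build_prototypes(rng.normal(size=(5, 3, 16)))  # 5 shots, 3 keypoints
print(locate_keypoints(rng.normal(size=(8, 8, 16)), protos))
```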
Animation colorization is a crucial part of real-world animation industry production. Colorizing long animations has high labor costs, so automated long-animation colorization based on video generation models has significant research value. Existing studies are limited to short-term colorization and adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information and fails to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework that mainly comprises a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long-video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth transitions between video segments. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short- and long-term color consistency for the open-domain animation colorization task. The code can be found at this https URL.
https://arxiv.org/abs/2507.01945
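The DGLM idea can be caricatured in a few lines: finished segments are compressed into summary vectors, and the current segment attends to the most relevant summaries. The shapes, the mean-pool compression, and the residual fusion below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def compress_segment(frames: np.ndarray) -> np.ndarray:
    # frames: (n_frames, dim) -> one summary vector per finished segment
    return frames.mean(axis=0)

def fuse_with_memory(current: np.ndarray, memory: list, top_k: int = 2):
    # current: (n_frames, dim); memory: list of (dim,) segment summaries
    if not memory:
        return current
    mem = np.stack(memory)                     # (n_segments, dim)
    rel = current.mean(axis=0) @ mem.T         # relevance of each summary
    picked = mem[np.argsort(rel)[-top_k:]]     # keep the most relevant ones
    attn = np.exp(current @ picked.T)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over picked entries
    return current + attn @ picked             # residual fusion

rng = np.random.default_rng(0)
memory = [compress_segment(rng.normal(size=(14, 8))) for _ in range(5)]
fused = fuse_with_memory(rng.normal(size=(14, 8)), memory)
print(fused.shape)                             # (14, 8)
```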
Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift toward grounding models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits of formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifiers model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature and has been well tested in interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.
https://arxiv.org/abs/2507.06335
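The words-as-classifiers model itself is easy to state in code: every word carries its own binary classifier over perceptual features, so "does word w apply to object x?" becomes a probability. A minimal sketch with scikit-learn, abstracting feature extraction away into plain vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class WACLexicon:
    def __init__(self):
        self.classifiers = {}

    def train_word(self, word, pos_feats, neg_feats):
        # positive examples: objects the word was used to describe
        X = np.vstack([pos_feats, neg_feats])
        y = np.r_[np.ones(len(pos_feats)), np.zeros(len(neg_feats))]
        self.classifiers[word] = LogisticRegression().fit(X, y)

    def applies(self, word, feats):
        # probability that `word` truthfully describes the object `feats`
        return self.classifiers[word].predict_proba(feats.reshape(1, -1))[0, 1]

rng = np.random.default_rng(1)
lex = WACLexicon()
lex.train_word("red", rng.normal(1.0, 0.3, (50, 4)), rng.normal(-1.0, 0.3, (50, 4)))
print(lex.applies("red", np.array([0.9, 1.1, 1.0, 0.8])))
```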
The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore, most papers that include impact evaluations present them only sketchily and focus instead on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.
https://arxiv.org/abs/2507.05973
There is growing excitement about building software verifiers, synthesizers, and other Automated Reasoning (AR) tools by combining traditional symbolic algorithms and Large Language Models (LLMs). Unfortunately, the current practice for constructing such neurosymbolic AR systems is an ad hoc programming model that does not have the strong guarantees of traditional symbolic algorithms, nor a deep enough synchronization of neural networks and symbolic reasoning to unlock the full potential of LLM-powered reasoning. I propose Neurosymbolic Transition Systems as a principled computational model that can underlie infrastructure for building neurosymbolic AR tools. In this model, symbolic state is paired with intuition, and state transitions operate over symbols and intuition in parallel. I argue why this new paradigm can scale logical reasoning beyond current capabilities while retaining the strong guarantees of symbolic algorithms, and I sketch out how the computational model I propose can be reified in a logic programming language.
https://arxiv.org/abs/2507.05886
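As a rough rendering of the proposed computational model (a toy of our own, built only from the abstract's description): a state pairs symbolic content with an intuition vector, and a transition updates both in parallel, with the neural half stubbed out by a fixed embedding function.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class NSState:
    symbols: frozenset        # e.g. proof obligations and derived facts
    intuition: np.ndarray     # neural summary that would guide search

def embed(symbols: frozenset) -> np.ndarray:
    # stand-in for an LLM/encoder producing intuition about the state
    rng = np.random.default_rng(abs(hash(tuple(sorted(symbols)))) % 2**32)
    return rng.normal(size=8)

def step(state: NSState, rule) -> NSState:
    new_symbols = rule(state.symbols)                 # symbolic transition
    new_intuition = 0.5 * state.intuition + 0.5 * embed(new_symbols)
    return NSState(new_symbols, new_intuition)        # both updated in parallel

s0 = NSState(frozenset({"goal: p -> p"}), embed(frozenset({"goal: p -> p"})))
s1 = step(s0, lambda syms: syms | {"intro h"})
print(s1.symbols)
```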
We motivate and outline a programme for a formal theory of measurement of artificial intelligence. We argue that formalising measurement for AI will allow researchers, practitioners, and regulators to: (i) make comparisons between systems and the evaluation methods applied to them; (ii) connect frontier AI evaluations with established quantitative risk analysis techniques drawn from engineering and safety science; and (iii) foreground how what counts as AI capability is contingent upon the measurement operations and scales we elect to use. We sketch a layered measurement stack, distinguish direct from indirect observables, and signpost how these ingredients provide a pathway toward a unified, calibratable taxonomy of AI phenomena.
https://arxiv.org/abs/2507.05587
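A toy data-structure reading of the layered stack (names and the aggregation rule are our own): direct observables hold raw measurements, and indirect constructs are explicit, inspectable functions of them, so what counts as "capability" is visible in the measurement operations.

```python
from dataclasses import dataclass

@dataclass
class DirectObservable:
    name: str
    values: list              # raw measurements, e.g. per-task pass = 1.0

@dataclass
class IndirectMeasure:
    name: str
    inputs: list              # the DirectObservables it is built from

    def score(self) -> float:
        # the aggregation rule is itself part of the measurement definition
        flat = [v for obs in self.inputs for v in obs.values]
        return sum(flat) / len(flat)

runs = DirectObservable("theorem_tasks_passed", [1, 0, 1, 1])
capability = IndirectMeasure("math_capability_v0", [runs])
print(capability.name, capability.score())
```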
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists the user in incorporating edge-case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses demonstrate the effectiveness of Co-DETECT.
https://arxiv.org/abs/2507.05010
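The iterative loop can be sketched schematically; llm_annotate and llm_induce_rule below are hypothetical stubs standing in for LLM calls, and the confidence threshold is an assumed flagging criterion, not the paper's exact one.

```python
def llm_annotate(text: str, codebook: list) -> tuple:
    # hypothetical stub: confidence rises as the codebook grows more specific
    return "negative", 0.3 + 0.15 * len(codebook)

def llm_induce_rule(examples: list) -> str:
    # hypothetical stub for "induce a generalizable edge-case description"
    return "Sarcastic praise counts as negative."

def refine_codebook(codebook, corpus, threshold=0.6, max_rounds=3):
    for _ in range(max_rounds):
        flagged = [t for t in corpus
                   if llm_annotate(t, codebook)[1] < threshold]  # edge cases
        if not flagged:
            break
        codebook = codebook + [llm_induce_rule(flagged)]  # expert vets this
    return codebook

print(refine_codebook(["praise -> positive"], ["Great, another Monday."]))
```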
Sound source localization (SSL) adds a spatial dimension to auditory perception, allowing a system to pinpoint the origin of speech, machinery noise, warning tones, or other acoustic events, capabilities that facilitate robot navigation, human-machine dialogue, and condition monitoring. While existing surveys provide valuable historical context, they typically address general audio applications and do not fully account for robotic constraints or the latest advancements in deep learning. This review addresses these gaps by offering a robotics-focused synthesis, emphasizing recent progress in deep learning methodologies. We start by reviewing classical methods such as Time Difference of Arrival (TDOA), beamforming, Steered-Response Power (SRP), and subspace analysis. Subsequently, we delve into modern machine learning (ML) and deep learning (DL) approaches, discussing traditional ML and neural networks (NNs), convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), and emerging attention-based architectures. We then explore data and training strategies, the two cornerstones of DL-based SSL. Studies are further categorized by robot types and application domains to help researchers identify relevant work for their specific contexts. Finally, we highlight current challenges in SSL regarding environmental robustness, sound-source multiplicity, and implementation constraints specific to robotics, as well as data and learning strategies in DL-based SSL, and we sketch promising directions to offer an actionable roadmap toward robust, adaptable, efficient, and explainable DL-based SSL for next-generation robots.
https://arxiv.org/abs/2507.01143
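Among the classical methods listed, TDOA is the easiest to show end to end. A compact GCC-PHAT delay estimator (standard textbook formulation, independent of the survey): estimate the inter-microphone delay from the phase of the cross-spectrum, which can then be converted to a bearing from the microphone spacing.

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    # delay of `sig` relative to `ref`, in seconds
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.abs(S) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(S, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
x = np.random.default_rng(0).normal(size=1600)   # broadband source signal
y = np.roll(x, 12)                               # simulate a 12-sample delay
print(gcc_phat(y, x, fs))                        # ~ 12 / 16000 = 7.5e-4 s
```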
Understanding Earth's subsurface is critical for energy transition, natural hazard mitigation, and planetary science. Yet subsurface analysis remains fragmented, with separate models required for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling, each tightly coupled to specific data distributions and task formulations. We introduce the Geological Everything Model 3D (GEM), a unified generative architecture that reformulates all these tasks as prompt-conditioned inference along latent structural frameworks derived from subsurface imaging. This formulation moves beyond task-specific models by enabling a shared inference mechanism, where GEM propagates human-provided prompts, such as well logs, masks, or structural sketches, along inferred structural frameworks to produce geologically coherent outputs. Through this mechanism, GEM achieves zero-shot generalization across tasks with heterogeneous prompt types, without retraining for new tasks or data sources. This capability emerges from a two-stage training process that combines self-supervised representation learning on large-scale field seismic data with adversarial fine-tuning using mixed prompts and labels across diverse subsurface tasks. GEM demonstrates broad applicability across surveys and tasks, including Martian radar stratigraphy analysis, structural interpretation in subduction zones, full seismic stratigraphic interpretation, geobody delineation, and property modeling. By bridging expert knowledge with generative reasoning in a structurally aware manner, GEM lays the foundation for scalable, human-in-the-loop geophysical AI, transitioning from fragmented pipelines to a vertically integrated, promptable reasoning system. Project page: this https URL
https://arxiv.org/abs/2507.00419
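One way to picture "propagating prompts along structural frameworks" (a toy of our own, not GEM's mechanism): treat a relative-geologic-time (RGT) volume as the framework and spread a sparse well log along iso-time surfaces, so the sparse prompt becomes a full property section.

```python
import numpy as np

def propagate_along_structure(rgt, well_col, well_vals):
    # rgt: (H, W) relative geologic time; well_col: column index of the well
    # well_vals: (H,) property log measured down that column
    well_rgt = rgt[:, well_col]
    order = np.argsort(well_rgt)
    # every sample inherits the log value at the matching geologic time
    return np.interp(rgt, well_rgt[order], well_vals[order])

H, W = 64, 128
depth = np.arange(H, dtype=float)[:, None]
rgt = depth + 8 * np.sin(np.linspace(0, 3, W))[None, :]   # folded layers
log = np.random.default_rng(0).normal(size=H).cumsum()    # synthetic well log
model = propagate_along_structure(rgt, well_col=10, well_vals=log)
print(model.shape)   # (64, 128) property section following the structure
```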
Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
https://arxiv.org/abs/2506.23918
We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes the dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) a huge modality gap between planar sketches and 3D priors in diffusion, and (3) performance degradation that is sensitive to sketch quality. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectations in a training-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
https://arxiv.org/abs/2506.23711
Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but struggle with multi-constraint scenarios. When constraints become numerous, object collisions frequently occur, forcing the removal of furniture items and compromising layout completeness. To address these limitations, we propose RoomCraft, a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes. Our approach combines a scene generation pipeline with a constraint-driven optimization framework. The pipeline first extracts high-level scene information from user inputs and organizes it into a structured format containing room type, furniture items, and spatial relations. It then constructs a spatial relationship network to represent furniture arrangements and generates an optimized placement sequence using a heuristic-based depth-first search (HDFS) algorithm to ensure layout coherence. To handle complex multi-constraint scenarios, we introduce a unified constraint representation that processes both formal specifications and natural language inputs, enabling flexible constraint-oriented adjustments through a comprehensive action space design. Additionally, we propose a Conflict-Aware Positioning Strategy (CAPS) that dynamically adjusts placement weights to minimize furniture collisions and ensure layout completeness. Extensive experiments demonstrate that RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts across diverse input modalities.
https://arxiv.org/abs/2506.22291
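The heuristic-based DFS can be illustrated with a toy grid placer (our simplification; the paper's heuristics, constraint space, and CAPS weighting are far richer): place items one by one, try the most promising cells first, and backtrack on collision.

```python
from itertools import product

def collides(pos, size, placed):
    (r, c), (h, w) = pos, size
    for (pr, pc), (ph, pw) in placed.values():
        if r < pr + ph and pr < r + h and c < pc + pw and pc < c + w:
            return True                               # rectangles overlap
    return False

def hdfs_place(items, grid=(8, 8), placed=None):
    placed = placed or {}
    if not items:
        return placed                                 # all items placed
    name, size = items[0]
    cells = sorted(product(range(grid[0] - size[0] + 1),
                           range(grid[1] - size[1] + 1)),
                   key=lambda rc: rc[0] + rc[1])      # heuristic: hug a corner
    for pos in cells:
        if not collides(pos, size, placed):
            result = hdfs_place(items[1:], grid, {**placed, name: (pos, size)})
            if result is not None:                    # success down this branch
                return result
    return None                                       # dead end: backtrack

layout = hdfs_place([("bed", (4, 3)), ("desk", (2, 3)), ("wardrobe", (3, 2))])
print(layout)
```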
We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child's drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, faithfully retaining the character's full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at this https URL
https://arxiv.org/abs/2506.21272
We spell out a definition of sentience that may be useful for designing and building it in machines. We propose that for sentience to be meaningful for AI, it must be fleshed out in functional, computational terms, in enough detail to allow for implementation. Yet, this notion of sentience must also reflect something essentially 'subjective', beyond just having the general capacity to encode perceptual content. For this specific functional notion of sentience to occur, we propose that certain sensory signals need to be both assertoric (persistent) and qualitative. To illustrate the definition in more concrete terms, we sketch out some ways it could be implemented with current technology. Understanding what it takes for artificial agents to be functionally sentient can also help us avoid creating them inadvertently, or at least realize in a timely manner that we have created them.
https://arxiv.org/abs/2506.20504
Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs, and once the training data becomes scarce, their generative performance degrades significantly. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs; the instructions derived through gradient-based optimization are then used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named the One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches in different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: this https URL
https://arxiv.org/abs/2506.15312
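The optimization setup can be sketched abstractly: freeze the generative model and learn only a continuous text-instruction embedding from the single photo-sketch pair. StubDenoiser below is a runnable stand-in for the frozen diffusion model, so everything except the overall training pattern is an assumption.

```python
import torch
from torch import nn

class StubDenoiser(nn.Module):   # placeholder for a frozen diffusion UNet
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Linear(dim * 2, dim)

    def forward(self, noisy, instruction):
        return self.net(torch.cat([noisy, instruction], dim=-1))

photo, sketch = torch.randn(1, 64), torch.randn(1, 64)   # toy latents
model = StubDenoiser()
for p in model.parameters():
    p.requires_grad_(False)                  # the model itself stays frozen

instruction = nn.Parameter(torch.zeros(1, 64))  # the only trainable part
opt = torch.optim.Adam([instruction], lr=1e-2)
for step in range(200):
    noise = torch.randn_like(photo)
    pred = model(photo + 0.1 * noise, instruction)
    loss = nn.functional.mse_loss(pred, sketch)  # pull the output to the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```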
Neural implicit shape representation has drawn significant attention in recent years due to its smoothness, differentiability, and topological flexibility. However, directly modeling the shape of a neural implicit surface, especially as the zero-level set of a neural signed distance function (SDF), with sparse geometric control is still a challenging task. Sparse input shape control typically includes 3D curve networks or, more generally, 3D curve sketches, which are unstructured, cannot be connected to form a curve network, and are therefore more difficult to deal with. While 3D curve networks or curve sketches provide intuitive shape control, their sparsity and varied topology pose challenges in generating high-quality surfaces that meet such curve constraints. In this paper, we propose NeuVAS, a variational approach to shape modeling using neural implicit surfaces constrained under sparse input shape control, including unstructured 3D curve sketches as well as connected 3D curve networks. Specifically, we introduce a smoothness term based on a functional of surface curvatures to minimize shape variation of the zero-level-set surface of a neural SDF. We also develop a new technique to faithfully model the G0 sharp feature curves specified in the input curve sketches. Comprehensive comparisons with state-of-the-art methods demonstrate the significant advantages of our method.
https://arxiv.org/abs/2506.13050
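For intuition, one standard instance of a curvature-based smoothness functional looks like the following (an illustrative form; the abstract does not give the paper's exact term):

```latex
E_{\mathrm{smooth}}(f)
  = \int_{S} \bigl(\kappa_1^{2} + \kappa_2^{2}\bigr)\,\mathrm{d}A,
\qquad
S = \{\, x : f(x) = 0 \,\},
```

where the principal curvatures \(\kappa_1, \kappa_2\) are the nonzero eigenvalues of the shape operator \(\nabla\bigl(\nabla f / \lVert \nabla f \rVert\bigr)\) restricted to the tangent plane. For an exact SDF, \(\lVert \nabla f \rVert = 1\) and \(\kappa_1 + \kappa_2 = \Delta f\), so penalties of this kind can be evaluated on a neural SDF via automatic differentiation; this is a standard identity, not necessarily the term NeuVAS uses.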
Reconstructing a 3D point cloud from a given conditional sketch is challenging. Existing methods often work directly in 3D space, but domain variability and the difficulty of reconstructing accurate 3D structures from 2D sketches remain significant obstacles. Moreover, ideal models should also accept prompts for control in addition to the sparse sketch, which poses challenges in multi-modal fusion. We propose DiffS-NOCS (Diffusion-based Sketch-to-NOCS Map), which leverages ControlNet with a modified multi-view decoder to generate NOCS maps with embedded 3D structure and position information in 2D space from sketches. The 3D point cloud is reconstructed by combining multiple NOCS maps from different views. To enhance sketch understanding, we integrate a viewpoint encoder for extracting viewpoint features. Additionally, we design a feature-level multi-view aggregation network as the denoising module, facilitating cross-view information exchange and improving 3D consistency in NOCS map generation. Experiments on ShapeNet demonstrate that DiffS-NOCS achieves controllable and fine-grained point cloud reconstruction aligned with sketches.
https://arxiv.org/abs/2506.12835
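The NOCS-map representation makes the final reconstruction step almost trivial: each foreground pixel stores a coordinate in the normalized object frame, so the point cloud is the union of valid pixels across views. A minimal sketch with synthetic maps standing in for the model's outputs:

```python
import numpy as np

def nocs_to_points(nocs_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # nocs_map: (H, W, 3) with values in [0, 1]; mask: (H, W) foreground flags
    return nocs_map[mask] - 0.5            # recentre normalized coordinates

def fuse_views(maps, masks):
    # the multi-view fusion is just a union of the per-view point sets
    return np.concatenate([nocs_to_points(m, k) for m, k in zip(maps, masks)])

rng = np.random.default_rng(0)
maps = [rng.uniform(0, 1, (32, 32, 3)) for _ in range(3)]
masks = [rng.uniform(size=(32, 32)) > 0.7 for _ in range(3)]
cloud = fuse_views(maps, masks)
print(cloud.shape)        # (n_points, 3), merged across the three views
```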
Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces DSP+, an improved version of the Draft, Sketch, and Prove framework, featuring a fine-grained and integrated neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7%, 32.8%, and 24 out of 644 problems from miniF2F, ProofNet, and PutnamBench, respectively, while requiring a smaller budget than state-of-the-art systems. DSP+ proves imo_2019_p1, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible to human experts, facilitating the identification of formalization errors; for example, eight wrongly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns besides RL-based training. All components will be open-sourced.
https://arxiv.org/abs/2506.11487
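In Lean 4, a sketch in the Draft, Sketch, and Prove sense has a recognizable shape: drafted subgoals become have steps, and each sorry marks where a step prover or a symbolic method like Aesop would be invoked. A toy statement of our own (not from miniF2F):

```lean
import Mathlib

-- Drafted subgoals appear as `have` steps; `sorry` marks where a
-- step prover (or Aesop) would be called to discharge each one.
theorem toy_sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry   -- subgoal 1: a square is nonnegative
  have h2 : 0 ≤ b ^ 2 := by sorry   -- subgoal 2: a square is nonnegative
  exact add_nonneg h1 h2            -- combine the two bounds
```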
In real-world scenarios, person re-identification (ReID) expects to identify a person of interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset also offers markedly greater diversity, for example in painting perspectives and textual information, and could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at this https URL.
https://arxiv.org/abs/2506.09385
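The any-subset query interface can be sketched as follows (a rough simplification; ReID5o's multi-expert routing is more elaborate): per-modality encoders map inputs into one shared space, and whatever modalities the user supplies are fused and matched against the gallery.

```python
import torch
from torch import nn

class AnyModalEncoder(nn.Module):
    def __init__(self, dims: dict, shared: int = 32):
        super().__init__()
        # one encoder per modality, all projecting into the shared space
        self.encoders = nn.ModuleDict({m: nn.Linear(d, shared)
                                       for m, d in dims.items()})

    def forward(self, query: dict) -> torch.Tensor:
        feats = [self.encoders[m](x) for m, x in query.items()]
        fused = torch.stack(feats).mean(dim=0)   # fuse available modalities
        return nn.functional.normalize(fused, dim=-1)

enc = AnyModalEncoder({"rgb": 512, "sketch": 256, "text": 128})
q = {"sketch": torch.randn(1, 256), "text": torch.randn(1, 128)}
gallery = nn.functional.normalize(torch.randn(100, 32), dim=-1)
scores = enc(q) @ gallery.T                      # rank gallery identities
print(scores.argmax().item())
```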
In the architectural design process, floorplan design is often dynamic and iterative. Architects progressively draw various parts of the floorplan according to their ideas and requirements, continuously adjusting and refining throughout the design process. Therefore, the ability to predict a complete floorplan from a partial one holds significant value: such prediction can help architects quickly generate preliminary designs, improve design efficiency, and reduce the workload associated with repeated modifications. To address this need, we propose FloorplanMAE, a self-supervised learning framework for restoring incomplete floor plans into complete ones. First, we develop a floor plan reconstruction dataset, FloorplanNet, built specifically from architectural floor plans. Second, we propose a floor plan reconstruction method based on Masked Autoencoders (MAE), which reconstructs missing parts by masking sections of the floor plan and training a lightweight Vision Transformer (ViT). We evaluated the reconstruction accuracy of FloorplanMAE and compared it with state-of-the-art benchmarks, and additionally validated the model using real sketches from the early stages of architectural design. Experimental results show that FloorplanMAE can generate high-quality complete floor plans from incomplete partial plans. This framework provides a scalable solution for floor plan generation, with broad application prospects.
https://arxiv.org/abs/2506.08363
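The core MAE mechanism FloorplanMAE builds on fits in a few lines: hide a random subset of patch tokens, encode only the visible ones, and compute the reconstruction loss on the hidden ones. Patchify and ViT details are reduced to linear layers here; the real method uses a transformer encoder/decoder with learned mask tokens.

```python
import torch
from torch import nn

def random_masking(n_tokens: int, mask_ratio: float = 0.75):
    perm = torch.randperm(n_tokens)
    n_keep = int(n_tokens * (1 - mask_ratio))
    return perm[:n_keep], perm[n_keep:]          # visible idx, masked idx

patches = torch.randn(1, 64, 16)                 # 64 plan patches, 16-dim each
vis_idx, mask_idx = random_masking(patches.shape[1])
encoder = nn.Linear(16, 32)                      # stand-in for the light ViT
decoder = nn.Linear(32, 16)
latent = encoder(patches[:, vis_idx])            # encode visible tokens only
context = latent.mean(dim=1, keepdim=True)       # crude global summary
pred = decoder(context.expand(-1, len(mask_idx), -1))
loss = nn.functional.mse_loss(pred, patches[:, mask_idx])  # loss on masked
print(loss.item())
```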