Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs, and the instructions derived through gradient-based optimization are then used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot setting. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: this https URL
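To make the instruction-optimization idea concrete, the sketch below shows gradient-based optimization of a learnable conditioning vector against a standard denoising objective on a single photo-sketch pair. The `ToyDenoiser`, the flattened tensors, and the noise schedule are stand-ins invented for illustration; the paper presumably optimizes text instructions on a pretrained text-to-image diffusion model, whose details are not given in the abstract.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in noise predictor conditioned on a photo and an instruction embedding."""
    def __init__(self, img_dim=64, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * img_dim + embed_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, img_dim),
        )
    def forward(self, x_noisy, photo, t, instr):
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_noisy, photo, instr, t_feat], dim=-1))

denoiser = ToyDenoiser()                            # frozen pretrained model in practice
photo  = torch.randn(1, 64)                         # flattened face photo stand-in
sketch = torch.randn(1, 64)                         # its paired sketch stand-in
instr  = torch.zeros(1, 32, requires_grad=True)     # learnable "text instruction"
opt = torch.optim.Adam([instr], lr=1e-2)

for step in range(200):
    t = torch.randint(0, 1000, (1,))
    a = (1.0 - t.float() / 1000.0).unsqueeze(-1)    # toy noise schedule
    noise = torch.randn_like(sketch)
    x_noisy = a.sqrt() * sketch + (1 - a).sqrt() * noise
    loss = ((denoiser(x_noisy, photo, t, instr) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# At inference, the optimized `instr` conditions the frozen diffusion model to
# turn unseen photos into sketches.
```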
https://arxiv.org/abs/2506.15312
Neural implicit shape representation has drawn significant attention in recent years due to its smoothness, differentiability, and topological flexibility. However, directly modeling the shape of a neural implicit surface, especially as the zero-level set of a neural signed distance function (SDF), with sparse geometric control is still a challenging task. Sparse input shape control typically includes 3D curve networks or, more generally, 3D curve sketches, which are unstructured, cannot be connected to form a curve network, and are therefore more difficult to deal with. While 3D curve networks or curve sketches provide intuitive shape control, their sparsity and varied topology pose challenges in generating high-quality surfaces that meet such curve constraints. In this paper, we propose NeuVAS, a variational approach to shape modeling using neural implicit surfaces constrained under sparse input shape control, including unstructured 3D curve sketches as well as connected 3D curve networks. Specifically, we introduce a smoothness term based on a functional of surface curvatures to minimize shape variation of the zero-level set surface of a neural SDF. We also develop a new technique to faithfully model G0 sharp feature curves as specified in the input curve sketches. Comprehensive comparisons with state-of-the-art methods demonstrate the significant advantages of our method.
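As a rough illustration of the kind of variational objective involved, the sketch below combines a curve-constraint term (the SDF must vanish on the input curve samples) with a curvature-based smoothness term using the mean curvature of the level sets, div(∇f/|∇f|), plus an eikonal regularizer. The tiny MLP, sampling scheme, and loss weights are assumptions for illustration; NeuVAS's actual curvature functional, G0 feature handling, and weighting are not specified in the abstract.

```python
import torch
import torch.nn as nn

# Tiny neural SDF stand-in; the real network and training loop are more involved.
sdf = nn.Sequential(nn.Linear(3, 64), nn.Softplus(beta=100),
                    nn.Linear(64, 64), nn.Softplus(beta=100),
                    nn.Linear(64, 1))

def grad(y, x):
    return torch.autograd.grad(y.sum(), x, create_graph=True)[0]

def mean_curvature(x):
    """Mean curvature of the SDF's level sets: div( grad f / |grad f| )."""
    g = grad(sdf(x), x)
    n = g / (g.norm(dim=-1, keepdim=True) + 1e-8)
    div = sum(torch.autograd.grad(n[:, i].sum(), x, create_graph=True)[0][:, i]
              for i in range(3))
    return div

curve_pts = (torch.rand(256, 3) - 0.5).requires_grad_(True)   # samples on the input curve sketches
space_pts = (torch.rand(1024, 3) - 0.5).requires_grad_(True)  # samples in the modeling domain

curve_loss  = sdf(curve_pts).abs().mean()              # zero level set must pass through the curves
smooth_loss = mean_curvature(space_pts).pow(2).mean()  # curvature-based shape-variation penalty
eik_loss    = ((grad(sdf(space_pts), space_pts).norm(dim=-1) - 1.0) ** 2).mean()
loss = curve_loss + 0.1 * smooth_loss + 0.1 * eik_loss  # weights are illustrative only
loss.backward()
```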
https://arxiv.org/abs/2506.13050
Reconstructing a 3D point cloud from a given conditional sketch is challenging. Existing methods often work directly in 3D space, but domain variability and the difficulty of reconstructing accurate 3D structures from 2D sketches remain significant obstacles. Moreover, ideal models should also accept prompts for control in addition to the sparse sketch, which poses challenges for multi-modal fusion. We propose DiffS-NOCS (Diffusion-based Sketch-to-NOCS Map), which leverages ControlNet with a modified multi-view decoder to generate NOCS maps with embedded 3D structure and position information in 2D space from sketches. The 3D point cloud is reconstructed by combining multiple NOCS maps from different views. To enhance sketch understanding, we integrate a viewpoint encoder for extracting viewpoint features. Additionally, we design a feature-level multi-view aggregation network as the denoising module, facilitating cross-view information exchange and improving 3D consistency in NOCS map generation. Experiments on ShapeNet demonstrate that DiffS-NOCS achieves controllable and fine-grained point cloud reconstruction aligned with sketches.
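Since a NOCS map stores, per foreground pixel, that point's coordinates in a shared canonical object space, reconstructing the point cloud from several views amounts to masking and concatenating those coordinates. The minimal sketch below (the function name, array shapes, and the [0, 1] to [-0.5, 0.5] mapping are assumptions) shows that fusion step; the paper's pipeline may additionally filter, deduplicate, or weight points across views.

```python
import numpy as np

def nocs_maps_to_point_cloud(nocs_maps, masks):
    """Fuse per-view NOCS maps into one point cloud.

    nocs_maps: list of (H, W, 3) arrays with normalized object coordinates in [0, 1].
    masks:     list of (H, W) boolean arrays marking foreground pixels.
    Because NOCS coordinates live in a shared canonical object space, points from
    different views can simply be concatenated (no camera poses are needed).
    """
    points = []
    for nocs, mask in zip(nocs_maps, masks):
        points.append(nocs[mask] - 0.5)     # map [0, 1] -> [-0.5, 0.5] canonical cube
    return np.concatenate(points, axis=0)

# toy usage with two random "views"
maps  = [np.random.rand(64, 64, 3) for _ in range(2)]
masks = [np.random.rand(64, 64) > 0.5 for _ in range(2)]
cloud = nocs_maps_to_point_cloud(maps, masks)   # (N, 3)
```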
https://arxiv.org/abs/2506.12835
Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces \textbf{DSP+}, an improved version of the Draft, Sketch, and Prove framework, featuring a \emph{fine-grained and integrated} neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7\%, 32.8\%, and 24 out of 644 problems from miniF2F, ProofNet, and PutnamBench, respectively, while requiring a smaller budget than state-of-the-art approaches. DSP+ proves \texttt{imo\_2019\_p1}, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible to human experts, facilitating the identification of formalization errors; for example, eight wrongly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns beyond RL-based training. All components will be open-sourced.
https://arxiv.org/abs/2506.11487
In real-world scenarios, person re-identification (ReID) aims to identify a person-of-interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset also offers significant diversity, for example in painting perspectives and textual information, and can serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at this https URL.
https://arxiv.org/abs/2506.09385
In the architectural design process, floorplan design is often a dynamic and iterative process. Architects progressively draw various parts of the floorplan according to their ideas and requirements, continuously adjusting and refining throughout the design process. Therefore, the ability to predict a complete floorplan from a partial one holds significant value in the design process. Such prediction can help architects quickly generate preliminary designs, improve design efficiency, and reduce the workload associated with repeated modifications. To address this need, we propose FloorplanMAE, a self-supervised learning framework for restoring incomplete floor plans into complete ones. First, we develop FloorplanNet, a floor plan reconstruction dataset built specifically from architectural floor plans. Second, we propose a floor plan reconstruction method based on Masked Autoencoders (MAE), which reconstructs missing parts by masking sections of the floor plan and training a lightweight Vision Transformer (ViT). We evaluate the reconstruction accuracy of FloorplanMAE and compare it with state-of-the-art benchmarks. Additionally, we validate the model using real sketches from the early stages of architectural design. Experimental results show that FloorplanMAE can generate high-quality complete floor plans from incomplete partial plans. This framework provides a scalable solution for floor plan generation, with broad application prospects.
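A minimal sketch of the MAE-style masking the framework builds on: a random subset of floor-plan patch tokens is kept visible, the rest are dropped, and the reconstruction loss is computed only on the masked patches. The tensor shapes and the stand-in decoder output are hypothetical; FloorplanMAE's actual patching, masking ratio, and ViT configuration are not specified in the abstract.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens, as in MAE-style pretraining.

    patches: (B, N, D) patch embeddings of a rasterized floor plan.
    Returns the visible tokens, the binary mask (1 = removed), and the indices
    needed to restore the original patch order.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # random score per patch
    ids_shuffle = noise.argsort(dim=1)                # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to the original order
    return visible, mask, ids_restore

patches = torch.randn(2, 196, 256)                    # e.g. 14x14 patches of a floor plan
visible, mask, ids_restore = random_masking(patches)
pred = torch.randn(2, 196, 256)                       # stand-in for the ViT decoder output
# Reconstruction loss is computed only on the masked patches:
loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
```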
https://arxiv.org/abs/2506.08363
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
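The following toy loop illustrates the general shape of training-free, frame-level guidance: at each denoising step, the model's clean-latent estimate for a controlled frame is compared against the target signal, and the gradient of that loss steers the latents. The linear stand-ins, noise schedule, re-noising rule, and step size are all invented for illustration and are much simpler than Frame Guidance's actual latent processing and optimization strategy.

```python
import torch
import torch.nn as nn

# Frozen toy stand-ins for a latent video diffusion model.
denoise = nn.Linear(8, 8)     # predicts a clean latent from a noisy one (toy)
decode  = nn.Linear(8, 16)    # maps a frame latent to "pixels" (toy)

num_frames, latent_dim = 16, 8
latents = torch.randn(num_frames, latent_dim)
key_idx = 0                                   # the frame constrained by the signal
key_target = torch.randn(16)                  # e.g. an encoded keyframe, sketch, or depth map
sigmas = torch.linspace(1.0, 0.05, 20)        # toy descending noise schedule
step_size = 0.5                               # guidance strength (illustrative)

for sigma in sigmas:
    latents = latents.detach().requires_grad_(True)
    x0_pred = denoise(latents)                # current clean-video estimate
    # Frame-level guidance: decode only the controlled frame and compare it to the
    # target signal; decoding a single frame keeps memory usage low.
    g_loss = ((decode(x0_pred[key_idx]) - key_target) ** 2).mean()
    grad = torch.autograd.grad(g_loss, latents)[0]
    with torch.no_grad():
        latents = x0_pred - step_size * grad                   # steer toward the signal
        latents = latents + sigma * torch.randn_like(latents)  # crude re-noising for the next step
```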
https://arxiv.org/abs/2506.07177
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at this https URL to support future work on building more general, open-ended, and creative reasoning systems.
https://arxiv.org/abs/2506.06211
We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules, which often compromise perceptual consistency, our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at this http URL
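The core idea, noise levels chosen so that SSIM-measured degradation falls off linearly across timesteps rather than following a fixed schedule, can be illustrated with a small probe-and-invert routine. The function below and its parameters (sigma range, probe grid, the clipping) are assumptions, not the paper's exact sigma-space transformation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def ssim_balanced_sigmas(image, sigma_min=0.01, sigma_max=2.0, n_steps=10, n_probe=200, seed=0):
    """Pick noise levels so SSIM(clean, noisy) decreases roughly linearly over timesteps.

    `image` is a float array in [0, 1]. SSIM is probed on a dense sigma grid and the
    (approximately monotone) curve is inverted at linearly spaced SSIM targets.
    """
    rng = np.random.default_rng(seed)
    grid = np.geomspace(sigma_min, sigma_max, n_probe)
    noise = rng.standard_normal(image.shape)
    curve = np.array([ssim(image, np.clip(image + s * noise, 0, 1), data_range=1.0)
                      for s in grid])
    targets = np.linspace(curve[0], curve[-1], n_steps)   # linearly spaced SSIM values
    # np.interp needs an increasing xp, and SSIM decreases with sigma, so reverse both.
    return np.interp(targets, curve[::-1], grid[::-1])    # one sigma per timestep (low to high)

img = np.random.rand(64, 64)
sigmas = ssim_balanced_sigmas(img)   # noise levels giving ~uniform perceptual difficulty per step
```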
https://arxiv.org/abs/2506.04283
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
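A minimal sketch of the cycle-consistency objective behind the contrastive random walk, here for a single cross-modal hop: features from modality A step to modality B and back, and the round-trip transition matrix is pushed toward the identity. The feature shapes, the temperature, and the assumption that both feature maps cover the same N locations are illustrative; the full method also learns intra-modal steps.

```python
import torch
import torch.nn.functional as F

def cross_modal_cycle_loss(feat_a, feat_b, tau=0.07):
    """Walk A -> B -> A through softmax affinity matrices and score the round trip.

    feat_a, feat_b: (N, D) features for the same N spatial locations in the two
    modalities (e.g. RGB and depth). No photo-consistency is assumed; only the
    cycle back to the starting location supervises the features.
    """
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    a2b = F.softmax(feat_a @ feat_b.t() / tau, dim=-1)   # transition probabilities A -> B
    b2a = F.softmax(feat_b @ feat_a.t() / tau, dim=-1)   # transition probabilities B -> A
    round_trip = a2b @ b2a                               # A -> B -> A
    targets = torch.arange(feat_a.size(0))               # each point should return to itself
    return F.nll_loss(torch.log(round_trip + 1e-8), targets)

loss = cross_modal_cycle_loss(torch.randn(128, 64), torch.randn(128, 64))
```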
https://arxiv.org/abs/2506.03148
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes and then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
https://arxiv.org/abs/2506.01782
Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs' challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs' capacity to interpret the TP's feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.
https://arxiv.org/abs/2505.24264
As sketch research has collectively matured over time, its adaptation for mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient lightweight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-and-play manner on any efficient photo network to adapt it to sketch data. We specifically choose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, as it is the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96% and 84.89%, respectively. We then exploit the abstract trait of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts down the number of FLOPs by two thirds. The end result is an overall reduction of 99.37% in FLOPs (from 40.18G to 0.254G) compared with the full network, while retaining accuracy (33.03% vs 32.77%), finally yielding an efficient network for sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
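For reference, the quoted overall reduction checks out: 1 - 0.254/40.18 ≈ 0.9937, i.e. 99.37%. The snippet below sketches the first component, cross-modal knowledge distillation from a frozen photo network to an efficient sketch-side student, with both pointwise and relational alignment terms. The linear stand-ins, temperature, and loss mix are assumptions; the paper's distillation network and its RL-based canvas selector are not detailed in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: a frozen photo-branch teacher and a lightweight sketch-branch student.
teacher = nn.Linear(512, 128)     # high-capacity photo network in practice (frozen)
student = nn.Linear(512, 128)     # efficient network being adapted to sketch data

photos   = torch.randn(32, 512)   # paired photo features/rasters (toy)
sketches = torch.randn(32, 512)   # corresponding sketches (toy)

with torch.no_grad():
    t_emb = F.normalize(teacher(photos), dim=-1)
s_emb = F.normalize(student(sketches), dim=-1)

# Cross-modal distillation: the student's sketch embeddings should mimic the
# teacher's photo embeddings both pointwise and in their similarity structure.
point_loss = (1 - (s_emb * t_emb).sum(dim=-1)).mean()         # cosine alignment per pair
rel_loss = F.kl_div(F.log_softmax(s_emb @ s_emb.t() / 0.1, dim=-1),
                    F.softmax(t_emb @ t_emb.t() / 0.1, dim=-1),
                    reduction="batchmean")
kd_loss = point_loss + rel_loss
```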
https://arxiv.org/abs/2505.23763
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
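One simple way to picture the balance between the two conditioning streams is to attend to sketch tokens and text tokens separately and blend the results with a controllable bias, as in the sketch below. The class name, the single scalar `alpha`, and the token shapes are assumptions for illustration; HiGarment's harmonized cross-attention balances the streams dynamically rather than with a fixed scalar.

```python
import torch
import torch.nn as nn

class HarmonizedCrossAttention(nn.Module):
    """Attend to flat-sketch tokens and text tokens separately, then blend.

    alpha = 1.0 gives a sketch-aligned (image-biased) output, alpha = 0.0 a
    text-guided (text-biased) one. Illustrative stand-in, not the paper's module.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_text   = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, sketch_tokens, text_tokens, alpha=0.5):
        out_s, _ = self.attn_sketch(x, sketch_tokens, sketch_tokens)
        out_t, _ = self.attn_text(x, text_tokens, text_tokens)
        return x + alpha * out_s + (1.0 - alpha) * out_t

layer = HarmonizedCrossAttention()
x = torch.randn(2, 256, 64)               # latent image tokens
sketch_tokens = torch.randn(2, 77, 64)    # encoded flat sketch
text_tokens = torch.randn(2, 77, 64)      # encoded text prompt
sketch_biased = layer(x, sketch_tokens, text_tokens, alpha=0.9)
text_biased   = layer(x, sketch_tokens, text_tokens, alpha=0.1)
```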
https://arxiv.org/abs/2505.23186
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the \textit{full model} KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by $\approx 30\%$ while achieving state-of-the-art performance on downstream tasks.
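The Kronecker-factored idea can be shown on a layer small enough to materialize: approximate the layer's Hessian as A ⊗ G (up to the vec convention), where A is the second moment of the layer inputs and G that of the output gradients. The stand-in objective, sample counts, and the closing comment about rounding are illustrative; YAQA's sketches are taken with respect to the full-model KL divergence and feed a specific rounding algorithm with guarantees.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8, bias=False)
A = torch.zeros(16, 16)   # running estimate of E[x x^T]  (input factor)
G = torch.zeros(8, 8)     # running estimate of E[g g^T]  (output-gradient factor)
n_batches = 256

for _ in range(n_batches):
    x = torch.randn(4, 16)
    y = layer(x)
    loss = (y ** 2).mean()                  # stand-in for the full-model objective
    (g,) = torch.autograd.grad(loss, y)     # gradient of the objective w.r.t. the layer output
    A += x.t() @ x / x.size(0)
    G += g.t() @ g / g.size(0)

A /= n_batches
G /= n_batches
H_approx = torch.kron(A, G)                 # (128, 128) Kronecker-factored Hessian proxy
# An adaptive rounding step can then score candidate roundings by the quadratic
# form dw^T H_approx dw instead of the immediate activation error alone.
```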
https://arxiv.org/abs/2505.22988
This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting (3DGS) optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gap between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, guiding the vector graphics with coarse shapes at the initial stages and finer details at later stages. We also improve view-dependent occlusion handling by devising a visibility-aware rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method across different abstraction levels of detail, cross-view consistency, and occlusion-aware stroke culling.
https://arxiv.org/abs/2505.21377
3D vector graphics play a crucial role in various applications including 3D shape retrieval, conceptual design, and virtual reality interactions due to their ability to capture essential structural information with minimal representation. While recent approaches have shown promise in generating 3D vector graphics, they often suffer from lengthy processing times and struggle to maintain view consistency. To address these limitations, we propose ViewCraft3D (VC3D), an efficient method that leverages 3D priors to generate 3D vector graphics. Specifically, our approach begins with 3D object analysis, employs a geometric extraction algorithm to fit 3D vector graphics to the underlying structure, and applies view-consistent refinement process to enhance visual quality. Our comprehensive experiments demonstrate that VC3D outperforms previous methods in both qualitative and quantitative evaluations, while significantly reducing computational overhead. The resulting 3D sketches maintain view consistency and effectively capture the essential characteristics of the original objects.
https://arxiv.org/abs/2505.19492
The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time-consuming, expert-driven methods with limited feedback. This research introduces a sketch-based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM-assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch-based virtual assistant that simplifies annotation for non-experts and advances LLM-driven labeling tools in terms of scalability, accessibility, and explainability.
https://arxiv.org/abs/2505.19419
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools.
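The outcome-based reward is the simplest piece to make concrete: only the final answer is scored against the ground truth, with no supervision on the interleaved visual steps. The normalization and exact-match rule below are assumptions; VTool-R1's reward for structured VQA may include task-specific parsing or format checks.

```python
def outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    """Score only the final answer, not the intermediate text/visual reasoning steps."""
    normalize = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if normalize(predicted_answer) == normalize(ground_truth) else 0.0

assert outcome_reward("  42 ", "42") == 1.0
assert outcome_reward("forty-two", "42") == 0.0
```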
https://arxiv.org/abs/2505.19255
Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, with over 3$\times$ speedup for Wan at nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
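The sketching/rendering split can be pictured as a single sampling loop that hands off from the large model to the small one once the noise level drops below a threshold. The linear stand-ins, the schedule, the hand-off point, and the first-order update are illustrative assumptions rather than SRDiffusion's actual sampler.

```python
import torch
import torch.nn as nn

large_model = nn.Linear(32, 32)   # stands in for the full DiT ("Sketching": semantics, motion)
small_model = nn.Linear(32, 32)   # stands in for a smaller DiT ("Rendering": visual detail)

sigmas = torch.linspace(1.0, 0.0, 31)[:-1]   # descending noise levels (final zero dropped)
switch_sigma = 0.4                           # hand-off point between the two models (tunable)
x = torch.randn(4, 32)                       # toy stand-in for noisy video latents

with torch.no_grad():
    for i, sigma in enumerate(sigmas):
        model = large_model if sigma > switch_sigma else small_model
        x0_pred = model(x)                   # the chosen model's clean-latent estimate
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else torch.tensor(0.0)
        x = x0_pred + next_sigma * (x - x0_pred) / sigma   # simple first-order step
```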
https://arxiv.org/abs/2505.19151