We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimization. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.
https://arxiv.org/abs/2604.14302
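A minimal sketch of what a sparse correspondence supervision loss of this kind could look like in PyTorch, assuming SfM tracks arrive as cross-view pixel pairs and the denoiser exposes per-view feature maps; the tensor layout and the cosine objective are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a CSL-style loss: features sampled at SfM-matched pixels
# across views should agree. Layout and objective are assumptions.
import torch
import torch.nn.functional as F

def correspondence_loss(feats, matches):
    """feats: (V, C, H, W) per-view feature maps from the denoiser.
    matches: (M, 6) rows of [view_i, view_j, x_i, y_i, x_j, y_j], with
    pixel coordinates normalized to [-1, 1], e.g. from SfM tracks."""
    vi = matches[:, 0].long()
    vj = matches[:, 1].long()
    # grid_sample expects (N, H_out, W_out, 2) sampling grids.
    grid_i = matches[:, 2:4].view(-1, 1, 1, 2)
    grid_j = matches[:, 4:6].view(-1, 1, 1, 2)
    f_i = F.grid_sample(feats[vi], grid_i, align_corners=True).squeeze(-1).squeeze(-1)
    f_j = F.grid_sample(feats[vj], grid_j, align_corners=True).squeeze(-1).squeeze(-1)
    # Corresponding 3D points should map to similar features across views.
    return (1 - F.cosine_similarity(f_i, f_j, dim=1)).mean()
```

Sampling only at matched locations keeps the supervision sparse: only SfM-verified pixels contribute gradients, which matches the "derived from Structure-from-Motion reconstructions" framing above.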
Text-to-image (T2I) systems enable rapid generation of high-fidelity imagery but are misaligned with how visual ideas develop. T2I systems make implicit visual decisions on behalf of the user, often introduce fine-grained details that can anchor users prematurely and limit their ability to keep options open early on, and cause unintended changes during editing that are difficult to correct and reduce users' sense of control. To address these concerns, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs, exposing intermediary abstractions where users can make incremental changes. Owing to their provisional nature, sketch-like abstractions invite user editing and allow users to keep design options open while ideas are still forming. Each stage in Creo can be modified with manual changes and AI-assisted operations, enabling fine-grained, step-wise control through a locking mechanism that preserves prior decisions so subsequent edits affect only specified regions or attributes. Users remain in the loop, making and verifying decisions across stages, while the system applies diffs instead of regenerating full images, reducing drift as fidelity increases. A comparative study with a one-shot baseline shows that participants felt stronger ownership over Creo outputs, as they were able to trace their decisions in building up the image. Furthermore, embedding-based analysis indicates that Creo outputs are less homogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.
https://arxiv.org/abs/2604.13956
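The diff-plus-locking behavior can be pictured as a masked update. This is a hypothetical sketch, assuming per-stage raster images and a boolean lock mask; the paper does not publish its internals.

```python
# Hypothetical sketch of Creo-style decision locking: an edit is applied as a
# diff restricted to unlocked regions, so locked pixels carry over unchanged.
import numpy as np

def apply_locked_diff(current, proposed, locked_mask):
    """current, proposed: (H, W, 3) uint8 images for one fidelity stage.
    locked_mask: (H, W) bool, True where the user froze a decision."""
    # Broadcast the mask over channels; locked pixels keep their old values.
    return np.where(locked_mask[..., None], current, proposed)
```

Applying only the unlocked delta, rather than regenerating the full image, is what lets earlier decisions survive as fidelity increases.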
The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative "sketch-reconstruct-sketch" workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to "draw in 3D" without the rigid constraints of traditional CAD.
https://arxiv.org/abs/2604.13549
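The graph-based BFS masking idea lends itself to a compact sketch: starting from a seed component of the drawing graph, reveal depth cues for a BFS-connected subset and mask the rest, simulating the partial cues of an iterative sketch-reconstruct-sketch loop. The graph construction and reveal ratio below are assumptions for illustration.

```python
# Minimal BFS masking sketch over a stroke/segment adjacency graph.
from collections import deque
import random

def bfs_partial_mask(adjacency, reveal_frac=0.4, seed=None):
    """adjacency: dict node -> list of neighbor nodes.
    Returns the set of nodes whose depth cues stay visible."""
    nodes = list(adjacency)
    start = seed if seed is not None else random.choice(nodes)
    budget = max(1, int(reveal_frac * len(nodes)))
    visible, queue = {start}, deque([start])
    while queue and len(visible) < budget:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in visible and len(visible) < budget:
                visible.add(nxt)
                queue.append(nxt)
    return visible
```

Because BFS reveals a spatially coherent patch rather than random pixels, the masked training signal resembles a user who has reconstructed one region and is still sketching the rest.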
LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
https://arxiv.org/abs/2604.12948
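An illustrative sketch of dual-trace encoding: every stored fact is paired with a concrete scene trace narrating when and how it was learned. The record schema, prompt wording, and `llm` callable are assumptions standing in for any chat model, not the paper's actual code.

```python
# Hedged sketch of dual-trace memory encoding.
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    fact: str          # distilled factual content
    scene_trace: str   # narrative reconstruction of the learning moment
    session_id: str

def encode_dual_trace(llm, utterance, session_id):
    fact = llm(f"Extract the single key fact from: {utterance}")
    # Force commitment to specific contextual details at encoding time.
    trace = llm(
        "Reconstruct the concrete moment this was learned, committing to "
        f"specific contextual details (who, when, where): {utterance}"
    )
    return MemoryRecord(fact=fact, scene_trace=trace, session_id=session_id)
```

The trace is what differentiates otherwise similar facts at retrieval time, which is consistent with the gains the abstract reports on temporal and update-tracking questions.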
Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain's intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.
https://arxiv.org/abs/2604.12865
Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model (LLM)-based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at this https URL.
https://arxiv.org/abs/2604.12282
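A toy sketch of the step-by-step reading paradigm: rather than serializing a whole sheet, scan it in windows and build a structural sketch plus per-window summaries that a solving stage can reason over. The window size and summary fields are assumptions, not the paper's representation.

```python
# Hedged sketch of building a structural sketch from raw spreadsheet rows.
def structural_sketch(rows, window=50):
    """rows: list of lists of cell values (one window summarized at a time)."""
    sketch = {"n_rows": len(rows),
              "n_cols": max((len(r) for r in rows), default=0),
              "windows": []}
    for start in range(0, len(rows), window):
        block = rows[start:start + window]
        non_empty = sum(1 for r in block for c in r if c not in (None, ""))
        sketch["windows"].append({
            "rows": (start, start + len(block) - 1),
            "non_empty_cells": non_empty,           # density hint for the solver
            "header_like": [str(c) for c in block[0][:8]],  # candidate headers
        })
    return sketch
```

Keeping each window small means the downstream reasoning step never needs the full sheet in context, which is the scalability point the abstract makes.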
Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Relative Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
https://arxiv.org/abs/2604.08042
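A minimal sketch of the relative-experience idea: sample several candidate sketches, rank them with a perceptual reward, and keep (better, worse) pairs as in-context "experiences" for the next round. The pairing scheme and `render_and_score` interface (e.g. CLIP similarity between a rendering and the prompt) are assumptions.

```python
# Hedged sketch of building pairwise experiences from scored candidates.
def build_experience_pairs(candidates, render_and_score, prompt):
    scored = sorted(((render_and_score(c, prompt), c) for c in candidates),
                    key=lambda t: t[0])          # ascending by reward
    half = len(scored) // 2
    # Pair each lower-scoring sketch with a higher-scoring counterpart.
    return [{"worse": w, "better": b, "margin": sb - sw}
            for (sw, w), (sb, b) in zip(scored[:half], scored[half:])]
```

Because the pairs are consumed as prompt context rather than gradients, the "reinforcement" stays black-box, consistent with the training-free claim above.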
Large language models (LLMs) can produce persuasive arguments in mathematical and logical domains, but such arguments often contain subtle missteps: omitted side conditions, invalid inference patterns, or appeals to a lemma that does not follow logically from the context under discussion. These flaws are notoriously hard to notice from the text alone, since even a misconstrued construction may still appear mostly accurate. Conversely, interactive theorem provers such as Lean and Coq offer rigorous reliability: they accept only statements that pass every syntactic and semantic check performed by a small trusted kernel. While this technique provides strong guarantees, it comes at a heavy price: the proof must be completely formalized, and the user or an auxiliary search program must supply an avalanche of low-level detail. This paper presents a hybrid pipeline in which an LLM generates a typed proof sketch in a compact DSL and a lightweight trusted kernel expands the sketch into explicit proof obligations.
https://arxiv.org/abs/2604.06401
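Since the paper's DSL is not shown, here is a hypothetical analogue in Lean 4: the sketch commits to typed intermediate statements, and each `sorry` plays the role of an explicit proof obligation that the trusted kernel checks once discharged. The glue lemma `add_nonneg` is from Mathlib.

```lean
-- Hypothetical illustration (not the paper's DSL): a typed proof sketch that
-- commits to intermediate statements; each `sorry` is an explicit obligation.
import Mathlib

theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have ha : 0 ≤ a ^ 2 := by sorry  -- obligation 1: a square is nonnegative
  have hb : 0 ≤ b ^ 2 := by sorry  -- obligation 2: same for b
  exact add_nonneg ha hb           -- glue step, type-checked by the kernel
```

The point of the division of labor is visible even here: the high-level shape of the argument is cheap to state, while the kernel guarantees that nothing is accepted until every hole is filled.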
Recent progress in artificial intelligence (AI) is unlocking transformative capabilities for mathematics. There is great hope that AI will help solve major open problems and autonomously discover new mathematical concepts. In this essay, we further consider how AI may open a grand perspective on mathematics by forging a new route, complementary to mathematical logic, to understanding the global structure of formal proofs. We begin by providing a sketch of the formal structure of mathematics in terms of universal proof and structural hypergraphs and discuss questions this raises about the foundational structure of mathematics. We then outline the main ingredients and provide a set of criteria to be satisfied for AI models capable of automated mathematical discovery. As we send AI agents to traverse Platonic mathematical worlds, we expect they will teach us about the nature of mathematics: both as a whole, and the small ribbons conducive to human understanding. Perhaps they will shed light on the old question: "Is mathematics discovered or invented?" Can we grok the terrain of these Platonic worlds?
https://arxiv.org/abs/2604.06107
We introduce AnyUser, a unified robotic instruction system for intuitive domestic task instruction via free-form sketches on camera images, optionally with language. AnyUser interprets multimodal inputs (sketch, vision, language) as spatial-semantic primitives to generate executable robot actions requiring no prior maps or models. Novel components include multimodal fusion for understanding and a hierarchical policy for robust action generation. Efficacy is shown via extensive evaluations: (1) Quantitative benchmarks on the large-scale dataset showing high accuracy in interpreting diverse sketch-based commands across various simulated domestic scenes. (2) Real-world validation on two distinct robotic platforms, a statically mounted 7-DoF assistive arm (KUKA LBR iiwa) and a dual-arm mobile manipulator (Realman RMC-AIDAL), performing representative tasks like targeted wiping and area cleaning, confirming the system's ability to ground instructions and execute them reliably in physical environments. (3) A comprehensive user study involving diverse demographics (elderly, simulated non-verbal, low technical literacy) demonstrating significant improvements in usability and task specification efficiency, achieving high task completion rates (85.7%-96.4%) and user satisfaction. AnyUser bridges the gap between advanced robotic capabilities and the need for accessible non-expert interaction, laying the foundation for practical assistive robots adaptable to real-world human environments.
https://arxiv.org/abs/2604.04811
Humans paint images incrementally: they plan a global layout, sketch a coarse draft, then inspect and refine details; most importantly, each step is grounded in the evolving visual state. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
https://arxiv.org/abs/2604.04746
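The four-stage iteration can be written schematically. This is a hedged sketch: `llm` and `image_model` are assumed interfaces standing in for the paper's actual components, and the prompts are illustrative.

```python
# Schematic sketch of one process-driven iteration:
# plan -> draft -> reflect -> refine, with text conditioning each visual step
# and the visual state grounding the next reflection.
def process_driven_step(llm, image_model, prompt, state=None):
    plan = llm(f"Plan the global layout for: {prompt}")          # textual planning
    draft = image_model(prompt=plan, init_image=state)           # visual drafting
    critique = llm(f"List elements violating '{prompt}' in this draft.",
                   image=draft)                                  # textual reflection
    refined = image_model(prompt=f"{plan}. Fix: {critique}",
                          init_image=draft)                      # visual refinement
    return refined, critique

# Unfold across iterations until the reflection step reports no violations.
```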
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
https://arxiv.org/abs/2604.01848
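A small harness in the spirit of the evaluation above: apply identity, rotation, and scaling transforms to each image, ask the model the same identity question, and measure answer consistency. The `query_vlm` callable and the exact transform set are assumptions.

```python
# Hedged sketch of a transform-consistency probe for a VLM.
from PIL import Image

def transform_consistency(query_vlm, image_path, question="What object is this?"):
    img = Image.open(image_path).convert("RGB")
    variants = {
        "identity": img,
        "rot90": img.rotate(90, expand=True),
        "rot180": img.rotate(180, expand=True),
        "half_scale": img.resize((img.width // 2, img.height // 2)),
    }
    answers = {name: query_vlm(v, question) for name, v in variants.items()}
    base = answers["identity"]
    # Fraction of transformed views whose answer matches the canonical one.
    return sum(a == base for a in answers.values()) / len(answers)
```

Running such a probe over sketches versus photographs would surface the sparsity effect the abstract describes: agreement drops as semantic content thins out.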
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
https://arxiv.org/abs/2604.01754
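A hypothetical sketch of the proof-sketch-guided distractor step: prompt an LLM with the theorem setup and a high-level proof strategy, and ask for conclusions a misled proof direction would plausibly reach. The prompt wording and `llm` interface are assumptions; the paper's pipeline is not public here.

```python
# Hedged sketch of generating misleading-direction distractors.
def make_distractors(llm, statement, proof_sketch, k=3):
    prompt = (
        f"Theorem setup: {statement}\n"
        f"High-level proof strategy: {proof_sketch}\n"
        f"Give {k} plausible but incorrect conclusions that would follow from "
        "misapplying this strategy, one per line."
    )
    # Keep the first k non-empty lines as candidate answer choices.
    return [ln.strip() for ln in llm(prompt).splitlines() if ln.strip()][:k]
```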
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: this https URL
https://arxiv.org/abs/2603.29029
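A simplified sketch of the dual-stream idea: spatial and semantic tokens keep separate input projections but attend jointly, so neither modality dominates. For brevity this uses standard multi-head attention; the paper's shared RoPE attention and exact block layout are not reproduced here.

```python
# Hedged PyTorch sketch of a dual-stream transformer block.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial_in = nn.Linear(dim, dim)   # mask/sketch token stream
        self.semantic_in = nn.Linear(dim, dim)  # text token stream
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial, semantic):
        x = torch.cat([self.spatial_in(spatial),
                       self.semantic_in(semantic)], dim=1)
        fused, _ = self.attn(x, x, x)   # joint attention over both streams
        x = self.norm(x + fused)
        n = spatial.shape[1]
        return x[:, :n], x[:, n:]       # split back into the two streams
```

Joint attention over the concatenated streams, rather than cross-attending one stream into the other, is one way to avoid the modal dominance the abstract criticizes in appended-adapter designs.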
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
https://arxiv.org/abs/2603.28363
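A minimal sketch of a SEA-style score: check each class-defining element with a VQA model and report the fraction retained. The element lists below and the `vqa(image, question) -> "yes"/"no"` interface are assumptions for illustration.

```python
# Hedged sketch of VQA-based element-presence scoring.
CLASS_ELEMENTS = {
    "cat": ["pointed ears", "whiskers", "a tail"],
    "bicycle": ["two wheels", "a handlebar", "a frame"],
}

def sea_score(vqa, sketch_image, class_name):
    elements = CLASS_ELEMENTS[class_name]
    present = sum(
        vqa(sketch_image, f"Does this sketch show {e}?").strip().lower() == "yes"
        for e in elements
    )
    return present / len(elements)  # semantic retention in [0, 1]
```

Because the score is reference-free, two very different drawings of a cat can both score highly so long as each economically depicts the class-defining elements.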
Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers), which limits their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay a signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates directly, we reframe Sig2GPS as an image-to-video generation task that operates directly in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement-learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next-GPS-point prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.
https://arxiv.org/abs/2603.26610
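A toy sketch of the map-visual rendering step: draw a signaling trace (serving-cell locations over time) onto a map tile so that a video model can "draw" the GPS path on top. Coordinate conversion, colors, and styling are assumptions, not the paper's rendering pipeline.

```python
# Hedged sketch of rendering a signaling trace onto a map tile.
from PIL import Image, ImageDraw

def render_trace(map_tile, cell_points, radius=4):
    """map_tile: PIL image of the area.
    cell_points: [(x_px, y_px), ...] in tile pixel coordinates, time-ordered."""
    img = map_tile.copy()
    draw = ImageDraw.Draw(img)
    for i, (x, y) in enumerate(cell_points):
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill="red")
        if i > 0:  # connect consecutive serving cells to show the trace order
            draw.line([cell_points[i - 1], (x, y)], fill="orange", width=2)
    return img
```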
We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.
https://arxiv.org/abs/2603.25357
Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.
https://arxiv.org/abs/2603.24327
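A schematic sketch of the pruned-fusion idea: learned fusion tokens attend to both modality streams in one cross-modal layer, after which the modality-specific tokens are dropped and only the fusion grid continues. Dimensions and layer counts are assumptions; SIGReg is not shown.

```python
# Hedged PyTorch sketch of pruned fusion with a fusion-token bottleneck.
import torch
import torch.nn as nn

class PrunedFusion(nn.Module):
    def __init__(self, dim=384, heads=6, n_fusion=64):
        super().__init__()
        self.fusion_tokens = nn.Parameter(torch.randn(1, n_fusion, dim) * 0.02)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, depth_tokens):
        b = rgb_tokens.shape[0]
        fusion = self.fusion_tokens.expand(b, -1, -1)
        ctx = torch.cat([rgb_tokens, depth_tokens], dim=1)
        fused, _ = self.cross(fusion, ctx, ctx)  # fusion tokens query both streams
        # Modality-specific tokens are dropped here: only the fusion grid
        # continues into the shared transformer, forming the latent bottleneck.
        return fusion + fused
```

Dropping the modality tokens after the first cross-modal layer is what makes the fusion grid an efficient bottleneck: later layers pay attention cost only over `n_fusion` tokens.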
Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at this https URL.
https://arxiv.org/abs/2603.22509
Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windows hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our approach encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that increases sequence lengths from 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.
https://arxiv.org/abs/2603.21978
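A toy sketch of encoding a CAD program as a hierarchical tree of sketch-extrusion features, the representation style the paper builds on, flattened into a command sequence for a generator. Field names are illustrative, not the paper's schema.

```python
# Hedged sketch of a hierarchical sketch-extrusion program representation.
from dataclasses import dataclass, field

@dataclass
class SketchNode:
    curves: list  # e.g. [("line", x0, y0, x1, y1), ("arc", ...)]

@dataclass
class ExtrudeNode:
    profile: SketchNode
    distance: float
    operation: str = "new_body"  # or "join", "cut"

@dataclass
class CADProgram:
    features: list = field(default_factory=list)  # ordered ExtrudeNodes

    def flatten(self):
        """Serialize the feature tree into a linear command sequence."""
        seq = []
        for f in self.features:
            seq += [("sketch", c) for c in f.profile.curves]
            seq.append(("extrude", f.distance, f.operation))
        return seq
```

The tree keeps geometry (curves) and topology (feature order and boolean operations) explicit, which is the kind of structure a diffusion process over long sequences needs to preserve.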