Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
https://arxiv.org/abs/2601.08832
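To make the core operation concrete, here is a minimal sketch of a controlled geometric transformation applied in the latent space of a frozen pretrained autoencoder. It assumes a diffusers-style AutoencoderKL as the frozen model; the paper's view-guided correspondence attention and diffusion-based refinement are omitted, so this shows only the latent view shift itself.

```python
import math

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

# Frozen pretrained VAE (assumption: any Stable-Diffusion-compatible
# autoencoder; the method operates on frozen pretrained models).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def latent_view_shift(image: torch.Tensor, angle_deg: float = 3.0) -> torch.Tensor:
    """Encode, apply a small rotation in latent space, decode.

    image: (1, 3, H, W) in [-1, 1]. The rotation stands in for the paper's
    controlled geometric transformations, i.e. re-observing the scene from
    a slightly shifted perspective.
    """
    z = vae.encode(image).latent_dist.sample()
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    mat = torch.tensor([[[c, -s, 0.0], [s, c, 0.0]]])  # 2x3 affine rotation
    grid = F.affine_grid(mat, list(z.shape), align_corners=False)
    z_view = F.grid_sample(z, grid, align_corners=False, padding_mode="reflection")
    return vae.decode(z_view).sample
```

The intuition, per the abstract, is that the watermark lives in pixel- and frequency-level statistics rather than in the semantic content, so decoding the warped latent re-renders the scene while the full framework restores structural consistency with correspondence attention.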
Detecting anomalies in high-dimensional, time-dependent simulation data is challenging due to complex spatial and temporal dynamics. We study reconstruction-based anomaly detection for ensemble data from parameterized Kármán vortex street simulations using convolutional autoencoders. We compare a 2D autoencoder operating on individual frames with a 3D autoencoder that processes short temporal stacks. The 2D model identifies localized spatial irregularities in single time steps, while the 3D model exploits spatio-temporal context to detect anomalous motion patterns and reduces redundant detections across time. We further evaluate volumetric time-dependent data and find that reconstruction errors are strongly influenced by the spatial distribution of mass, with highly concentrated regions yielding larger errors than dispersed configurations. Our results highlight the importance of temporal context for robust anomaly detection in dynamic simulations.
https://arxiv.org/abs/2601.08659
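As a concrete reference point, a minimal PyTorch sketch of the 3D (spatio-temporal) variant follows; channel widths and depth are illustrative assumptions, not the paper's configuration, and the 2D model is the same pattern with Conv2d applied to single frames.

```python
import torch
import torch.nn as nn

class Conv3dAutoencoder(nn.Module):
    """Reconstruction-based detector for short temporal stacks of frames.

    Input: (B, 1, T, H, W) with T, H, W divisible by 4 so the decoder
    recovers the exact spatial/temporal extents.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model: nn.Module, stack: torch.Tensor) -> torch.Tensor:
    """Per-sample mean squared reconstruction error; high values flag anomalies."""
    with torch.no_grad():
        recon = model(stack)
    return ((stack - recon) ** 2).mean(dim=(1, 2, 3, 4))
```

Scoring sliding windows of T frames with the 3D model is what lets reconstruction errors reflect anomalous motion rather than only per-frame irregularities.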
Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
https://arxiv.org/abs/2601.08626
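The deterministic protocol is easy to state in code. A minimal sketch, using a hypothetical Chinese four-character idiom as the item (the benchmark's actual item lists and prompting are not reproduced here):

```python
import random

def scramble(expression: str, seed: int = 0) -> str:
    """Return a non-identity permutation of a fixed four-character expression."""
    assert len(set(expression)) > 1, "needs at least two distinct characters"
    chars, rng = list(expression), random.Random(seed)
    while True:
        rng.shuffle(chars)
        if "".join(chars) != expression:
            return "".join(chars)

def exact_match_accuracy(predictions, references) -> float:
    """Deterministic scoring: credit only character-for-character recovery."""
    hits = sum(p.strip() == r for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical item: an idiom with a unique canonical order.
print(scramble("画蛇添足"))                             # a shuffled variant
print(exact_match_accuracy(["画蛇添足"], ["画蛇添足"]))  # 1.0
```

Because each expression has exactly one canonical order, exact match is a sound metric, sidestepping the multiple-valid-orders problem of sentence-level restoration.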
Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: this http URL
https://arxiv.org/abs/2601.08587
Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.
https://arxiv.org/abs/2601.08565
We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
https://arxiv.org/abs/2601.08549
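Of the three objectives, the contrastive one is fully standard. A self-contained sketch of the NT-Xent loss over paired embeddings of two augmented views of the same EEG windows (the SimCLR-style batch convention is an assumption):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5):
    """NT-Xent loss for a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (B, D) embeddings of two views of the same windows; every other
    sample in the combined 2B batch serves as a negative.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarity
    sim.fill_diagonal_(float("-inf"))                    # mask self-similarity
    n = sim.shape[0]
    targets = torch.arange(n, device=z.device).roll(n // 2)  # pair i with i +/- B
    return F.cross_entropy(sim, targets)
```

In the staged design, this loss trains only the second-stage representation; the denoising autoencoder has already been fit in stage one, which is what keeps the reconstruction and discriminative objectives from interfering.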
Tax code prediction is a crucial yet underexplored task in automating invoicing and compliance management for large-scale e-commerce platforms. Each product must be accurately mapped to a node within a multi-level taxonomic hierarchy defined by national standards, where errors lead to financial inconsistencies and regulatory risks. This paper presents Taxon, a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon integrates (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across taxonomy levels, and (ii) a semantic consistency model distilled from large language models acting as domain experts to verify alignment between product titles and official tax definitions. To address noisy supervision in real business records, we design a multi-source training pipeline that combines curated tax databases, invoice validation logs, and merchant registration data to provide both structural and semantic supervision. Extensive experiments on the proprietary TaxCode dataset and public benchmarks demonstrate that Taxon achieves state-of-the-art performance, outperforming strong baselines. Further, an additional full hierarchical path reconstruction procedure significantly improves structural consistency, yielding the highest overall F1 scores. Taxon has been deployed in production within Alibaba's tax service system, handling an average of over 500,000 tax code queries per day and reaching peak volumes above five million requests during business events, with improved accuracy, interpretability, and robustness.
https://arxiv.org/abs/2601.08418
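A minimal sketch of the feature-gating mixture-of-experts pattern, with one gate per taxonomy level so the shared experts are routed differently across levels; dimensions, expert count, and per-level label spaces are illustrative assumptions, not Taxon's architecture.

```python
import torch
import torch.nn as nn

class FeatureGatedMoE(nn.Module):
    """Per-level gated expert mixture for hierarchical code prediction."""
    def __init__(self, dim=256, num_experts=4, codes_per_level=(30, 200, 1000)):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_experts)]
        )
        self.gates = nn.ModuleList([nn.Linear(dim, num_experts) for _ in codes_per_level])
        self.heads = nn.ModuleList([nn.Linear(dim, n) for n in codes_per_level])

    def forward(self, fused: torch.Tensor):                    # fused: (B, dim)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=1)  # (B, E, dim)
        logits = []
        for gate, head in zip(self.gates, self.heads):
            w = gate(fused).softmax(dim=-1).unsqueeze(-1)      # level-specific routing
            logits.append(head((w * expert_out).sum(dim=1)))
        return logits                                          # one logit set per level
```

The semantic consistency model would act on top of these predictions, checking the chosen node's official definition against the product title.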
Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.
https://arxiv.org/abs/2601.08414
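The confidence-driven side of the fusion can be sketched as back-projection plus per-pixel confidence masking (illustrative numpy; SPARK's reliability and visibility models operate at both pixel and point level and are not reproduced here):

```python
import numpy as np

def fuse_point_clouds(depths, confs, intrinsics, extrinsics, conf_thresh=0.6):
    """Fuse per-camera depth maps into one world-frame point cloud.

    depths/confs: lists of (H, W) arrays; intrinsics: 3x3 K matrices;
    extrinsics: 4x4 camera-to-world transforms. Only pixels whose
    confidence (e.g. depth reliability x visibility) clears the threshold
    contribute, suppressing noise and view-dependent inconsistencies.
    """
    points = []
    for depth, conf, K, T in zip(depths, confs, intrinsics, extrinsics):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        mask = conf > conf_thresh
        z = depth[mask]
        x = (u[mask] - K[0, 2]) * z / K[0, 0]
        y = (v[mask] - K[1, 2]) * z / K[1, 1]
        cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # homogeneous, (4, N)
        points.append((T @ cam)[:3].T)                       # camera -> world
    return np.concatenate(points, axis=0)
```

Running this per frame, without accumulating across time, mirrors the frame-wise strategy that keeps the output stable in dynamic scenes and linear in the number of cameras.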
The Mamba architecture has been widely applied to various low-level vision tasks due to its exceptional adaptability and strong performance. Although the Mamba architecture has been adopted for spectral reconstruction, it still faces the following two challenges: (1) Single spatial perception limits the ability to fully understand and analyze hyperspectral images; (2) Single-scale feature extraction struggles to capture the complex structures and fine details present in hyperspectral images. To address these issues, we propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features. By integrating the multi-perceptual fusion block into a U-Net structure, M3SR can effectively extract and fuse global, intermediate, and local features, thereby enabling accurate reconstruction of hyperspectral images at multiple scales. Extensive quantitative and qualitative experiments demonstrate that the proposed M3SR outperforms existing state-of-the-art methods while incurring a lower computational cost.
https://arxiv.org/abs/2601.08293
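As a toy illustration of multi-perceptual fusion, the block below combines a local convolution, a dilated mid-range branch, and a global channel gate; this composition is an assumption for illustration and differs from M3SR's actual block, which builds on Mamba-style state-space layers.

```python
import torch
import torch.nn as nn

class MultiPerceptualFusion(nn.Module):
    """Fuse local, intermediate, and global perception of a feature map."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1)             # fine detail
        self.mid = nn.Conv2d(dim, dim, 3, padding=2, dilation=2)   # wider context
        self.global_gate = nn.Sequential(                          # image-level stats
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_gate(x)                      # (B, C, 1, 1) channel gate
        y = torch.cat([self.local(x), self.mid(x)], dim=1)
        return self.fuse(y) * g + x                  # gated fusion + residual
```

Placing such blocks at the levels of a U-Net is what yields the multi-scale extraction and fusion of global, intermediate, and local features described above.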
General intelligence must reorganize experience into internal structures that enable prediction and action under finite resources. Existing systems implicitly presuppose fixed primitive units -- tokens, subwords, pixels, or predefined sensor channels -- thereby bypassing the question of how representational units themselves emerge and stabilize. This paper proposes SANC(E3), an axiomatic framework in which representational units are not given a priori but instead arise as stable outcomes of competitive selection, reconstruction, and compression under finite activation capacity, governed by the explicit minimization of an energy functional E3. SANC(E3) draws a principled distinction between system tokens -- structural anchors such as {here, now, I} and sensory sources -- and tokens that emerge through self-organization during co-occurring events. Five core axioms formalize finite capacity, association from co-occurrence, similarity-based competition, confidence-based stabilization, and the reconstruction-compression-update trade-off. A key feature is a pseudo-memory-mapped I/O mechanism, through which internally replayed Gestalts are processed via the same axiomatic pathway as external sensory input. As a result, perception, imagination, prediction, planning, and action are unified within a single representational and energetic process. From the axioms, twelve propositions are derived, showing that category formation, hierarchical organization, unsupervised learning, and high-level cognitive activities can all be understood as instances of Gestalt completion under E3 minimization.
https://arxiv.org/abs/2601.08224
Corner detection is widely used in various computer vision tasks, such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.'s use of a simple corner model to obtain a series of corner characteristics, as the grayscale information of two adjacent corners can affect each other. To address this issue, this work uses a second-order Gaussian directional derivative (SOGDD) filter to smooth two typical high-resolution corner models (i.e., END-type and L-type models). The SOGDD representations of these two corner models are then derived separately, revealing many characteristics of high-resolution corners and demonstrating how to select Gaussian filtering scales so that the intensity variation information extracted from images accurately depicts adjacent corners. In addition, we propose, for the first time, a high-resolution corner detection method that can accurately detect adjacent corner points. Experimental results verify that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformations, image matching, and 3D reconstruction.
https://arxiv.org/abs/2601.08182
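The SOGDD filter itself is standard analytic machinery: the second directional derivative of a 2D Gaussian along an angle theta. A numpy/scipy sketch of the filter bank (scale and angle counts are illustrative choices):

```python
import numpy as np
from scipy.ndimage import convolve

def sogdd_kernel(sigma: float, theta: float, radius: int = 8) -> np.ndarray:
    """Second-order Gaussian directional derivative kernel along theta.

    Uses the analytic second partials of a 2D Gaussian:
    d2G/dn2 = cos^2(t) Gxx + 2 sin(t) cos(t) Gxy + sin^2(t) Gyy.
    """
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    gxx = (x**2 / sigma**4 - 1 / sigma**2) * g
    gyy = (y**2 / sigma**4 - 1 / sigma**2) * g
    gxy = (x * y / sigma**4) * g
    c, s = np.cos(theta), np.sin(theta)
    return c * c * gxx + 2 * s * c * gxy + s * s * gyy

def sogdd_responses(image: np.ndarray, sigma: float = 2.0, num_angles: int = 8):
    """Directional intensity-variation responses used to probe corners."""
    angles = np.pi * np.arange(num_angles) / num_angles
    return np.stack([convolve(image, sogdd_kernel(sigma, t)) for t in angles])
```

The paper's contribution lies in how the scale sigma is chosen so that responses from adjacent corners remain separable, which is precisely what the simple single-corner model fails to account for.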
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of that face, transforms it from one designated expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instructions. Since text prompts allow diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at this https URL
https://arxiv.org/abs/2601.08179
We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping tasks demonstrate its state-of-the-art performance, while effectively supporting continuous scene understanding across extended sequences and multiple visits.
https://arxiv.org/abs/2601.08175
Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., "an image of an [object class]"), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on the VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at this https URL.
https://arxiv.org/abs/2601.08022
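Two standard pieces underlie the pipeline: the deterministic DDIM inversion update and a per-pixel discrepancy map. A minimal sketch, assuming access to a frozen noise-prediction model and its cumulative schedule (the full method iterates this step up to a fixed intermediate timestep and then denoises back):

```python
import torch

@torch.no_grad()
def ddim_invert_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
                     alpha_bar_t: torch.Tensor, alpha_bar_next: torch.Tensor):
    """One deterministic DDIM inversion step (the standard eta = 0 update).

    alpha_bar_t / alpha_bar_next: cumulative schedule products at the
    current and the next (more noised) timestep, as 0-dim tensors.
    """
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return alpha_bar_next.sqrt() * x0_pred + (1 - alpha_bar_next).sqrt() * eps_pred

def anomaly_map(image: torch.Tensor, reconstruction: torch.Tensor) -> torch.Tensor:
    """Per-pixel discrepancy; large values localise candidate anomalies."""
    return (image - reconstruction).abs().mean(dim=1, keepdim=True)
```

Since the diffusion model has only seen normal data, the round trip pulls the image toward the normal manifold, and the discrepancy map concentrates on whatever the model could not reproduce.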
Data-driven flow-field reconstruction typically relies on autoencoder architectures that compress high-dimensional states into low-dimensional latent representations. However, classical approaches such as variational autoencoders (VAEs) often struggle to preserve the higher-order statistical structure of fluid flows when subjected to strong compression. We propose DiffCoder, a coupled framework that integrates a probabilistic diffusion model with a conventional convolutional ResNet encoder and trains both components end-to-end. The encoder compresses the flow field into a latent representation, while the diffusion model learns a generative prior over reconstructions conditioned on the compressed state. This design allows DiffCoder to recover distributional and spectral properties that are not strictly required for minimizing pointwise reconstruction loss but are critical for faithfully representing statistical properties of the flow field. We evaluate DiffCoder and VAE baselines across multiple model sizes and compression ratios on a challenging dataset of Kolmogorov flow fields. Under aggressive compression, DiffCoder significantly improves the spectral accuracy while VAEs exhibit substantial degradation. Although both methods show comparable relative L2 reconstruction error, DiffCoder better preserves the underlying distributional structure of the flow. At moderate compression levels, sufficiently large VAEs remain competitive, suggesting that diffusion-based priors provide the greatest benefit when information bottlenecks are severe. These results demonstrate that the generative decoding by diffusion offers a promising path toward compact, statistically consistent representations of complex flow fields.
https://arxiv.org/abs/2601.07946
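The coupling can be sketched as a conditional epsilon-matching objective with the encoder in the training loop; module signatures here are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def diffusion_prior_loss(encoder: nn.Module, denoiser: nn.Module,
                         flow: torch.Tensor, alpha_bars: torch.Tensor):
    """End-to-end step: compress the flow field, then train a conditional
    diffusion prior over reconstructions given the compressed state.

    flow: (B, C, H, W) flow-field snapshot; alpha_bars: (T,) cumulative
    noise schedule; denoiser(noisy, t, cond) is an assumed signature.
    """
    cond = encoder(flow)                                  # compressed latent state
    t = torch.randint(0, len(alpha_bars), (flow.shape[0],), device=flow.device)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(flow)
    noisy = a.sqrt() * flow + (1 - a).sqrt() * noise      # forward diffusion
    eps_pred = denoiser(noisy, t, cond)                   # conditioned on the code
    return ((eps_pred - noise) ** 2).mean()               # epsilon matching
```

Sampling from the learned prior at decode time, rather than regressing pixels directly, is what lets the reconstruction recover spectral and distributional structure that a pointwise loss never demands.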
We present StdGEN++, a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. Existing 3D generative methods often produce monolithic meshes that lack the structural flexibility required by industrial pipelines in gaming and animation. Addressing this gap, StdGEN++ is built upon a Dual-branch Semantic-aware Large Reconstruction Model (Dual-Branch S-LRM), which jointly reconstructs geometry, color, and per-component semantics in a feed-forward manner. To achieve production-level fidelity, we introduce a novel semantic surface extraction formalism compatible with hybrid implicit fields. This mechanism is accelerated by a coarse-to-fine proposal scheme, which significantly reduces memory footprint and enables high-resolution mesh generation. Furthermore, we propose a video-diffusion-based texture decomposition module that disentangles appearance into editable layers (e.g., separated iris and skin), resolving semantic confusion in facial regions. Experiments demonstrate that StdGEN++ achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. Crucially, the resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking, making it a robust solution for automated character asset production.
https://arxiv.org/abs/2601.07660
Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.
https://arxiv.org/abs/2601.07540
Fully convolutional networks have become the backbone of modern medical imaging due to their ability to learn multi-scale representations and perform end-to-end inference. Yet their potential for slice-to-volume reconstruction (SVR), the task of jointly estimating 3D anatomy and slice poses from misaligned 2D acquisitions, remains underexplored. We introduce a fast convolutional framework that fuses multiple orthogonal 2D slice stacks to recover coherent 3D structure and refines slice alignment through lightweight model-based optimization. Applied to fetal brain MRI, our approach reconstructs high-quality 3D volumes in under 10s, with 1s slice registration and accuracy on par with state-of-the-art iterative SVR pipelines, at a substantial speedup. The framework uses non-rigid displacement fields to represent transformations, generalizing to other SVR problems such as fetal body and placental MRI. Additionally, the fast inference time paves the way for real-time, scanner-side volumetric feedback during MRI acquisition.
https://arxiv.org/abs/2601.07519
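Representing transformations as dense non-rigid displacement fields reduces resampling to a single grid-sample call. A sketch, where the normalized-coordinate convention for the offsets is an assumption:

```python
import torch
import torch.nn.functional as F

def warp_with_displacement(volume: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Apply a dense displacement field to a 3D volume.

    volume: (B, C, D, H, W); disp: (B, 3, D, H, W), per-voxel offsets in
    normalized [-1, 1] grid coordinates (ordered x, y, z).
    """
    B, _, D, H, W = volume.shape
    dev = volume.device
    zs = torch.linspace(-1, 1, D, device=dev)
    ys = torch.linspace(-1, 1, H, device=dev)
    xs = torch.linspace(-1, 1, W, device=dev)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    base = torch.stack([x, y, z], dim=-1)[None].expand(B, -1, -1, -1, -1)
    grid = base + disp.permute(0, 2, 3, 4, 1)     # identity grid + offsets
    return F.grid_sample(volume, grid, align_corners=True)
```

A rigid slice pose is just a special case of such a field, which is why the same machinery generalizes to fetal body and placental MRI, where motion is genuinely non-rigid.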
Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at < 0.2 Mbps over WebRTC's data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~ 60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of > 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and > 1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at this https URL.
https://arxiv.org/abs/2601.07518
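The sub-0.2 Mbps claim is plausible from simple arithmetic; a back-of-the-envelope check with illustrative payload sizes (the actual feature set is not itemized in the abstract):

```python
# Hypothetical per-frame payload: SMPL-style body pose (72 floats),
# 50 expression coefficients, and a 32-float appearance code,
# transmitted as float16 at 30 frames per second.
floats_per_frame = 72 + 50 + 32
bytes_per_frame = floats_per_frame * 2               # float16
bits_per_second = bytes_per_frame * 8 * 30
print(f"{bits_per_second / 1e6:.3f} Mbps")           # ~0.074 Mbps, well under 0.2
```

Streaming compact driving features instead of geometry is where the >1000x saving over point-cloud streaming comes from; the receiver's deformation network re-synthesizes appearance locally.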
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
https://arxiv.org/abs/2601.07372
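A minimal sketch of the conditional-memory primitive as described: hash the trailing N-gram of token ids into a fixed embedding table for an O(1) lookup. The multiplicative hash and table size are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramMemory(nn.Module):
    """Each position fetches one embedding addressed by its trailing n-gram."""
    def __init__(self, dim: int, n: int = 3, table_size: int = 1 << 20):
        super().__init__()
        self.n, self.table_size = n, table_size
        self.table = nn.Embedding(table_size, dim)
        # Random odd multipliers for a simple multiplicative hash (assumption).
        self.register_buffer("mult", torch.randint(1, 1 << 30, (n,)) * 2 + 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        B, T = token_ids.shape
        padded = F.pad(token_ids, (self.n - 1, 0))       # left-pad with id 0
        h = torch.zeros(B, T, dtype=torch.long, device=token_ids.device)
        for k in range(self.n):
            h = h + padded[:, k:k + T] * self.mult[k]    # mix the n-gram ids
        return self.table(h % self.table_size)           # O(1) lookup per position
```

Because the address depends only on the token ids, the lookup is deterministic and can be prefetched from host memory before the forward pass reaches it, which is the infrastructure-aware property the abstract highlights.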