The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generative performance. We identify this as the ``pre-training scaling problem'' and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) the tokenizer exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to visual tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at this https URL.
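The joint objective described above can be sketched as a weighted sum of the three losses. The function names, loss weights, and toy feature shapes below are illustrative assumptions, not VTP's actual implementation:

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    # Symmetric image-text contrastive loss (CLIP-style InfoNCE).
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    labels = np.arange(len(img))

    def ce(l):
        # Cross-entropy with the matching pair on the diagonal as the target.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def vtp_loss(img_emb, txt_emb, student, teacher, recon, pixels,
             w_clip=1.0, w_ssl=1.0, w_rec=1.0):
    # Joint objective: contrastive + self-supervised alignment + reconstruction.
    l_clip = info_nce(img_emb, txt_emb)
    l_ssl = np.mean((student - teacher) ** 2)  # e.g. teacher-feature matching
    l_rec = np.mean((recon - pixels) ** 2)     # pixel reconstruction
    return w_clip * l_clip + w_ssl * l_ssl + w_rec * l_rec
```

Matched image-text pairs should score a lower contrastive loss than mismatched ones, which is the signal the joint objective adds on top of plain reconstruction.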
https://arxiv.org/abs/2512.13687
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
https://arxiv.org/abs/2512.13684
Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction quality with offline models while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{this https URL}{\texttt{this https URL}}$
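The layer-wise scale alignment idea can be illustrated on depth maps of the same frames predicted by two consecutive windows. The quantile-based layer segmentation and the median scale estimator below are simplifying assumptions for the sketch, not LASER's exact procedure:

```python
import numpy as np

def layerwise_scale_align(depth_prev, depth_curr, n_layers=4):
    """Align depth_curr to depth_prev on overlapping frames via per-layer scales.

    A single global Sim(3)-style scale fails when different scene layers drift
    by different factors, so pixels are binned into depth layers and one scale
    is solved per layer (illustrative stand-in for the paper's segmentation).
    """
    # Segment the reference depth into discrete layers via quantiles.
    edges = np.quantile(depth_prev, np.linspace(0.0, 1.0, n_layers + 1))
    aligned = depth_curr.copy()
    scales = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (depth_prev >= lo) & (depth_prev <= hi)
        # Robust per-layer scale: median ratio of reference to current depth.
        s = np.median(depth_prev[mask] / depth_curr[mask])
        aligned[mask] = depth_curr[mask] * s
        scales.append(s)
    return aligned, scales
```

When the second window's depths differ from the first by a constant factor within each layer, the per-layer median recovers that factor exactly.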
https://arxiv.org/abs/2512.13680
This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
https://arxiv.org/abs/2512.13639
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, since it serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories of flow matching, and we further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at this https URL.
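The forward flow in flow matching, and a cosine alignment loss of the kind such semantic distillation could use, can be sketched as follows; the linear interpolation path and the exact loss form are common conventions assumed here, not necessarily RecTok's implementation details:

```python
import numpy as np

def forward_flow(x0, x1, t):
    # Linear interpolation path used in (rectified) flow matching:
    # x_t = (1 - t) * x0 + t * x1, with velocity target x1 - x0.
    return (1.0 - t) * x0 + t * x1

def cosine_align_loss(feat, vfm_feat):
    # Distillation target: make features extracted along the forward flow
    # agree (in cosine similarity) with frozen VFM features.
    f = feat / np.linalg.norm(feat, axis=-1, keepdims=True)
    g = vfm_feat / np.linalg.norm(vfm_feat, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(f * g, axis=-1))
```

The loss is zero when the flow features are perfectly aligned with the VFM features and approaches 2 when they point in opposite directions.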
https://arxiv.org/abs/2512.13421
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
https://arxiv.org/abs/2512.13247
In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
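The positive/negative construction described above can be written as a standard InfoNCE over the masked nodes, with each reconstructed feature attracted to its own original feature and repelled from the other masked nodes. The function name, temperature, and normalization below are illustrative assumptions, not CORE's official implementation:

```python
import numpy as np

def core_contrastive(orig, recon, tau=0.2):
    """InfoNCE over masked nodes: each reconstructed feature should match its
    own original feature (positive pair) and repel the other masked nodes
    (negatives), combining MFR's reconstruction with GCL's discrimination."""
    o = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    r = recon / np.linalg.norm(recon, axis=1, keepdims=True)
    logits = r @ o.T / tau                  # (n_masked, n_masked) similarities
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))          # positives live on the diagonal
```

Accurate reconstructions of the right nodes score a lower loss than reconstructions assigned to the wrong nodes, which is exactly the discriminative signal added on top of plain feature reconstruction.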
https://arxiv.org/abs/2512.13235
We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework that uses soft per-view prediction selection together with a confidence-based soft multi-view inlier set to fuse the most consistent predictions from the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.
https://arxiv.org/abs/2512.13157
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering of input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, which is especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: this https URL
https://arxiv.org/abs/2512.13122
Vertical Federated Learning (VFL) enables collaborative model training across organizations that share common user samples but hold disjoint feature spaces. Despite its potential, VFL is susceptible to feature inference attacks, in which adversarial parties exploit shared confidence scores (i.e., prediction probabilities) during inference to reconstruct private input features of other participants. To counter this threat, we propose PRIVEE (PRIvacy-preserving Vertical fEderated lEarning), a novel defense mechanism named after the French word privée, meaning "private." PRIVEE obfuscates confidence scores while preserving critical properties such as relative ranking and inter-score distances. Rather than exposing raw scores, PRIVEE shares only the transformed representations, mitigating the risk of reconstruction attacks without degrading model prediction accuracy. Extensive experiments show that PRIVEE achieves a threefold improvement in privacy protection compared to state-of-the-art defenses, while preserving full predictive performance against advanced feature inference attacks.
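One simple family of maps that preserves both relative ranking and relative inter-score distances is an order-preserving affine transform. The sketch below is a hypothetical stand-in to illustrate those two invariances, not PRIVEE's actual obfuscation mechanism:

```python
import numpy as np

def obfuscate_scores(scores, rng):
    """Hypothetical score obfuscation preserving ranking and relative
    inter-score distances: a random order-preserving affine map.
    (Illustrative stand-in, not the transform proposed in the paper.)"""
    a = rng.uniform(0.5, 2.0)   # positive slope keeps the ranking intact
    b = rng.uniform(-1.0, 1.0)  # shift hides the raw probability values
    return a * np.asarray(scores, dtype=float) + b
```

Because the slope is positive, the argmax (and hence the prediction) is unchanged, while the raw confidence values an attacker would exploit for feature inference are no longer exposed.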
https://arxiv.org/abs/2512.12840
As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random initialization or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10 s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce a Deep Gaussian Prior, implemented as a conditional network that captures the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.
https://arxiv.org/abs/2512.12774
The spinal angle is an important indicator of body balance, so restoring the 3D shape of the human body and estimating the spine center line are important tasks. Existing multi-image-based body restoration methods require expensive equipment and complex procedures, while single-image-based methods struggle to accurately estimate internal structures such as the spine center line due to occlusion and viewpoint limitations. This study proposes a method that compensates for the shortcomings of the multi-image-based approach and addresses the limitations of the single-image approach. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restoration remains robust to noise and occlusion. In addition, Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stability of spinal angle estimation are secured simultaneously by using a Level-of-Detail ensemble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and verification confirms the improvement in matching quality.
https://arxiv.org/abs/2512.12718
Implicit neural representations (INRs) have become a powerful paradigm for continuous signal modeling and 3D scene reconstruction, yet classical networks suffer from a well-known spectral bias that limits their ability to capture high-frequency details. Quantum Implicit Representation Networks (QIREN) mitigate this limitation by employing parameterized quantum circuits with inherent Fourier structures, enabling compact and expressive frequency modeling beyond classical MLPs. In this paper, we present Quantum Neural Radiance Fields (Q-NeRF), the first hybrid quantum-classical framework for neural radiance field rendering. Q-NeRF integrates QIREN modules into the Nerfacto backbone, preserving its efficient sampling, pose refinement, and volumetric rendering strategies while replacing selected density and radiance prediction components with quantum-enhanced counterparts. We systematically evaluate three hybrid configurations on standard multi-view indoor datasets, comparing them to classical baselines using PSNR, SSIM, and LPIPS metrics. Results show that hybrid quantum-classical models achieve competitive reconstruction quality under limited computational resources, with quantum modules particularly effective in representing fine-scale, view-dependent appearance. Although current implementations rely on quantum circuit simulators constrained to few-qubit regimes, the results highlight the potential of quantum encodings to alleviate spectral bias in implicit representations. Q-NeRF provides a foundational step toward scalable quantum-enabled 3D scene reconstruction and a baseline for future quantum neural rendering research.
https://arxiv.org/abs/2512.12683
Vertical beam dropout in spinning LiDAR sensors triggered by hardware aging, dust, snow, fog, or bright reflections removes entire vertical slices from the point cloud and severely degrades 3D perception in autonomous vehicles. This paper proposes a Graph Attention Network (GAT)-based framework that reconstructs these missing vertical channels using only the current LiDAR frame, with no camera images or temporal information required. Each LiDAR sweep is represented as an unstructured spatial graph: points are nodes and edges connect nearby points while preserving the original beam-index ordering. A multi-layer GAT learns adaptive attention weights over local geometric neighborhoods and directly regresses the missing elevation (z) values at dropout locations. Trained and evaluated on 1,065 raw KITTI sequences with simulated channel dropout, the method achieves an average height RMSE of 11.67 cm, with 87.98% of reconstructed points falling within a 10 cm error threshold. Inference takes 14.65 seconds per frame on a single GPU, and reconstruction quality remains stable for different neighborhood sizes k. These results show that a pure graph attention model operating solely on raw point-cloud geometry can effectively recover dropped vertical beams under realistic sensor degradation.
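A single attention-weighted aggregation step of the kind described can be sketched as follows. The toy parameters `W` and `a` and the kNN neighborhood construction are illustrative assumptions; the paper trains a multi-layer GAT on graphs that preserve beam-index ordering:

```python
import numpy as np

def gat_regress_z(xy_missing, xy_known, z_known, W, a, k=8):
    """Single-head attention sketch: for each dropout point, attend over its k
    nearest surviving points and regress the missing elevation z as an
    attention-weighted average of neighbour elevations."""
    preds = []
    for q in xy_missing:
        d = np.linalg.norm(xy_known - q, axis=1)
        idx = np.argsort(d)[:k]                        # local neighbourhood
        h = np.tanh((xy_known[idx] - q) @ W)           # relative-offset embeddings
        e = h @ a                                      # attention logits
        w = np.exp(e - e.max())
        w /= w.sum()                                   # softmax attention weights
        preds.append(float(w @ z_known[idx]))          # attention-weighted z
    return np.array(preds)
```

Since the attention weights sum to one, the regression is exact on locally flat geometry and degrades gracefully as the neighborhood becomes less informative.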
https://arxiv.org/abs/2512.12410
Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale multimodal benchmark (661K frames, $9\times$ the prior largest), featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.
https://arxiv.org/abs/2512.12378
Computer-generated holography (CGH) presents a transformative solution for near-eye displays in augmented and virtual reality. Recent advances in deep learning have greatly improved CGH in reconstructed quality and computational efficiency. However, deploying neural CGH pipelines directly on compact, eyeglass-style devices is hindered by stringent constraints on computation and energy consumption, while cloud offloading followed by transmission with natural image codecs often distorts phase information and requires high bandwidth to maintain reconstruction quality. Neural compression methods can reduce bandwidth but impose heavy neural decoders at the edge, increasing inference latency and hardware demand. In this work, we introduce JPEG-Inspired Cloud-Edge Holography, an efficient pipeline designed around a learnable transform codec that retains the block-structured and hardware-friendly nature of JPEG. Our system shifts all heavy neural processing to the cloud, while the edge device performs only lightweight decoding without any neural inference. To further improve throughput, we implement custom CUDA kernels for entropy coding on both cloud and edge. This design achieves a peak signal-to-noise ratio of 32.15 dB at $<$ 2 bits per pixel with decode latency as low as 4.2 ms. Both numerical simulations and optical experiments confirm the high reconstruction quality of the holograms. By aligning CGH with a codec that preserves JPEG's structural efficiency while extending it with learnable components, our framework enables low-latency, bandwidth-efficient hologram streaming on resource-constrained wearable devices, using only simple block-based decoding readily supported by modern systems-on-chip, without requiring neural decoders or specialized hardware.
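The block-structured pipeline that JPEG uses, and that the codec retains, can be sketched as per-8x8-block DCT, quantization, dequantization, and inverse DCT. The fixed `qtable` here is only a placeholder for the learnable quantization the paper describes:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis, the transform JPEG applies per 8x8 block.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= 1.0 / np.sqrt(2.0)
    return M * np.sqrt(2.0 / n)

def codec_roundtrip(img, qtable):
    """Block codec sketch: per-8x8-block DCT -> quantize -> dequantize -> IDCT.
    In the paper's system the transform/quantization components are learned;
    qtable is a fixed stand-in showing the hardware-friendly block structure."""
    D = dct_matrix()
    H, W = img.shape
    out = np.empty((H, W), dtype=float)
    for i in range(0, H, 8):
        for j in range(0, W, 8):
            block = img[i:i+8, j:j+8].astype(float)
            coef = D @ block @ D.T                       # forward 2D DCT
            q = np.round(coef / qtable)                  # lossy quantization
            out[i:i+8, j:j+8] = D.T @ (q * qtable) @ D   # dequantize + IDCT
    return out
```

Coarser quantization tables trade reconstruction fidelity for bitrate, which is the knob the learnable codec tunes; the decode side remains a plain block transform that a system-on-chip can execute without neural inference.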
https://arxiv.org/abs/2512.12367
While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
https://arxiv.org/abs/2512.12307
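To make the metamer idea above concrete, here is a minimal, hedged sketch in pure NumPy (not MRD's physically based differentiable renderer): a toy linear "renderer" maps scene parameters to pixels and a toy linear "model" maps pixels to activations. Solving for parameters that reproduce the target activation (here by least squares, standing in for the paper's gradient-based search) yields a model metamer whose pixels differ from the target's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not MRD's components): a linear "renderer"
# from 4 scene parameters to 6 pixels, and a linear "model" from 6 pixels
# down to 2 activations. The model discards information, so metamers exist.
A = rng.normal(size=(6, 4))            # renderer
B = rng.normal(size=(2, 6))            # vision model

render = lambda theta: A @ theta
model = lambda img: B @ img

theta_target = rng.normal(size=4)
act_target = model(render(theta_target))

# Find scene parameters matching the target activation. lstsq returns the
# minimum-norm solution, which generically differs from theta_target.
BA = B @ A
theta_metamer, *_ = np.linalg.lstsq(BA, act_target, rcond=None)

act_gap = np.linalg.norm(model(render(theta_metamer)) - act_target)
pixel_gap = np.linalg.norm(render(theta_metamer) - render(theta_target))
# act_gap ~ 0 (a model metamer) while pixel_gap stays nonzero: the two
# scenes are physically different but indistinguishable to the model.
```

The pixel gap lives in the null space of the model-renderer composition, which is exactly the set of physical scene changes the model is invariant to.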
Sparse-view Computed Tomography (CT) reconstructs images from a limited number of X-ray projections to reduce radiation and scanning time, which makes reconstruction an ill-posed inverse problem. Deep learning methods achieve high-fidelity reconstructions but often overfit to a fixed acquisition setup, failing to generalize across sampling rates and image resolutions. For example, convolutional neural networks (CNNs) use the same learned kernels across resolutions, leading to artifacts when data resolution changes. We propose Computed Tomography neural Operator (CTO), a unified CT reconstruction framework that extends to continuous function space, enabling generalization (without retraining) across sampling rates and image resolutions. CTO operates jointly in the sinogram and image domains through rotation-equivariant Discrete-Continuous convolutions parametrized in the function space, making it inherently resolution- and sampling-agnostic. Empirically, CTO enables consistent multi-sampling-rate and cross-resolution performance, with on average >4dB PSNR gain over CNNs. Compared to state-of-the-art diffusion methods, CTO is 500$\times$ faster in inference time with on average 3dB gain. Empirical results also validate our design choices behind CTO's sinogram-space operator learning and rotation-equivariant convolution. Overall, CTO outperforms state-of-the-art baselines across sampling rates and resolutions, offering a scalable and generalizable solution that makes automated CT reconstruction more practical for deployment.
https://arxiv.org/abs/2512.12236
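The resolution-agnostic idea above can be illustrated with a toy stand-in (not CTO's actual rotation-equivariant discrete-continuous convolutions): parametrize a kernel as a continuous function and re-sample it on whatever grid the input uses, so the operator keeps the same physical receptive field at every resolution. Here a continuous Gaussian kernel applied to sin(2*pi*x) attenuates it by the same factor at both 32 and 128 samples.

```python
import numpy as np

def gaussian_weights(n, sigma=0.05, radius=0.2):
    """Sample a continuous Gaussian kernel on an n-point grid over [0, 1).

    Because the kernel is defined in function space, it can be discretized
    at any resolution while describing the same physical operator; a fixed
    discrete CNN kernel cannot do this.
    """
    h = 1.0 / n
    r = int(round(radius / h))
    x = np.arange(-r, r + 1) * h
    w = np.exp(-x**2 / (2 * sigma**2))
    return w / w.sum()                    # normalize to unit mass

def conv_periodic(f, w):
    """Circular convolution of samples f with symmetric kernel weights w."""
    r = (len(w) - 1) // 2
    return sum(w[j + r] * np.roll(f, -j) for j in range(-r, r + 1))

results = {}
for n in (32, 128):                       # two "image" resolutions
    x = np.arange(n) / n
    results[n] = conv_periodic(np.sin(2 * np.pi * x), gaussian_weights(n))
# At both resolutions the output approximates the same continuous operator:
# a Gaussian blur scales sin(2*pi*x) by exp(-2 * pi^2 * sigma^2) ~ 0.9518.
```

A discrete kernel trained at one grid spacing would instead shrink or grow its physical footprint when the resolution changes, which is the artifact mode the abstract describes for CNNs.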
Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders to ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that simultaneously pursues encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging a one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions at extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods in rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency, reaching 35.8 FPS on 1080p input images while maintaining competitive decoding speed compared to existing methods.
https://arxiv.org/abs/2512.12229
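The dual-side feature distillation scheme above is described only at a high level; as a minimal, hedged illustration of the feature-matching half of such an objective, the sketch below fits a toy linear "shallow student" encoder to reproduce a frozen "moderate teacher" encoder's features. All names and shapes are illustrative assumptions, not AEIC's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a frozen "moderate" teacher encoder and a shallow
# student, both linear here so the distillation fit is closed-form.
X = rng.normal(size=(256, 16))           # batch of flattened patches
W_teacher = rng.normal(size=(16, 4))     # frozen teacher projection
feats_teacher = X @ W_teacher            # teacher features to match

# Feature distillation objective: minimize ||X W - feats_teacher||_F^2
# over the student weights W. For a linear student this is ordinary
# least squares; a real shallow network would be trained by SGD on the
# same feature-matching loss.
W_student, *_ = np.linalg.lstsq(X, feats_teacher, rcond=None)
distill_loss = np.mean((X @ W_student - feats_teacher) ** 2)
# The teacher is exactly realizable by the student here, so the loss
# collapses to ~0; with a genuinely shallower student it would not.
```

The "dual-side" aspect (transferring knowledge on both encoder and decoder sides) would add further terms on top of this feature-matching loss.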
We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promote secondary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent-space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at this https URL and this https URL.
https://arxiv.org/abs/2512.12205
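The two per-sample drift metrics named in the abstract are defined only informally there; a hedged NumPy reading of them (the exact normalizations are assumptions) might look like:

```python
import numpy as np

def relative_centroid_drift(z, z_baseline):
    """Distance of a latent code z from the baseline quarter's centroid,
    normalized by the baseline's mean distance to that centroid
    (an assumed formulation of the paper's metric)."""
    centroid = z_baseline.mean(axis=0)
    spread = np.linalg.norm(z_baseline - centroid, axis=1).mean()
    return np.linalg.norm(z - centroid) / spread

def relative_reconstruction_error(x, x_hat, baseline_mse):
    """Per-sample reconstruction MSE normalized by the baseline quarter's
    mean MSE (an assumed formulation)."""
    return np.mean((x - x_hat) ** 2) / baseline_mse

# Example: baseline latents on the unit circle (centroid at the origin,
# mean spread 1), so a sample at distance 2 has drift 2.0.
z_base = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
drift = relative_centroid_drift(np.array([2.0, 0.0]), z_base)
rel_err = relative_reconstruction_error(
    np.zeros(4), 0.5 * np.ones(4), baseline_mse=0.25)
```

Values near 1 then mean "behaving like the baseline quarter," and growth over time signals visual drift at that camera node.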