Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training in simulation offers a cost-effective alternative, the visual domain gap between simulation and the robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised robot handover policies solely from RGB images, without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and {\em directly} deploy this policy in the real environment. Experiments in both Gaussian Splatting reconstructed scenes and real-world human-to-robot handovers demonstrate that H2RH-SGS serves as a new and effective representation for the human-to-robot handover task.
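The key mechanism, translating a simulated camera-pose change in the reconstructed scene into a gripper-pose change, comes down to a fixed hand-eye transform for the gripper-mounted camera. Below is a minimal numpy sketch under that assumption; the function names and the identity hand-eye in the toy example are illustrative, not part of H2RH-SGS.

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def camera_delta_to_gripper(T_cam_prev, T_cam_next, T_gripper_cam=np.eye(4)):
    """Map a simulated camera-pose change to a gripper-pose change.

    T_cam_prev / T_cam_next: 4x4 camera-to-world poses rendered from the
    reconstructed scene. T_gripper_cam: fixed hand-eye transform (camera pose
    expressed in the gripper frame), assumed known from calibration.
    """
    delta_cam = np.linalg.inv(T_cam_prev) @ T_cam_next          # motion in the camera frame
    # Conjugate by the hand-eye transform to express the same motion in the gripper frame.
    return T_gripper_cam @ delta_cam @ np.linalg.inv(T_gripper_cam)

if __name__ == "__main__":
    # Toy example: a 5 cm forward camera motion with an identity hand-eye transform.
    T0 = se3(np.eye(3), np.zeros(3))
    T1 = se3(np.eye(3), np.array([0.0, 0.0, 0.05]))
    print(camera_delta_to_gripper(T0, T1))
```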
https://arxiv.org/abs/2507.08726
Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
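A rough sketch of the core loop, propagating contour keypoints between slices or frames with pyramidal Lucas-Kanade flow and scoring the result with Dice, is shown below using OpenCV. The window size, pyramid depth, and the way tracked points are rasterized into a mask are illustrative choices, not the paper's exact settings.

```python
import cv2
import numpy as np

def track_keypoints(prev_img, next_img, prev_pts):
    """Propagate contour keypoints from one MRI slice/frame to the next using
    pyramidal Lucas-Kanade optical flow (cv2.calcOpticalFlowPyrLK)."""
    prev_pts = np.asarray(prev_pts, dtype=np.float32).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_img, next_img, prev_pts, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
    )
    return next_pts[status.ravel() == 1].reshape(-1, 2)   # keep successfully tracked points

def mask_from_contour(points, shape):
    """Rasterize tracked contour points into a binary muscle mask."""
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.fillPoly(mask, [np.round(points).astype(np.int32)], 1)
    return mask.astype(bool)

def dice(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum() + 1e-8)
```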
https://arxiv.org/abs/2507.08690
This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces, and a body-matched avatar provides visual feedback about the exercise. This avatar is necessary for (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings, and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments and features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.
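The visual 3D feedback step amounts to comparing a prerecorded clinical exercise against the patient's home recording. The sketch below shows one simple form of such a comparison on 3D keypoints (per-frame knee-angle error after TKR); the joint indices and the metric are assumptions for illustration, not the framework's implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by 3D keypoints a-b-c (e.g. hip-knee-ankle)."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def knee_angle_error(reference_seq, patient_seq, hip=0, knee=1, ankle=2):
    """Frame-wise knee-angle difference between a prerecorded clinical exercise and a
    patient's home recording, both given as (T, J, 3) keypoint arrays.
    The joint indices are placeholders, not AIRS's actual skeleton layout."""
    ref = np.array([joint_angle(f[hip], f[knee], f[ankle]) for f in reference_seq])
    pat = np.array([joint_angle(f[hip], f[knee], f[ankle]) for f in patient_seq])
    n = min(len(ref), len(pat))
    return np.abs(ref[:n] - pat[:n])
```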
https://arxiv.org/abs/2507.08624
This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.
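As a rough illustration of the dynamic spatiotemporal graph, the sketch below links detections that are close in both time and ground position; in the actual model these edges would carry learned spatial, contextual, and temporal features, whereas here the affinity is only an inverse-distance placeholder.

```python
import numpy as np

def build_spatiotemporal_graph(detections, max_dt=3, max_dist=1.5):
    """Link detections that are close in both time and ground position.

    detections: list of dicts with 'frame' (int) and 'pos' (2D or 3D position).
    Returns directed edges (i, j, attrs); a learned model would aggregate features
    over these edges, here the affinity is just an inverse-distance placeholder.
    """
    edges = []
    for i, di in enumerate(detections):
        for j, dj in enumerate(detections):
            dt = dj["frame"] - di["frame"]
            if 0 < dt <= max_dt:
                gap = float(np.linalg.norm(np.asarray(dj["pos"]) - np.asarray(di["pos"])))
                if gap <= max_dist:
                    edges.append((i, j, {"dt": dt, "affinity": 1.0 / (1.0 + gap)}))
    return edges
```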
https://arxiv.org/abs/2507.08494
3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology's broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.
https://arxiv.org/abs/2507.08448
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
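The semantic reconstruction objective can be pictured as aligning the tokenizer's outputs with the frozen foundation model's patch features. Below is a minimal cosine-alignment sketch in PyTorch; the exact loss and feature choice in VFMTok may differ.

```python
import torch
import torch.nn.functional as F

def semantic_reconstruction_loss(tokenizer_features, vfm_features):
    """Cosine alignment between the tokenizer's reconstructed patch features and the
    frozen vision foundation model's features. Both tensors: (batch, patches, dim)."""
    t = F.normalize(tokenizer_features, dim=-1)
    v = F.normalize(vfm_features.detach(), dim=-1)   # frozen target, no gradient
    return (1.0 - (t * v).sum(dim=-1)).mean()
```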
https://arxiv.org/abs/2507.08441
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging, even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm for holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation, which harnesses valuable observations and geometric cues to effectively guide 3D generative models in reconstructing complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
https://arxiv.org/abs/2507.08416
Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches -- including waveform and spectrogram-based diffusion models -- have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations up to 300 ms. We further evaluated our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at this https URL
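At a high level, inpainting in the discrete latent space means masking the codec tokens inside the gap and iteratively re-predicting them. The sketch below uses a simplified absorbing-state, mask-predict style sampler rather than the paper's exact noise schedule; `model`, `mask_id`, and the confidence-based re-masking rule are assumptions.

```python
import torch

@torch.no_grad()
def inpaint_gap(model, tokens, gap_mask, mask_id, steps=10):
    """Fill masked codec-token positions with a simplified mask-predict sampler.

    tokens:   (T,) long tensor of audio codec tokens
    gap_mask: (T,) bool tensor, True inside the missing segment
    model:    callable mapping a (1, T) token batch to (1, T, vocab) logits
    """
    x = tokens.clone()
    x[gap_mask] = mask_id                          # start with the gap fully masked
    unknown = gap_mask.clone()
    for step in range(steps):
        logits = model(x.unsqueeze(0))[0]          # (T, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        x[unknown] = pred[unknown]                 # tentatively fill every unknown slot
        keep_frac = (step + 1) / steps             # fraction of the gap to commit so far
        idx = unknown.nonzero(as_tuple=True)[0]
        n_remask = int((1.0 - keep_frac) * len(idx))
        unknown = torch.zeros_like(unknown)
        if n_remask > 0:
            worst = idx[conf[idx].argsort()[:n_remask]]   # least confident predictions
            x[worst] = mask_id                     # re-mask them for the next round
            unknown[worst] = True
    return x
```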
https://arxiv.org/abs/2507.08333
Craniofacial reconstruction in forensic science is crucial for the identification of the victims of crimes and disasters. The objective is to map a given skull to its corresponding face in a corpus of faces with known identities using recent advancements in computer vision, such as deep learning. In this paper, we present a framework for identifying a person from the X-ray image of a skull using convolutional Siamese networks for cross-domain identity representation. Siamese networks are twin networks that share the same architecture and can be trained to discover a feature space where similar observations are grouped together and dissimilar observations are pushed apart. To do this, the network is exposed to pairs of similar and dissimilar data, and the Euclidean distance is minimized between similar pairs and maximized between dissimilar ones. Since obtaining paired skull and face images is difficult, we prepared our own dataset of 40 volunteers whose front and side skull X-ray images and optical face images were collected. Experiments were conducted on the collected cross-domain dataset to train and validate the Siamese networks, and the results show satisfactory performance in identifying a person from a given skull.
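The training objective described here, minimizing Euclidean distance for matching skull/face pairs and pushing non-matching pairs apart up to a margin, is the classic contrastive loss. A compact PyTorch sketch follows; the margin value and the embedding interface are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Contrastive loss for a Siamese pair: pull matching skull/face embeddings
    together, push non-matching ones at least `margin` apart."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, emb_skull, emb_face, same_identity):
        # same_identity: 1 for a matching pair, 0 otherwise.
        d = F.pairwise_distance(emb_skull, emb_face)                      # Euclidean distance
        loss_similar = same_identity * d.pow(2)
        loss_dissimilar = (1 - same_identity) * F.relu(self.margin - d).pow(2)
        return 0.5 * (loss_similar + loss_dissimilar).mean()
```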
https://arxiv.org/abs/2507.08329
Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction quality. Furthermore, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness versus TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is this https URL
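The alternating optimization strategy, refining facial and oral motion parameters in turns, can be sketched generically as below. The optimizer, step counts, and the loss closure are placeholders, not M2DAO-Talker's actual training recipe.

```python
import torch

def alternating_refinement(facial_params, oral_params, loss_fn,
                           outer_rounds=5, inner_steps=50, lr=1e-3):
    """Alternately refine one motion-parameter group while the other stays fixed.

    facial_params / oral_params: lists of tensors with requires_grad=True
    loss_fn: closure over the current parameters returning a scalar loss
             (e.g. photometric reconstruction plus consistency terms)
    """
    opt_face = torch.optim.Adam(facial_params, lr=lr)
    opt_oral = torch.optim.Adam(oral_params, lr=lr)
    for _ in range(outer_rounds):
        for opt in (opt_face, opt_oral):           # facial pass, then oral pass
            for _ in range(inner_steps):
                opt.zero_grad()
                loss_fn().backward()
                opt.step()                         # only this group's parameters move
    return facial_params, oral_params
```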
https://arxiv.org/abs/2507.08307
Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy's effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.
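One reading of the "random fusion of multi-view point clouds" step is to transform a randomly chosen subset of per-view clouds into a shared world frame and concatenate them; the numpy sketch below follows that reading, with the coordinate convention and sampling range as assumptions.

```python
import numpy as np

def fuse_multiview_pointclouds(clouds, cam_to_world, min_views=2, rng=None):
    """Randomly fuse a subset of per-view point clouds in a shared world frame.

    clouds:       list of (N_i, 3) arrays, one per camera view
    cam_to_world: list of 4x4 camera-to-world transforms for those views
    """
    rng = rng or np.random.default_rng()
    k = int(rng.integers(min_views, len(clouds) + 1))       # how many views to fuse
    picked = rng.choice(len(clouds), size=k, replace=False)
    fused = []
    for v in picked:
        pts = np.asarray(clouds[v], dtype=np.float64)
        homog = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)   # (N, 4)
        fused.append((homog @ np.asarray(cam_to_world[v]).T)[:, :3])    # to world frame
    return np.concatenate(fused, axis=0)
```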
https://arxiv.org/abs/2507.08262
We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated, particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.
https://arxiv.org/abs/2507.08137
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose \NickName, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. \NickName~achieves the \textbf{state-of-the-art performance on both ImageNet and $8$ zero-shot benchmarks} across all VQ-VAEs. Notably, compared with SD-VAE, we outperform it on ImageNet significantly, with rFID $\textbf{0.49}$ vs. $\textbf{0.91}$, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of \NickName~in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at this https URL.
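One way to read "retain the latent dimension and incorporate a set of sub-codebooks" is a product-quantization-style design in which the channel dimension is split into groups, each quantized by its own sub-codebook. The PyTorch sketch below follows that reading and is not the paper's exact module.

```python
import torch
import torch.nn as nn

class SubCodebookQuantizer(nn.Module):
    """Split the latent channel dimension into groups and quantize each group
    with its own sub-codebook, keeping the full latent dimension intact."""
    def __init__(self, latent_dim=256, num_subcodebooks=8, codebook_size=1024):
        super().__init__()
        assert latent_dim % num_subcodebooks == 0
        self.sub_dim = latent_dim // num_subcodebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.sub_dim) for _ in range(num_subcodebooks)
        )

    def forward(self, z):  # z: (B, N, latent_dim) encoder features
        quantized, indices = [], []
        for chunk, book in zip(z.split(self.sub_dim, dim=-1), self.codebooks):
            codes = book.weight.unsqueeze(0).expand(chunk.shape[0], -1, -1)
            idx = torch.cdist(chunk, codes).argmin(dim=-1)        # nearest code per token
            q = book(idx)
            quantized.append(chunk + (q - chunk).detach())        # straight-through estimator
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```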
https://arxiv.org/abs/2507.07997
According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.
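As a toy illustration of single-pass, quality-conditioned token budgeting, the sketch below maps pooled image features plus a requested reconstruction quality to a token count. The architecture, names, and conditioning are assumptions and only loosely mirror KARL's halting mechanism.

```python
import torch
import torch.nn as nn

class TokenBudgetPredictor(nn.Module):
    """Toy single-pass head mapping pooled image features plus a requested
    reconstruction quality to a per-image token count."""
    def __init__(self, dim=768, max_tokens=256):
        super().__init__()
        self.max_tokens = max_tokens
        self.head = nn.Sequential(nn.Linear(dim + 1, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, token_features, target_quality):
        # token_features: (B, T, dim); target_quality: (B, 1), e.g. a desired PSNR proxy.
        pooled = token_features.mean(dim=1)
        frac = torch.sigmoid(self.head(torch.cat([pooled, target_quality], dim=-1)))
        return (frac * self.max_tokens).round().clamp(min=1).long().squeeze(-1)  # (B,)
```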
https://arxiv.org/abs/2507.07995
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
https://arxiv.org/abs/2507.07978
The recent advances in generative models such as diffusion models have raised several risks and concerns related to privacy, copyright infringements and data stewardship. To better understand and control the risks, various researchers have created techniques, experiments and attacks that reconstruct images, or parts of images, from the training set. While these techniques already establish that data from the training set can be reconstructed, they often rely on high resources, access to the training set, as well as well-engineered and carefully designed prompts. In this work, we devise a new attack that requires low resources, assumes little to no access to the actual training set, and identifies seemingly benign prompts that lead to potentially risky image reconstruction. This highlights the risk that images might be reconstructed unintentionally, even by an uninformed user. For example, we identified that, for one existing model, the prompt ``blue Unisex T-Shirt'' can generate the face of a real-life human model. Our method builds on an intuition from previous work that leverages domain knowledge, and identifies a fundamental vulnerability that stems from the use of data scraped from e-commerce platforms, where templated layouts and images are tied to pattern-like prompts.
https://arxiv.org/abs/2507.07947
Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
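The first experiment, imposing an ordering on latent channels without sacrificing reconstruction, can be sketched with a nested-dropout-style latent loss on an inner bottleneck: reconstructing the outer latent from a random channel prefix pushes the most information into the earliest channels. This is a plausible reading of the Re-Bottleneck idea, not the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReBottleneck(nn.Module):
    """Inner bottleneck inserted into a frozen autoencoder's latent space and
    trained purely with latent-space losses."""
    def __init__(self, dim=128):
        super().__init__()
        self.down = nn.Linear(dim, dim)
        self.up = nn.Linear(dim, dim)

    def forward(self, z):
        return self.up(self.down(z))

def ordered_channel_loss(rb, z):
    """Nested-dropout-style objective: reconstruct the outer latent z from a random
    prefix of inner channels, so earlier channels learn to carry more information."""
    h = rb.down(z)
    k = int(torch.randint(1, h.shape[-1] + 1, (1,)))
    mask = torch.zeros_like(h)
    mask[..., :k] = 1.0                      # keep only the first k inner channels
    return F.mse_loss(rb.up(h * mask), z)
```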
https://arxiv.org/abs/2507.07867
Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of state-of-the-art content-based puzzle solvers by introducing three types of jigsaw puzzle corruptions: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles decline rapidly in performance as more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.
https://arxiv.org/abs/2507.07828
Machine unlearning seeks to remove the influence of particular data or classes from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: the unlearned model merely pretends to forget by adjusting only its responses, while its internal representations retain enough information to restore the forgotten data or behavior. We empirically confirm this widespread shallowness by reverting the forgetting effect of various unlearning methods via a training-free performance recovery attack and a gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of ``deep forgetting'' based on one-point contraction of the feature representations of the data to be forgotten. We also propose an efficient approximation algorithm and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery and gradient-inversion attacks. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation and underscores the need for improved robustness in machine unlearning methods.
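The one-point-contraction criterion suggests a loss that pulls the features of forget-set samples toward a single point while preserving behavior on retained data. The sketch below is a simple reading of that idea; `model.features` and `model.classifier` are assumed interfaces, and the anchor point and weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def one_point_contraction_loss(forget_features, anchor=None):
    """Pull the feature representations of forget-set samples toward a single
    anchor point (the origin by default). forget_features: (B, D)."""
    if anchor is None:
        anchor = torch.zeros(forget_features.shape[-1], device=forget_features.device)
    return (forget_features - anchor).pow(2).sum(dim=-1).mean()

def unlearning_objective(model, forget_x, retain_x, retain_y, lam=1.0):
    """Stay accurate on retained data while contracting forget-set features.
    `model.features` and `model.classifier` are assumed backbone/head interfaces."""
    contraction = one_point_contraction_loss(model.features(forget_x))
    retain_loss = F.cross_entropy(model.classifier(model.features(retain_x)), retain_y)
    return retain_loss + lam * contraction
```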
https://arxiv.org/abs/2507.07754
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49\% compared to SOTA models (e.g., LMD-ViT). The source code is available at this https URL.
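Pixel-level pruning becomes possible after reparameterizing to $1\times 1$ convolutions because a $1\times 1$ convolution is just a per-pixel linear map, so sharp pixels can be skipped entirely. The sketch below gathers only the pixels flagged by a (hypothetical) blur mask, transforms them, and scatters the result back; in the full model, sharp regions would be passed through rather than zeroed.

```python
import torch

def sparse_pointwise_conv(x, blur_mask, weight, bias=None):
    """Apply a 1x1 convolution only at pixels flagged as blurred.

    A 1x1 convolution is a per-pixel linear map, so blurred pixels can be gathered
    into an (N, C_in) matrix, transformed, and scattered back, while sharp pixels
    are skipped (here they are simply left at zero).
    x: (B, C_in, H, W); blur_mask: (B, 1, H, W) in {0, 1}; weight: (C_out, C_in, 1, 1).
    """
    B, C_in, H, W = x.shape
    C_out = weight.shape[0]
    sel = blur_mask.squeeze(1).bool()                   # (B, H, W)
    x_nhwc = x.permute(0, 2, 3, 1)                      # channels-last view
    out = x.new_zeros(B, H, W, C_out)
    gathered = x_nhwc[sel]                              # (N, C_in) blurred pixels only
    transformed = gathered @ weight.view(C_out, C_in).t()
    if bias is not None:
        transformed = transformed + bias
    out[sel] = transformed                              # scatter back
    return out.permute(0, 3, 1, 2)                      # (B, C_out, H, W)
```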
https://arxiv.org/abs/2507.07708