We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
https://arxiv.org/abs/2504.13181
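As a concrete reference for the "contrastive vision-language training" the PE abstract relies on, here is a minimal sketch of a symmetric CLIP-style InfoNCE objective in PyTorch. The tensor shapes, the temperature value, and the function name are illustrative assumptions; PE's actual recipe, data engine, and alignment stages are described in the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) outputs of the two encoders (illustrative shapes).
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matching pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```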
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle to generalize due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to purely visual models, our approach avoids several drastic failure modes when tracking the in-hand object pose. In our experiments, it achieves an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error, compared to FoundationPose.
https://arxiv.org/abs/2504.13179
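The spring-mass analogy in the ViTa-Zero abstract can be illustrated with a deliberately simplified, position-only refinement loop: tactile contacts act as attractive springs pulling the estimate toward sensed contact points, and proprioceptive finger geometry acts as repulsive springs resolving penetration. Everything below (shapes, gains, the spherical penetration proxy) is an assumption for illustration; the paper operates on full 6D poses with feasibility checking on top of a visual backbone.

```python
import numpy as np

def refine_object_position(pos, contact_pts, finger_pts, radius,
                           k_attract=1.0, k_repulse=4.0, steps=100, lr=0.01):
    """Toy spring-mass refinement of an object *position* (not a full 6D pose).

    pos         : (3,) initial object-center estimate from the visual backbone.
    contact_pts : (N, 3) contact locations sensed by tactile sensors (attractive springs).
    finger_pts  : (M, 3) finger-link points from proprioception (repulsive when penetrated).
    radius      : scalar object radius used as a crude penetration proxy.
    """
    pos = pos.astype(float).copy()
    for _ in range(steps):
        # Attraction: pull the object toward the tactile contact points.
        f_attract = k_attract * (contact_pts - pos).sum(axis=0)

        # Repulsion: push the object center out of finger geometry it penetrates.
        diff = pos - finger_pts                              # (M, 3)
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-9
        penetration = np.clip(radius - dist, 0.0, None)
        f_repulse = k_repulse * (penetration * diff / dist).sum(axis=0)

        pos = pos + lr * (f_attract + f_repulse)
    return pos
```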
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
https://arxiv.org/abs/2504.13180
We propose spatial polarization multiplexing (SPM) for reconstructing object shape and reflectance from a single polarimetric image and demonstrate its application to dynamic surface recovery. Although single-pattern structured light enables single-shot shape reconstruction, the reflectance is challenging to recover due to the lack of angular sampling of incident light and the entanglement of the projected pattern with the surface color texture. We design a spatially multiplexed pattern of polarization that can be robustly and uniquely decoded for shape reconstruction by quantizing the AoLP values. At the same time, our spatial multiplexing enables single-shot ellipsometry of linear polarization by projecting differently polarized light within a local region, which separates the specular and diffuse reflections for BRDF estimation. We achieve this spatial polarization multiplexing with a constrained de Bruijn sequence. Unlike single-pattern structured light based on intensity and color, our polarization pattern is invisible to the naked eye and retains the natural surface appearance, which is essential for accurate appearance modeling and for interaction with people. We experimentally validate our method on real data. The results show that our method can recover the shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image. We also demonstrate the application of our method to dynamic surfaces.
https://arxiv.org/abs/2504.13177
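The decode-by-quantization idea from the SPM abstract can be sketched in one dimension: lay out AoLP levels as a de Bruijn sequence (so every local window of levels is unique), then recover the code by quantizing measured AoLP back to the nearest level. The generator below is the standard, unconstrained de Bruijn construction and the noise model is made up; the paper uses a constrained sequence and a full 2D polarization pattern.

```python
import numpy as np

def de_bruijn(k, n):
    """Standard de Bruijn sequence over k symbols with window length n."""
    a = [0] * k * n
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

K, N = 4, 3                               # 4 AoLP levels; every window of 3 is unique
levels = np.arange(K) * (180.0 / K)       # projected AoLP levels: 0, 45, 90, 135 degrees
code = np.array(de_bruijn(K, N))
pattern_aolp = levels[code]

# Simulated noisy AoLP measurement, decoded by quantizing to the nearest level
# (AoLP is periodic with period 180 degrees, hence the circular distance).
measured = pattern_aolp + np.random.normal(0.0, 3.0, size=pattern_aolp.shape)
dist = np.abs(((measured[:, None] - levels[None, :]) + 90.0) % 180.0 - 90.0)
decoded = np.argmin(dist, axis=1)
print("decode accuracy:", float((decoded == code).mean()))
```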
This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. The code and model are available at this https URL.
https://arxiv.org/abs/2504.13176
Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from imprecise physical simulation caused by inaccurate geometric reconstruction. This paper introduces RoboSplat, a novel method that generates diverse, visually realistic demonstrations by directly manipulating 3D Gaussians. Specifically, we reconstruct the scene through 3D Gaussian Splatting (3DGS), directly edit the reconstructed scene, and augment data across six types of generalization with five techniques: 3D Gaussian replacement for varying object types, scene appearance, and robot embodiments; equivariant transformations for different object poses; visual attribute editing for various lighting conditions; novel view synthesis for new camera perspectives; and 3D content generation for diverse object types. Comprehensive real-world experiments demonstrate that RoboSplat significantly enhances the generalization of visuomotor policies under diverse disturbances. Notably, while policies trained on hundreds of real-world demonstrations with additional 2D data augmentation achieve an average success rate of 57.2%, RoboSplat attains 87.8% in one-shot settings across six types of generalization in the real world.
https://arxiv.org/abs/2504.13175
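Of the five augmentation techniques listed in the RoboSplat abstract, the equivariant transformation for object poses is the easiest to make concrete: a reconstructed object's Gaussians are moved rigidly by transforming their means and rotating their covariances. The sketch below covers only that geometric step (it ignores spherical-harmonic color rotation and the other augmentation types), and the array shapes are assumptions.

```python
import numpy as np

def transform_gaussians(means, covs, R, t):
    """Apply a rigid transform (R, t) to a set of 3D Gaussians.

    means : (N, 3) Gaussian centers.
    covs  : (N, 3, 3) Gaussian covariances.
    A Gaussian N(mu, Sigma) maps to N(R mu + t, R Sigma R^T), so the splatted
    object moves rigidly while its shape and anisotropy are preserved.
    """
    new_means = means @ R.T + t
    new_covs = R @ covs @ R.T          # broadcasts over the leading N dimension
    return new_means, new_covs

# Example: rotate an object's Gaussians by 30 degrees about z and shift it 5 cm in x.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.05, 0.0, 0.0])
means, covs = np.random.randn(100, 3), np.tile(np.eye(3) * 1e-4, (100, 1, 1))
new_means, new_covs = transform_gaussians(means, covs, R, t)
```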
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias, the natural tendency to prioritize certain events or stimuli, we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models, Moneta, Yaad, and Memora, that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on specific tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
https://arxiv.org/abs/2504.13173
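The two attentional biases the Miras abstract says most sequence models already use can be written as one-line memory updates: a dot-product objective gives a Hebbian/linear-attention style outer-product write, while an L2 regression objective gives a delta-rule style corrective write. The sketch below only contrasts those two updates on a toy memory matrix; retention (forget) gates, the alternative objectives, and the parallelized training of Moneta, Yaad, and Memora are not shown.

```python
import numpy as np

def update_dot_product(M, k, v, lr=1.0):
    """Attentional bias = dot-product similarity: a gradient step on <M k, v>
    yields the Hebbian / linear-attention write M <- M + lr * v k^T."""
    return M + lr * np.outer(v, k)

def update_l2_regression(M, k, v, lr=0.5):
    """Attentional bias = L2 regression: a gradient step on ||M k - v||^2
    yields the delta-rule write M <- M - lr * (M k - v) k^T."""
    err = M @ k - v
    return M - lr * np.outer(err, k)

# Toy run: write a few key/value pairs, then read back with the first key.
d = 8
rng = np.random.default_rng(0)
keys, vals = rng.normal(size=(4, d)), rng.normal(size=(4, d))
M = np.zeros((d, d))
for k, v in zip(keys, vals):
    M = update_l2_regression(M, k, v)
print("recall error:", np.linalg.norm(M @ keys[0] - vals[0]))
```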
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
https://arxiv.org/abs/2504.13172
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
https://arxiv.org/abs/2504.13171
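The sleep-time-compute workflow described above is essentially a two-phase prompting pattern: reason about the context offline once, cache the result, and reuse it across related queries. The sketch below shows only that control flow; llm() is a hypothetical placeholder for whatever inference API is used, and the prompts are illustrative rather than the paper's.

```python
def llm(prompt: str) -> str:
    """Placeholder for a chat/completions call; swap in a real client here."""
    raise NotImplementedError

def sleep_time_pass(context: str) -> str:
    """Offline: think about the context before any query arrives and cache notes."""
    return llm(
        "Study the following context. Anticipate likely questions and pre-compute "
        "intermediate results, facts, and summaries that would help answer them.\n\n"
        f"Context:\n{context}"
    )

def test_time_pass(context: str, notes: str, query: str) -> str:
    """Online: answer using the cached notes, so less reasoning is needed per query."""
    return llm(
        f"Context:\n{context}\n\nPre-computed notes:\n{notes}\n\n"
        f"Question: {query}\nAnswer concisely, using the notes where possible."
    )

# Amortizing sleep-time compute across related queries about one context
# (the Multi-Query setting), paid once offline and reused online:
#   notes = sleep_time_pass(context)
#   answers = [test_time_pass(context, notes, q) for q in related_queries]
```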
We introduce a semidefinite relaxation for optimal control of linear systems with time scaling. These problems are inherently nonconvex, since the system dynamics involves bilinear products between the discretization time step and the system state and controls. The proposed relaxation is closely related to the standard second-order semidefinite relaxation for quadratic constraints, but we carefully select a subset of the possible bilinear terms and apply a change of variables to achieve empirically tight relaxations while keeping the computational load light. We further extend our method to handle piecewise-affine (PWA) systems by formulating the PWA optimal-control problem as a shortest-path problem in a graph of convex sets (GCS). In this GCS, different paths represent different mode sequences for the PWA system, and the convex sets model the relaxed dynamics within each mode. By combining a tight convex relaxation of the GCS problem with our semidefinite relaxation with time scaling, we can solve PWA optimal-control problems through a single semidefinite program.
https://arxiv.org/abs/2504.13170
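To make the bilinearity concrete, here is the standard lifting that the second-order relaxation described above builds on, written for a single Euler-discretized step. The notation is ours and deliberately generic; the paper's contribution is selecting which lifted entries to keep and applying a change of variables so that the relaxation stays small yet empirically tight.

```latex
\begin{align*}
&\text{Time-scaled step, bilinear in the step length $h$ and the state/controls:}\\
&\qquad x_{k+1} = x_k + h\,(A x_k + B u_k)\\
&\text{Lift } z_k = (h,\ x_k,\ u_k) \text{ and introduce } Z_k \approx z_k z_k^\top,
  \text{ so } h x_k,\ h u_k,\ h^2 \text{ are linear in the entries of } Z_k.\\
&\text{Relax the nonconvex condition } Z_k = z_k z_k^\top \text{ to the linear matrix inequality}\\
&\qquad \begin{bmatrix} 1 & z_k^\top \\ z_k & Z_k \end{bmatrix} \succeq 0 .
\end{align*}
```

With this lifting the dynamics become linear in $x_k$ and the entries of $Z_k$, so the trajectory problem can be posed as a single semidefinite program; the GCS extension then composes such relaxed per-mode dynamics along the edges of the graph.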
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: this https URL.
https://arxiv.org/abs/2504.13169
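The "retrospective resampling" described above is, at the level of control flow, generate-verify-backtrack: decode a span, let the model flag its own output as hallucinated, and if flagged, roll back to the last accepted prefix and try again. The sketch below shows only that loop; the flag token, the placeholder generate_chunk(), the temperature schedule, and the fallback text are assumptions, not the paper's trained tokens or decoding details.

```python
HALLUCINATION_FLAG = "<unsure>"   # hypothetical marker; the paper trains its own verification signal

def generate_chunk(prefix: str, temperature: float) -> str:
    """Placeholder for decoding one span with a VLM; swap in a real model call."""
    raise NotImplementedError

def retrospective_resampling(prompt: str, max_chunks: int = 32, max_retries: int = 3) -> str:
    """Generate span by span; when a span is self-flagged as hallucinated,
    roll back to the last accepted prefix and resample it at a higher temperature."""
    accepted = prompt
    for _ in range(max_chunks):
        for retry in range(max_retries):
            chunk = generate_chunk(accepted, temperature=0.2 + 0.3 * retry)
            if HALLUCINATION_FLAG not in chunk:
                accepted += chunk                      # commit the verified span
                break
        else:
            accepted += " [uncertain detail omitted]"  # refuse to commit a hallucinated span
        if accepted.endswith("</s>"):
            break
    return accepted
```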
Photorealistic reconstruction of both the scene and the human from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and improve generalization to out-of-distribution poses. To accurately learn the spatial correlation between human and scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance compared with existing methods in camera tracking, human pose estimation, novel view synthesis, and runtime. Our project page is at this https URL.
https://arxiv.org/abs/2504.13167
Dexterous manipulation is a fundamental capability for robotic systems, yet progress has been limited by hardware trade-offs between precision, compactness, strength, and affordability. Existing control methods impose compromises on hand designs and applications. However, learning-based approaches present opportunities to rethink these trade-offs, particularly to address challenges with tendon-driven actuation and low-cost materials. This work presents RUKA, a tendon-driven humanoid hand that is compact, affordable, and capable. Made from 3D-printed parts and off-the-shelf components, RUKA has 5 fingers with 15 underactuated degrees of freedom enabling diverse human-like grasps. Its tendon-driven actuation allows powerful grasping in a compact, human-sized form factor. To address control challenges, we learn joint-to-actuator and fingertip-to-actuator models from motion-capture data collected by the MANUS glove, leveraging the hand's morphological accuracy. Extensive evaluations demonstrate RUKA's superior reachability, durability, and strength compared to other robotic hands. Teleoperation tasks further showcase RUKA's dexterous movements. The open-source design and assembly instructions of RUKA, code, and data are available at this https URL.
https://arxiv.org/abs/2504.13165
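The learned fingertip-to-actuator model mentioned above is, in spirit, a supervised regression from desired fingertip positions to tendon motor commands, fit on pairs logged while the MANUS glove tracks the hand. The sketch below is entirely illustrative: the synthetic data, array shapes, and the choice of an MLP regressor are assumptions standing in for the paper's motion-capture dataset and model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical logged pairs: random actuator commands replayed on the hand while
# fingertip positions are captured (5 fingertips x 3 coordinates = 15 values).
actuator_cmds = rng.uniform(0.0, 1.0, size=(5000, 15))            # 15 underactuated DoF
fingertips = np.tanh(actuator_cmds @ rng.normal(size=(15, 15)))   # stand-in for measured tips

# Fingertip-to-actuator model: map desired fingertip positions to motor commands.
tip_to_act = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0)
tip_to_act.fit(fingertips, actuator_cmds)

desired_tips = fingertips[:1]              # e.g. a teleoperation target from the glove
cmds = tip_to_act.predict(desired_tips)    # commands to send to the tendon motors
print(cmds.shape)
```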
Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.
https://arxiv.org/abs/2504.13162
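The two-stage strategy described above separates "learn the subject" from "adapt the generator": first optimize only the text embeddings of new subject tokens with the backbone frozen, then fine-tune transformer layers jointly. The skeleton below uses a toy stand-in model to show how the trainable parameter sets differ between stages; the model, token ids, and learning rates are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ToyARModel(nn.Module):
    """Toy stand-in for an auto-regressive text-to-image model."""
    def __init__(self, vocab_size=1000, dim=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.blocks(self.embed(tokens)))

model = ToyARModel()
new_token_ids = [998, 999]   # hypothetical ids reserved for the personalized subject

# Stage 1: optimize only the text-embedding table, everything else frozen.
for p in model.parameters():
    p.requires_grad_(False)
model.embed.weight.requires_grad_(True)
stage1_opt = torch.optim.AdamW([model.embed.weight], lr=1e-3)
# (In practice the gradient would also be masked so only the rows in new_token_ids move.)

# Stage 2: unfreeze the transformer layers and fine-tune them jointly at lower rates.
for p in model.blocks.parameters():
    p.requires_grad_(True)
stage2_opt = torch.optim.AdamW([
    {"params": model.blocks.parameters(), "lr": 1e-5},
    {"params": [model.embed.weight], "lr": 1e-4},
])
```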
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: this https URL
https://arxiv.org/abs/2504.13161
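The bootstrapping loop in the CLIMB abstract alternates cheap prediction with a small number of expensive proxy-model runs: cluster the corpus in embedding space, propose candidate cluster mixtures, train the proxy on a few of them, fit a predictor on the observed scores, and use it to pick the next round of candidates. The sketch below only reproduces that loop shape; the random embeddings, the fake proxy score, and the random-forest predictor are placeholders for the real pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# 1) Embed documents (placeholder embeddings) and cluster them in semantic space.
doc_embeddings = rng.normal(size=(5000, 32))
n_clusters = 8
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

def train_proxy_and_score(mixture):
    """Placeholder for the expensive step: sample data with these per-cluster
    weights, train a small proxy model, and return a downstream score."""
    return float(-np.sum((mixture - 1.0 / n_clusters) ** 2))   # fake score for illustration

# 2) Iterate: score a few candidates with the proxy, fit a cheap predictor,
#    and keep the mixtures the predictor ranks highest for the next round.
candidates = rng.dirichlet(np.ones(n_clusters), size=64)
for _ in range(3):
    tried = candidates[:8]                                     # small budget of proxy runs
    scores = np.array([train_proxy_and_score(w) for w in tried])
    predictor = RandomForestRegressor(n_estimators=100, random_state=0).fit(tried, scores)
    keep = candidates[np.argsort(predictor.predict(candidates))[-16:]]
    candidates = np.vstack([keep, rng.dirichlet(np.ones(n_clusters) * 5.0, size=48)])

best_mixture = candidates[np.argmax(predictor.predict(candidates))]
print("best mixture over clusters:", np.round(best_mixture, 3))
```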
This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: this https URL
https://arxiv.org/abs/2504.13159
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
https://arxiv.org/abs/2504.13157
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster. Our code will be available at this https URL.
https://arxiv.org/abs/2504.13153
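The training-free "reprojection strategy" described above amounts to projecting each superpoint into every view and averaging the 2D semantic features it lands on. The sketch below shows that lifting step only, with simple pinhole cameras and nearest-pixel sampling; the superpoint-graph construction, occlusion handling, and the hierarchical semantic field are left to the paper, and all shapes are assumptions.

```python
import numpy as np

def lift_features_to_superpoints(sp_centers, feat_maps, cams):
    """Average per-view 2D semantic features onto 3D superpoints (no training).

    sp_centers : (S, 3) superpoint centers (e.g. means of their member Gaussians).
    feat_maps  : list of (H, W, C) dense 2D feature maps, one per view.
    cams       : list of (K, R, t): intrinsics K (3,3), world-to-camera R (3,3), t (3,).
    """
    S, C = sp_centers.shape[0], feat_maps[0].shape[-1]
    acc, cnt = np.zeros((S, C)), np.zeros(S)
    for feat, (K, R, t) in zip(feat_maps, cams):
        cam_pts = sp_centers @ R.T + t                 # world -> camera
        z = cam_pts[:, 2]
        pix = cam_pts @ K.T                            # pinhole projection (before divide)
        u, v = pix[:, 0] / z, pix[:, 1] / z
        H, W, _ = feat.shape
        ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui, vi = u[ok].astype(int), v[ok].astype(int)
        acc[ok] += feat[vi, ui]                        # nearest-pixel feature sample
        cnt[ok] += 1
    return acc / np.maximum(cnt, 1)[:, None]           # (S, C) per-superpoint feature
```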
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
https://arxiv.org/abs/2504.13152
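The adaptation scheme above replaces 4D ground truth with a reprojection loss: predicted world-frame points, when pushed through the camera, should land on the observed pixel locations (or 2D tracks). The sketch below is just that generic reprojection term in PyTorch; how the two pointmaps are paired and which correspondences supervise which branch is specific to the paper, and the shapes here are assumptions.

```python
import torch

def reprojection_loss(points_3d, pixels_2d, K, R, t):
    """Mean reprojection error between predicted 3D points and observed 2D locations.

    points_3d : (N, 3) predicted world-frame points (differentiable, so the loss
                can adapt the pointmap predictor at test time).
    pixels_2d : (N, 2) observed pixel coordinates (e.g. the pixel grid or 2D tracks).
    K : (3, 3) intrinsics; R : (3, 3), t : (3,) world-to-camera pose.
    """
    cam = points_3d @ R.T + t                        # world -> camera
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    return (uv - pixels_2d).norm(dim=-1).mean()
```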
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.
https://arxiv.org/abs/2504.13151