Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with only 100K training samples. Project website: this https URL
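As a rough illustration of the ControlNet-style steering idea, a frozen base velocity network can be paired with a text-conditioned branch whose output projection is zero-initialized, so steering starts out exactly reproducing the base model. This is a sketch under assumptions: the linear maps, dimensions, and conditioning below are stand-ins, not the actual Steer3D architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy latent dimension of the 3D-asset representation

# Frozen base image-to-3D velocity network (stand-in: a fixed linear map).
W_base = rng.standard_normal((D, D)) * 0.1

def base_velocity(x, t):
    # Pretrained flow-matching velocity field v(x, t); frozen during steering.
    return x @ W_base

# ControlNet-style adapter: a trainable branch whose output projection is
# zero-initialized, so at the start of training the steered model exactly
# matches the base model's behavior.
W_text = rng.standard_normal((D, D)) * 0.1  # text-conditioning weights
W_zero = np.zeros((D, D))                   # zero-init output projection

def steered_velocity(x, t, text_emb):
    return base_velocity(x, t) + (text_emb @ W_text) @ W_zero

x = rng.standard_normal((4, D))
text = rng.standard_normal((4, D))
# At initialization the adapter contributes nothing:
assert np.allclose(steered_velocity(x, 0.5, text), base_velocity(x, 0.5))
```

Training would then update `W_text`/`W_zero` on the flow-matching objective while `W_base` stays frozen, which is the property the zero-initialization is meant to protect.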
https://arxiv.org/abs/2512.13678
In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)}/(\sqrt{n}J\epsilon) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(\epsilon, \delta\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.
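The privacy and communication mechanics described above can be sketched generically as per-sample gradient clipping, Gaussian noise calibrated to the clipping bound, and top-k sparsification of the transmitted message. This is a standard DP + compression sketch, not the DP-CSGP algorithm itself; the push-sum updates over directed graphs are omitted.

```python
import numpy as np

def clip_grad(g, C):
    # Per-sample gradient clipping: rescale so that ||g||_2 <= C.
    norm = np.linalg.norm(g)
    return g * min(1.0, C / (norm + 1e-12))

def privatize(g, C, sigma, rng):
    # Gaussian mechanism: noise scale is tied to the clipping bound C.
    return g + rng.normal(0.0, sigma * C, size=g.shape)

def top_k(v, k):
    # Top-k sparsification: keep only the k largest-magnitude entries.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
g = rng.standard_normal(100) * 5.0
C, sigma, k = 1.0, 0.5, 10

g_clipped = clip_grad(g, C)
assert np.linalg.norm(g_clipped) <= C + 1e-9

g_private = privatize(g_clipped, C, sigma, rng)
msg = top_k(g_private, k)          # what a node actually transmits
assert np.count_nonzero(msg) == k  # 10x fewer coordinates on the wire
```

The utility bound in the abstract says that, for non-convex smooth objectives, this compressed-and-private scheme can match the accuracy of exact-communication decentralized DP baselines.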
https://arxiv.org/abs/2512.13583
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, since it serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at this https URL.
https://arxiv.org/abs/2512.13421
This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are utilized to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIP's computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at this https URL, promotes further collaboration and adoption in both industrial and production research applications.
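The PLP objective (balancing total times across a fixed number of stations) can be illustrated with a tiny stand-in: splitting an ordered assembly sequence into contiguous station blocks so as to minimize the maximum station time. This brute-force sketch replaces the paper's MIP formulation, and it assumes precedence is captured by the sequence order alone.

```python
from itertools import combinations

def balance_line(times, stations):
    # Split an ordered assembly sequence into `stations` contiguous blocks,
    # minimizing the maximum station time (a stand-in for the MIP objective).
    n = len(times)
    best = (float("inf"), None)
    # Choose station boundaries between tasks (brute force; fine for small n).
    for cuts in combinations(range(1, n), stations - 1):
        bounds = [0, *cuts, n]
        loads = [sum(times[a:b]) for a, b in zip(bounds, bounds[1:])]
        best = min(best, (max(loads), loads))
    return best

times = [4, 3, 6, 2, 5, 4]            # processing time per assembly task
makespan, loads = balance_line(times, 3)
assert makespan == 9                  # best split: [4,3] [6,2] [5,4]
assert sum(loads) == sum(times)       # every task assigned exactly once
```

A real MIP solver would handle the same objective with general precedence graphs and side constraints; the point here is only what "balancing the total times of a fixed number of stations" means as an optimization problem.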
https://arxiv.org/abs/2512.13219
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via this https URL.
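The trajectory-matching idea can be sketched as follows, assuming a "learning trajectory" is a per-checkpoint score curve and similarity is cosine; the actual TraPO criterion, features, and thresholds are not specified here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_reliable(unlabeled_traj, labeled_traj, threshold=0.9):
    # Keep an unlabeled sample if its learning trajectory (e.g. a
    # per-checkpoint reward curve) closely matches that of at least one
    # labeled sample whose reasoning was externally verified.
    keep = []
    for i, u in enumerate(unlabeled_traj):
        if max(cosine(u, l) for l in labeled_traj) >= threshold:
            keep.append(i)
    return keep

# Toy trajectories over 4 training checkpoints.
labeled = [np.array([0.9, 0.7, 0.5, 0.3])]      # steadily improving
unlabeled = [
    np.array([0.8, 0.6, 0.45, 0.25]),           # similar curve -> reliable
    np.array([0.2, 0.9, 0.1, 0.8]),             # erratic -> filtered out
]
assert select_reliable(unlabeled, labeled) == [0]
```

Only samples passing this filter would feed the consistency-based reward, which is the stabilizing role the abstract assigns to the small labeled set.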
https://arxiv.org/abs/2512.13106
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels at matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time. Over successive planning cycles, these errors compound, potentially resulting in severe failures. Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue. To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on the large-scale autonomous driving benchmark nuPlan demonstrate that the SoE method consistently and significantly improves the performance of all the evaluated models and achieves state-of-the-art results. This module may provide key and widely applicable support for improving the training efficiency of autonomous driving models.
https://arxiv.org/abs/2512.13094
Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs-themselves a form of coarse-to-fine representation-play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models. To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and a consistent indexing with dual-tower synergy module that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.
https://arxiv.org/abs/2512.13074
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work on adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
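The Dirichlet reparameterization can be sketched as mapping per-class logits to concentration parameters. The exact mapping (softplus below) and the uncertainty measure are assumptions, but they illustrate how the magnitude of evidence and a calibrated uncertainty fall out of the same representation.

```python
import numpy as np

def dirichlet_from_logits(logits):
    # Reparameterize per-class similarity logits as Dirichlet concentrations.
    # softplus(+1) keeps alpha > 1; the precise mapping is an assumption.
    alpha = np.log1p(np.exp(logits)) + 1.0
    alpha0 = alpha.sum()                  # total evidence
    mean = alpha / alpha0                 # expected class probabilities
    # Total uncertainty shrinks as evidence (alpha0) grows:
    uncertainty = len(alpha) / alpha0
    return mean, uncertainty

confident = np.array([8.0, 0.0, 0.0])     # clean input: one strong match
perturbed = np.array([1.0, 0.8, 0.9])     # adversarial: flattened similarity

m1, u1 = dirichlet_from_logits(confident)
m2, u2 = dirichlet_from_logits(perturbed)
assert m1.argmax() == 0
assert u2 > u1   # calibrated behavior: harder input -> higher uncertainty
```

The abstract's point is that adversarial fine-tuning should align these full distributions (both the relative structure in `mean` and the evidence magnitude in `alpha0`) rather than anchoring single logits.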
https://arxiv.org/abs/2512.12997
Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.
https://arxiv.org/abs/2512.12875
Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to $L^2$ loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.
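For the bimodal case the optimal velocity field has a closed form, which makes the claimed discontinuity easy to check numerically. The sketch below uses a standard-normal prior, point masses at ±a as the target, and straight-line probability paths, so the optimal field is u_t(x) = (E[x1 | x_t = x] − x)/(1 − t).

```python
import numpy as np

def velocity(x, t, a=2.0):
    # Optimal flow-matching velocity for prior N(0,1) and target
    # 0.5*delta(-a) + 0.5*delta(+a), under straight-line paths
    # x_t = (1-t) x0 + t x1:  u_t(x) = (E[x1 | x_t = x] - x) / (1 - t).
    s = 1.0 - t
    logp_pos = -((x - t * a) ** 2) / (2 * s * s)   # x_t | x1 = +a
    logp_neg = -((x + t * a) ** 2) / (2 * s * s)   # x_t | x1 = -a
    r = 1.0 / (1.0 + np.exp(logp_neg - logp_pos))  # P(x1 = +a | x_t = x)
    return (a * (2.0 * r - 1.0) - x) / s

def jump_at_zero(t, eps=1e-3):
    # Size of the discontinuity across the decision boundary x = 0.
    return velocity(eps, t) - velocity(-eps, t)

# The jump grows as t -> 1: particles near the boundary must commit to a mode.
assert jump_at_zero(0.9) > jump_at_zero(0.5) > 0.0
```

Evaluating `jump_at_zero` on a grid of t values reproduces the paper's qualitative claim: the jump is modest at intermediate times and blows up as t approaches the target distribution.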
https://arxiv.org/abs/2512.12821
The spinal angle is an important indicator of body balance, and estimating it requires restoring the 3D shape of the human body and the spine center line. Existing multi-image-based body restoration methods require expensive equipment and complex procedures, while single-image-based methods struggle to accurately estimate internal structures such as the spine center line due to occlusion and viewpoint limitations. This study proposes a method that compensates for the shortcomings of the multi-image approach and overcomes the limitations of the single-image approach. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Hierarchical matching of global and fine registration provides robustness to noise and occlusion. Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and a Level of Detail ensemble simultaneously secures the accuracy and stability of spinal angle estimation. The proposed method achieves high-precision 3D spine registration without relying on training data or complex neural network models, and verification confirms the improvement in matching quality.
https://arxiv.org/abs/2512.12718
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities (from full captions to sentences and phrases) and their corresponding visual regions. For each level of granularity, $\beta$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $\beta$-Contextualized Contrastive Alignment Loss ($\beta$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that $\beta$-CLIP significantly improves dense alignment: achieving 91.8% T2I and 92.3% I2T R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting state-of-the-art among methods trained without hard negatives. $\beta$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at this https URL.
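The cross-attention pooling step can be sketched as a single text query attending over patch features. The dimensions below are illustrative, and the $\beta$-CAL objective itself is omitted; the point is only how a phrase-level query yields a contextualized visual embedding grounded in specific patches.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(text_query, patch_feats):
    # One text granularity (caption / sentence / phrase) attends over image
    # patches; the weighted sum is that query's contextualized visual embedding.
    d = text_query.shape[-1]
    scores = patch_feats @ text_query / np.sqrt(d)   # (num_patches,)
    weights = softmax(scores)
    return weights @ patch_feats, weights

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))              # 16 patch features, dim 32
phrase = patches[3] + 0.1 * rng.standard_normal(32)  # phrase describing patch 3

pooled, w = attention_pool(phrase, patches)
assert np.isclose(w.sum(), 1.0)
assert w.argmax() == 3   # attention concentrates on the matching region
```

Running the same pooling with a caption-level, sentence-level, and phrase-level query produces the hierarchy of visual embeddings that $\beta$-CAL then aligns contrastively.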
https://arxiv.org/abs/2512.12678
Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: this https URL
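A minimal sketch of the patch-wise retrieval step, assuming precomputed patch descriptors and a global query embedding (patch counts and dimensions here are illustrative): each database image is scored by its best-matching patch, and the winning patch index doubles as the spatial localization that LocScore would evaluate.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def patch_retrieve(query, db_patches):
    # db_patches: (num_images, patches_per_image, dim) local descriptors.
    # Score each image by its best-matching patch against the global query;
    # the argmax patch also localizes where the match occurred.
    sims = db_patches @ query                   # (num_images, patches)
    best_patch = sims.argmax(axis=1)
    scores = sims.max(axis=1)
    ranking = np.argsort(-scores)
    return ranking, best_patch

rng = np.random.default_rng(0)
db = normalize(rng.standard_normal((5, 9, 64)))    # 5 images, 3x3 patch grid
query = db[2, 7] + 0.05 * rng.standard_normal(64)  # object from image 2, patch 7
query = query / np.linalg.norm(query)

ranking, best_patch = patch_retrieve(query, db)
assert ranking[0] == 2          # correct image retrieved...
assert best_patch[2] == 7       # ...and the matching region localized
```

A localization-aware check in the spirit of LocScore would then compare `best_patch` against the ground-truth object region, rather than scoring retrieval accuracy alone.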
https://arxiv.org/abs/2512.12610
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is the runner-up, excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while at the same time indicating which design choices are not likely to provide further gains.
https://arxiv.org/abs/2512.12465
Recent work from Anthropic claims that frontier models can sometimes detect and name injected "concepts" represented as activation directions. We test the robustness of these claims. First, we reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct, finding that the model identifies and names the injected concept 20 percent of the time under Anthropic's original pipeline, exactly matching their reported numbers and thus showing that introspection is not exclusive to very large or capable models. Second, we systematically vary the inference prompt and find that introspection is fragile: performance collapses on closely related tasks such as multiple-choice identification of the injected concept or different prompts of binary discrimination of whether a concept was injected at all. Third, we identify a contrasting regime of partial introspection: the same model can reliably classify the strength of the coefficient of a normalized injected concept vector (as weak / moderate / strong / very strong) with up to 70 percent accuracy, far above the 25 percent chance baseline. Together, these results provide more evidence for Anthropic's claim that language models effectively compute a function of their baseline, internal representations during introspection; however, these self-reports about those representations are narrow and prompt-sensitive. Our code is available at this https URL.
https://arxiv.org/abs/2512.12411
We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic morpheme inventories also generate rank-frequency curves with Zipf-like exponents around 1.1-1.4, similar to English, Russian and Romance languages. The key result is that Zipf-like behavior can emerge without meaning, communication pressure or optimization principles. The internal structure of morphology alone, combined with probabilistic activation of slots, is sufficient to create the robust statistical patterns observed across languages.
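The generative mechanism is simple enough to simulate directly. The slot probabilities and inventory sizes below are illustrative choices, not the paper's fitted values; within-slot choice is skewed so that the resulting rank-frequency curve comes out heavy-tailed.

```python
import collections
import numpy as np

rng = np.random.default_rng(0)

# Positional slots: (activation probability, morpheme inventory).
slots = [
    (0.3, [f"pre{i}" for i in range(10)]),    # prefixes
    (1.0, [f"root{i}" for i in range(50)]),   # roots (always present)
    (0.5, [f"suf{i}" for i in range(15)]),    # suffixes
    (0.4, [f"inf{i}" for i in range(8)]),     # inflections
]

def make_word():
    parts = []
    for p_on, inventory in slots:
        if rng.random() < p_on:
            # Skewed choice within a slot: earlier morphemes are more frequent.
            w = 1.0 / np.arange(1, len(inventory) + 1)
            parts.append(inventory[rng.choice(len(inventory), p=w / w.sum())])
    return "+".join(parts)

counts = collections.Counter(make_word() for _ in range(20000))
freqs = sorted(counts.values(), reverse=True)

lengths = [w.count("+") + 1 for w in counts]     # morphemes per word type
assert 1 <= min(lengths) and max(lengths) <= 4   # concentrated length range
assert freqs[0] > freqs[len(freqs) // 2]         # heavily skewed frequencies
```

Plotting `freqs` against rank on log-log axes is the quick way to eyeball the Zipf-like regime the abstract reports; estimating the exponent would require a proper fit, which this sketch omits.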
https://arxiv.org/abs/2512.12394
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval of objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
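The non-fusion retrieval philosophy reduces recognition to nearest-neighbor search in a shared embedding space, which can be sketched as follows. The embeddings here are synthetic stand-ins for the two towers' outputs; the actual WeDetect encoders and prompts are not modeled.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
D = 128

# Shared embedding space: class prompts encoded once by the text tower.
labels = ["cat", "dog", "car"]
text_emb = normalize(rng.standard_normal((3, D)))

# Region features from the image tower (here: noisy copies of text vectors).
regions = normalize(text_emb[[1, 0]] + 0.1 * rng.standard_normal((2, D)))

# Recognition as retrieval: the nearest text embedding wins. No fusion
# layers touch both modalities, so adding a class is just encoding a new
# prompt, and region embeddings can be stored and re-queried later
# (the "object retrieval over historical data" use case).
pred = (regions @ text_emb.T).argmax(axis=1)
assert [labels[i] for i in pred] == ["dog", "cat"]
```

Because scoring is a single matrix product, inference cost is dominated by the towers themselves, which is the efficiency advantage the abstract attributes to non-fusion designs.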
https://arxiv.org/abs/2512.12309
The evaluation of drag-based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground-truth target images, making objective comparisons between competing methods difficult. To address this, we introduce RealDrag, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources, providing source/target images, handle/target points, editable region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel-level matching fidelity, check preservation of non-edited (out-of-mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
https://arxiv.org/abs/2512.12287
Accurate segmentation of infant brain MRI is essential for quantifying developmental changes in structure and complexity. However, ongoing myelination and reduced tissue contrast make automated segmentation particularly challenging. This study systematically compared segmentation accuracy and its impact on volumetric and fractal dimension (FD) estimates in infant brain MRI using the Baby Open Brains (BOB) dataset (71 scans, 1-9 months). Two methods, SynthSeg and SamSeg, were evaluated against expert annotations using Dice, Intersection over Union, 95th-percentile Hausdorff distance, and Normalised Mutual Information. SynthSeg outperformed SamSeg across all quality metrics (mean Dice > 0.8 for major regions) and provided volumetric estimates closely matching the manual reference (mean +4% [-28%, 71%]). SamSeg systematically overestimated ventricular and whole-brain volumes (mean +76% [-12%, 190%]). Segmentation accuracy improved with age, consistent with increasing tissue contrast during myelination. Fractal dimension (FD) analyses revealed significant regional differences between SynthSeg and expert segmentations, and Bland-Altman limits of agreement indicated that segmentation-related FD variability exceeded most group differences reported in developmental cohorts. Volume and FD deviations were positively correlated across structures, indicating that segmentation bias directly affects FD estimation. Overall, SynthSeg provided the most reliable volumetric and FD results for paediatric MRI, yet small morphological differences in volume and FD should be interpreted with caution due to segmentation-related uncertainty.
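The overlap metrics used in this comparison have standard definitions, sketched below on toy binary masks (the study's per-region label handling may differ).

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A|+|B|)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def iou(a, b):
    """Intersection over Union between two binary masks: |A∩B| / |A∪B|."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

# Two 4x4 square masks offset by one pixel: 16 px each, 9 px of overlap.
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
ref  = np.zeros((8, 8), dtype=bool); ref[3:7, 3:7] = True
print(dice(pred, ref), iou(pred, ref))  # 2*9/32 = 0.5625 and 9/23 ≈ 0.3913
```

Note that Dice is always at least as large as IoU for the same pair of masks (Dice = 2·IoU/(1+IoU)), which is why reported Dice values such as > 0.8 correspond to noticeably lower IoU scores.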
https://arxiv.org/abs/2512.12222
Knowledge Graphs (KGs), thanks to their concise and efficient triple-based structure, have been widely applied in intelligent question answering, recommender systems, and other domains. However, the heterogeneous and multifaceted nature of real-world data inevitably renders the distribution of relations long-tailed, making it crucial to complete missing facts with limited samples. Previous studies are mainly based on metric matching or meta-learning, yet they either fail to fully exploit neighborhood information in the graph or overlook the distributional characteristics of contrastive signals. In this paper, we re-examine the problem from the perspective of generative representation and propose a few-shot knowledge graph completion framework that integrates a two-stage attention triple enhancer with a U-KAN-based diffusion model. Extensive experiments on two public datasets show that our method achieves new state-of-the-art results.
https://arxiv.org/abs/2512.12182