Spiking Neural Networks (SNNs) are promising bio-inspired third-generation neural networks. Recent research has trained deep SNN models with accuracy on par with Artificial Neural Networks (ANNs). Although the event-driven and sparse nature of SNNs shows potential for more energy-efficient computation than ANNs, SNN neurons have internal states that evolve over time. Keeping track of these states can significantly increase data movement and storage requirements, potentially eroding the advantages of SNNs over ANNs. This paper investigates the energy effects of neuron states and how they are influenced by the chosen mapping to realistic hardware architectures with advanced memory hierarchies. To this end, we develop STEMS, a mapping design space exploration tool for SNNs. STEMS models the stateful behavior of SNNs and explores intra-layer and inter-layer mapping optimizations to minimize data movement, considering both spatial and temporal SNN dimensions. Using STEMS, we show up to 12x reduction in off-chip data movement and 5x reduction in energy (on top of intra-layer optimizations) on two event-based vision SNN benchmarks. Finally, neuron states may not be needed for all SNN layers. By optimizing neuron states for one of our benchmarks, we show a 20x reduction in neuron states and 1.4x better performance without accuracy loss.
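Spiking neural networks (SNNs) are bio-inspired third-generation neural networks, and recent research has trained deep SNN models whose accuracy is comparable to that of artificial neural networks (ANNs). Although the event-driven, sparse nature of SNNs suggests more energy-efficient computation than ANNs, SNN neurons carry internal states that evolve over time. Tracking these states can significantly increase data movement and storage requirements, possibly cancelling the advantage over ANNs. This paper studies how neuron states affect energy, and how that effect depends on the mapping chosen for realistic hardware architectures with advanced memory hierarchies. To this end, we develop STEMS, a design space exploration tool for SNNs. STEMS models the stateful behavior of SNNs and explores intra-layer and inter-layer mapping optimizations to reduce data movement, taking both the spatial and temporal dimensions of SNNs into account. Using STEMS on two event-based vision SNN benchmarks, we show up to a 12x reduction in off-chip data movement and a 5x reduction in energy (on top of intra-layer optimizations). Finally, neuron states may not be necessary for every SNN layer. By optimizing the neuron states of one of our benchmarks, we show a 20x reduction in neuron states and a 1.4x performance improvement with no loss of accuracy.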
https://arxiv.org/abs/2502.03287
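To make the notion of neuron state in the STEMS abstract above concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron. The abstract does not name a neuron model, so LIF, the function name `lif_run`, and the `leak` and `threshold` values are assumptions chosen for illustration only. The membrane potential `v` is the per-neuron state that has to be stored and moved between time steps.

```python
def lif_run(inputs, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a list of input currents.

    Returns (spikes, potentials). The membrane potential is the internal
    state carried from one time step to the next.
    """
    v = 0.0  # membrane potential: the neuron state
    spikes, potentials = [], []
    for x in inputs:
        v = leak * v + x          # leak the old state, integrate the new input
        if v >= threshold:
            spikes.append(1)      # emit a spike
            v = 0.0               # hard reset after firing
        else:
            spikes.append(0)
        potentials.append(v)
    return spikes, potentials


# Hypothetical input current sequence, five time steps long.
spikes, potentials = lif_run([0.5, 0.5, 0.5, 0.0, 0.0])
```

A layer without such state would only need its inputs at the current time step, which is why removing unnecessary states can reduce storage and data movement.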
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
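We present a new method for systematically mapping the features that sparse autoencoders discover across consecutive layers of large language models, extending earlier work on inter-layer feature links. Using a data-free cosine similarity technique, we can trace how specific features persist, transform, or first appear at each stage. The method produces fine-grained flow graphs of feature evolution, which make detailed interpretation of, and mechanistic insight into, model computation possible. Most importantly, we show how these cross-layer feature maps can steer model behavior directly: amplifying or suppressing selected features achieves thematic control in text generation. Overall, our results underline the usefulness of a causal, cross-layer interpretability framework that both clarifies how features develop during the forward pass and offers new ways to manipulate large language models transparently.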
https://arxiv.org/abs/2502.03032
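The data-free cosine-similarity matching described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each sparse-autoencoder feature is represented by a direction vector, and the function names, the toy vectors, and the 0.5 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_features(layer_a, layer_b, threshold=0.5):
    """For each feature direction in layer_b, find its most similar
    predecessor in layer_a.

    Returns one (index_in_layer_a_or_None, similarity) pair per layer_b
    feature. None marks a feature with no sufficiently similar predecessor,
    i.e. one that appears to emerge at this layer.
    """
    matches = []
    for v in layer_b:
        sims = [cosine(u, v) for u in layer_a]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            matches.append((best, sims[best]))
        else:
            matches.append((None, sims[best]))
    return matches


# Toy feature directions for two consecutive layers (hypothetical values).
layer_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer_b = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = match_features(layer_a, layer_b)
```

No input text is needed for the comparison, which is what "data-free" refers to here: only the learned feature vectors are compared.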
Binary and sparse ternary weights in neural networks enable faster computations and lighter representations, facilitating their use on edge devices with limited computational power. Meanwhile, vanilla RNNs are highly sensitive to changes in their recurrent weights, making the binarization and ternarization of these weights inherently challenging. To date, no method has successfully achieved binarization or ternarization of vanilla RNN weights. We present a new approach leveraging the properties of Hadamard matrices to parameterize a subset of binary and sparse ternary orthogonal matrices. This method enables the training of orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary and sparse ternary vanilla RNNs. The resulting ORNNs, called HadamRNN and lock-HadamRNN, are evaluated on benchmarks such as the copy task, permuted and sequential MNIST tasks, and IMDB dataset. Despite binarization or sparse ternarization, these RNNs maintain performance levels comparable to state-of-the-art full-precision models, highlighting the effectiveness of our approach. Notably, our approach is the first solution with binary recurrent weights capable of tackling the copy task over 1000 timesteps.
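Binary and sparse ternary weights in neural networks allow faster computation and lighter representations, which makes them usable on edge devices with limited computing resources. However, vanilla RNNs are very sensitive to changes in their recurrent weights, so binarizing or ternarizing these weights is extremely challenging. To date, no method has succeeded in binarizing or ternarizing the recurrent weights of a vanilla RNN. We propose a new approach that uses the properties of Hadamard matrices to parameterize a set of binary and sparse ternary orthogonal matrices. This makes it possible to train orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary or sparse ternary vanilla RNNs. We evaluate the resulting HadamRNN and lock-HadamRNN on the copy task, the permuted and sequential MNIST tasks, and the IMDB dataset, and show that even after binarization or sparse ternarization these RNNs maintain performance comparable to state-of-the-art full-precision models, which highlights the effectiveness of our approach. Notably, ours is the first method with binary recurrent weights able to solve the copy task over more than 1000 timesteps.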
https://arxiv.org/abs/2502.00047
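The abstract above does not spell out the parameterization, but the property that makes Hadamard matrices a natural starting point can be shown in a few lines: their entries are all +1 or -1, and their rows are mutually orthogonal. The sketch below uses Sylvester's construction, which is one standard way to build such matrices and is not claimed to be the paper's method. The function names are illustrative.

```python
def sylvester_hadamard(k):
    """Return the 2**k x 2**k Hadamard matrix built by Sylvester's construction.

    Starting from [[1]], each step maps H to [[H, H], [H, -H]].
    """
    h = [[1]]
    for _ in range(k):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def gram(h):
    """Compute H @ H^T with plain Python lists."""
    n = len(h)
    return [[sum(h[i][t] * h[j][t] for t in range(n)) for j in range(n)]
            for i in range(n)]


H = sylvester_hadamard(3)  # 8 x 8 matrix with entries in {-1, +1}
G = gram(H)
n = len(H)
# H @ H^T equals n * I, so H / sqrt(n) is an orthogonal matrix.
is_scaled_orthogonal = all(G[i][j] == (n if i == j else 0)
                           for i in range(n) for j in range(n))
```

Because the matrix is orthogonal up to the single scalar `1 / sqrt(n)`, a recurrent weight built from it keeps binary entries while preserving the norm of the hidden state.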
Multi-person motion capture over sparse angular observations is a challenging problem under interference from both self- and mutual-occlusions. Existing works produce accurate 2D joint detections; however, when these are triangulated and lifted into 3D, available solutions all struggle to select the most accurate candidates and associate them with the correct joint type and target identity. To fully utilize all accurate 2D joint location information, we propose to independently triangulate between all same-typed 2D joints from all camera views regardless of their target ID, forming the Joint Cloud. The Joint Cloud consists of both valid joints lifted from the same joint type and target ID and falsely constructed ones that come from different 2D sources. These redundant and inaccurate candidates are processed by the proposed Joint Cloud Selection and Aggregation Transformer (JCSAT), which involves three cascaded encoders that deeply explore the trajectory, skeletal structural, and view-dependent correlations among all 3D point candidates in the cross-embedding space. An Optimal Token Attention Path (OTAP) module is proposed, which subsequently selects and aggregates informative features from these redundant observations for the final prediction of human motion. To demonstrate the effectiveness of JCSAT, we build and publish a new multi-person motion capture dataset, BUMocap-X, with complex interactions and severe occlusions. Comprehensive experiments over the newly presented as well as benchmark datasets validate the effectiveness of the proposed framework, which outperforms all existing state-of-the-art methods, especially under challenging occlusion scenarios.
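Multi-person motion capture from sparse viewing angles is a challenging problem because of interference from self-occlusion and mutual occlusion. Existing work detects 2D joint positions accurately, but when these are triangulated and lifted to 3D, current solutions struggle to pick the most accurate candidates and to associate them with the correct joint type and target identity. To make full use of all accurate 2D joint information, we propose to triangulate independently between all 2D joints of the same type from all camera views, regardless of their target ID, forming a "Joint Cloud". The Joint Cloud contains both true joints, validly lifted from the same joint type and target ID, and false joints, wrongly constructed from different 2D sources. These redundant and inaccurate candidates are processed by the proposed Joint Cloud Selection and Aggregation Transformer (JCSAT), whose three cascaded encoders explore the trajectory, skeletal structure, and view-dependent correlations among all 3D point candidates in the cross-embedding space. We also propose an Optimal Token Attention Path (OTAP) module that selects and aggregates informative features from the redundant observations for the final prediction of human motion. To demonstrate the effectiveness of JCSAT, we build and release BUMocap-X, a new multi-person motion capture dataset with complex interactions and severe occlusions. Comprehensive experiments on the new dataset and on benchmark datasets validate the proposed framework, which outperforms all existing state-of-the-art methods, particularly in challenging occlusion scenarios. In short, this work overcomes a key problem in existing 3D human reconstruction by introducing a new processing approach, and demonstrates its effectiveness with a new dataset featuring complex interactions and severe occlusions.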
https://arxiv.org/abs/2502.02936
We show that contact-rich motion planning is also sparsity-rich when viewed as polynomial optimization (POP). We can exploit not only the correlative and term sparsity patterns that are general to all POPs, but also specialized sparsity patterns from the robot kinematic structure and the separability of contact modes. Such sparsity enables the design of high-order but sparse semidefinite programming (SDP) relaxations--building upon Lasserre's moment and sums of squares hierarchy--that (i) can be solved in seconds by off-the-shelf SDP solvers, and (ii) compute near globally optimal solutions to the nonconvex contact-rich planning problems with small certified suboptimality. Through extensive experiments both in simulation (Push Bot, Push Box, Push Box with Obstacles, and Planar Hand) and in the real world (Push T), we demonstrate the power of using convex SDP relaxations to generate global contact-rich motion plans. As a contribution of independent interest, we release the Sparse Polynomial Optimization Toolbox (SPOT)--implemented in C++ with interfaces to both Python and Matlab--that automates sparsity exploitation for robotics and beyond.
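We show that contact-rich motion planning, when viewed as polynomial optimization (POP), is also rich in sparsity. We can exploit not only the correlative and term sparsity patterns common to all POPs, but also specialized sparsity patterns that come from the robot's kinematic structure and from the separability of contact modes. This sparsity makes it possible to design high-order yet sparse semidefinite programming (SDP) relaxations, based on Lasserre's moment and sums-of-squares hierarchy, that (i) can be solved in seconds by off-the-shelf SDP solvers, and (ii) compute near globally optimal solutions to nonconvex contact-rich planning problems with small certified suboptimality. Through extensive experiments in simulation (Push Bot, Push Box, Push Box with Obstacles, and Planar Hand) and in the real world (Push T), we demonstrate the power of convex SDP relaxations for generating global contact-rich motion plans. As an independent contribution, we release the Sparse Polynomial Optimization Toolbox (SPOT), written in C++ with Python and Matlab interfaces, which automates sparsity exploitation for robotics and other fields.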
https://arxiv.org/abs/2502.02829
Diffusion models demonstrate state-of-the-art performance on image generation, and are gaining traction for sparse medical image reconstruction tasks. However, compared to classical reconstruction algorithms relying on simple analytical priors, diffusion models have the dangerous property of producing realistic-looking results even when incorrect, particularly with few observations. We investigate the utility of diffusion models as priors for image reconstruction by varying the number of observations and comparing their performance to classical priors (sparse and Tikhonov regularization) using pixel-based, structural, and downstream metrics. We make comparisons on low-dose chest wall computed tomography (CT) for fat mass quantification. First, we find that classical priors are superior to diffusion priors when the number of projections is "sufficient". Second, we find that diffusion priors can capture a large amount of detail with very few observations, significantly outperforming classical priors. However, they fall short of capturing all details, even with many observations. Finally, we find that the performance of diffusion priors plateaus after extremely few ($\approx$10-15) projections. Ultimately, our work highlights potential issues with diffusion-based sparse reconstruction and underscores the importance of further investigation, particularly in high-stakes clinical settings.
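Diffusion models perform excellently at image generation and are increasingly popular for sparse medical image reconstruction. However, compared with classical reconstruction algorithms that rely on simple analytical priors, diffusion models have a dangerous property: they can produce very realistic-looking images even when the result is wrong, especially when few observations are available. We study their usefulness as priors for image reconstruction by varying the number of observations and comparing diffusion models with classical priors (such as sparse regularization and Tikhonov regularization) using pixel-level, structural, and downstream metrics. The comparison is carried out on low-dose chest wall computed tomography (CT) for fat mass quantification. First, we find that classical priors outperform diffusion priors when the number of projections is "sufficient". Second, we find that with very few observations diffusion priors capture a large amount of detail and significantly outperform classical priors. However, even with many observations they cannot capture all details. Finally, we find that the performance of diffusion priors levels off after very few (about 10-15) projections. Ultimately, our study reveals potential problems with diffusion-based sparse reconstruction and stresses the importance of further investigation in high-stakes clinical settings.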
https://arxiv.org/abs/2502.02771
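For reference, the two classical priors named in the abstract above are usually written as regularized least-squares problems. The notation below is an assumption made here for illustration and is not taken from the paper: $A$ is the projection (forward) operator, $\mathbf{y}$ the measured projections, and $\lambda > 0$ the regularization weight. The $\ell_1$ form shown for the sparse prior is only one common choice.

```latex
\hat{\mathbf{x}}_{\mathrm{Tik}}
  = \arg\min_{\mathbf{x}} \; \|A\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{x}\|_2^2
  = (A^\top A + \lambda I)^{-1} A^\top \mathbf{y},
\qquad
\hat{\mathbf{x}}_{\mathrm{sparse}}
  = \arg\min_{\mathbf{x}} \; \|A\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{x}\|_1 .
```

Both priors encode only simple analytical assumptions (small energy or few nonzero entries), so with too few projections they tend to give visibly degraded reconstructions rather than plausible but wrong ones.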
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) for sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
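Using attention sparsity to accelerate long-context large language models (LLMs) has become a popular research area. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget. This is a major obstacle in deployment because it ignores the dynamic nature of real-world scenarios, in which the best balance between accuracy and efficiency can differ widely. In this paper, we find that applying top-$p$ sampling (nucleus sampling) to sparse attention yields adaptive budgeting. Based on this, we propose the Twilight framework, which turns an existing sparse attention algorithm into one with adaptive sparsity without sacrificing its accuracy. Experiments show that Twilight can adaptively prune up to 98% of redundant tokens, giving a 15.4x speedup of self-attention operations and a 3.9x speedup of end-to-end per-token latency in long-context LLM decoding.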
https://arxiv.org/abs/2502.02770
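The idea of borrowing top-$p$ selection for attention, as described in the abstract above, can be illustrated with a small sketch. It is not the Twilight implementation: it assumes we already have raw attention scores for one query, and the function names and toy scores are hypothetical. The point is that the number of kept tokens adapts to how peaked the attention distribution is, instead of being a fixed budget.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(scores, p=0.9):
    """Return the smallest set of token indices whose attention
    weights sum to at least p."""
    weights = softmax(scores)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)


# One query with sharply peaked attention, one with flat attention.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
kept_peaked = top_p_indices(peaked, p=0.9)  # a single token carries the mass
kept_flat = top_p_indices(flat, p=0.9)      # every token is needed
```

A fixed top-k budget would keep the same number of tokens in both cases, wasting work on the peaked query or losing mass on the flat one.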
Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (<1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Starting with PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach -- which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) -- achieves 99.4% track and 97.7% shower classification F-scores, matching those of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M -- the largest open LArTPC dataset (1M+ events, 5.2B labeled points) -- to advance SSL in high energy physics (HEP). Project site: this https URL
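Effective self-supervised learning (SSL) techniques have become the key to unlocking large datasets for representation learning. Although many promising methods have been developed with online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, is still at an early stage. We propose a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce point clouds that are globally sparse (<1% occupancy) but locally dense, capturing meter-scale particle trajectories at millimeter resolution. Building on PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, and introduces an auxiliary energy infilling task to improve trajectory semantics. We call this approach the Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE). Without any labeled data, it reaches F-scores of 99.4% for track classification and 97.7% for shower classification, matching supervised baselines. Although the model learns rich representations of particle trajectories, it still has difficulty with sub-token phenomena such as overlapping or short-lived trajectories. To support further research, we release PILArNet-M, the largest open LArTPC dataset (more than 1 million events, 5.2 billion labeled points), to advance self-supervised learning in high energy physics. Project site: [this link](https://this-url.com)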
https://arxiv.org/abs/2502.02558
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical components of modern natural language processing (NLP) applications in information retrieval, question answering, and knowledge-based text generation. However, existing solutions are often fragmented, lacking a unified framework that easily integrates these essential processes. The absence of a standardized implementation, coupled with the complexity of retrieval and re-ranking workflows, makes it challenging for researchers to compare and evaluate different approaches in a consistent environment. While existing toolkits such as Rerankers and RankLLM provide general-purpose reranking pipelines, they often lack the flexibility required for fine-grained experimentation and benchmarking. In response to these challenges, we introduce Rankify, a powerful and modular open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework. Rankify supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models to enhance retrieval quality. Additionally, Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available at Huggingface (this https URL). To encourage adoption and ease of integration, we provide comprehensive documentation (this http URL), an open-source implementation on GitHub (this https URL), and a PyPI package for effortless installation (this https URL). By providing a unified and lightweight framework, Rankify allows researchers and practitioners to advance retrieval and re-ranking methodologies while ensuring consistency, scalability, and ease of use.
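Retrieval, re-ranking, and retrieval-augmented generation (RAG) are key components of modern natural language processing (NLP) applications in information retrieval, question answering, and knowledge-based text generation. However, existing solutions are often fragmented and lack a unified framework that integrates these processes easily. The absence of a standardized implementation, together with the complexity of retrieval and re-ranking workflows, makes it hard for researchers to compare and evaluate different methods in a consistent environment. Existing toolkits such as Rerankers and RankLLM provide general-purpose re-ranking pipelines, but they usually lack the flexibility needed for fine-grained experimentation and benchmarking. To address these challenges, we introduce Rankify, a powerful and modular open-source toolkit that unifies retrieval, re-ranking, and RAG in one coherent framework. Rankify supports a range of retrieval techniques, including dense and sparse retrievers, and integrates state-of-the-art re-ranking models to improve retrieval quality. Rankify also includes a set of pre-retrieved datasets for benchmarking, available on Huggingface (this link). To ease adoption and integration, we provide detailed documentation (this link), an open-source implementation on GitHub (this link), and a PyPI package for easy installation (this link). By offering a unified and lightweight framework, Rankify lets researchers and practitioners advance retrieval and re-ranking methods while ensuring consistency, scalability, and ease of use.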
https://arxiv.org/abs/2502.02464
3D Gaussian Splatting has emerged as an efficient photorealistic novel view synthesis method. However, its reliance on sparse Structure-from-Motion (SfM) point clouds consistently compromises the scene reconstruction quality. To address these limitations, this paper proposes a novel 3D reconstruction framework Gaussian Processes Gaussian Splatting (GP-GS), where a multi-output Gaussian Process model is developed to achieve adaptive and uncertainty-guided densification of sparse SfM point clouds. Specifically, we propose a dynamic sampling and filtering pipeline that adaptively expands the SfM point clouds by leveraging GP-based predictions to infer new candidate points from the input 2D pixels and depth maps. The pipeline utilizes uncertainty estimates to guide the pruning of high-variance predictions, ensuring geometric consistency and enabling the generation of dense point clouds. The densified point clouds provide high-quality initial 3D Gaussians to enhance reconstruction performance. Extensive experiments conducted on synthetic and real-world datasets across various scales validate the effectiveness and practicality of the proposed framework.
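3D Gaussian Splatting has emerged as an efficient method for photorealistic novel view synthesis, but its reliance on sparse Structure-from-Motion (SfM) point clouds consistently harms scene reconstruction quality. To address these limitations, this paper proposes a new 3D reconstruction framework, Gaussian Processes Gaussian Splatting (GP-GS), in which a multi-output Gaussian Process model is developed to densify sparse SfM point clouds adaptively and under the guidance of uncertainty. Specifically, we design a dynamic sampling and filtering pipeline that adaptively expands the SfM point cloud by using Gaussian-process predictions to infer new candidate points from the input 2D pixels and depth maps. The pipeline uses uncertainty estimates to guide the removal of high-variance predictions, which ensures geometric consistency and makes it possible to generate dense point clouds. The densified point clouds provide high-quality initial 3D Gaussians that improve reconstruction performance. Extensive experiments on synthetic and real-world datasets at various scales validate the effectiveness and practicality of the proposed framework.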
https://arxiv.org/abs/2502.02283
This paper proposes ShapeShifter, a new 3D generative model that learns to synthesize shape variations based on a single reference model. While generative methods for 3D objects have recently attracted much attention, current techniques often lack geometric details and/or require long training times and large resources. Our approach remedies these issues by combining sparse voxel grids and point, normal, and color sampling within a multiscale neural architecture that can be trained efficiently and in parallel. We show that our resulting variations better capture the fine details of their original input and can handle more general types of surfaces than previous SDF-based methods. Moreover, we offer interactive generation of 3D shape variants, allowing more human control in the design loop if needed.
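This paper proposes ShapeShifter, a new 3D generative model that learns to synthesize shape variations from a single reference model. Although generative methods for 3D objects have attracted wide attention in recent years, current techniques often lack geometric detail and require long training times and large resources. Our method addresses these problems by combining sparse voxel grids with point, normal, and color sampling inside a multiscale neural architecture that can be trained efficiently and in parallel. We show that the resulting variations capture the fine details of the original input better, and handle more general types of surfaces than previous SDF-based methods. In addition, we offer interactive generation of 3D shape variants, allowing more human control in the design process when needed.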
https://arxiv.org/abs/2502.02187
Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM's effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.
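Recent advances in computer vision (CV) and natural language processing (NLP) have inspired researchers to develop general-purpose graph foundation models pre-trained across multiple domains. However, the large differences in graph topology between domains pose a fundamental challenge. In addition, real-world graphs are usually sparse and vulnerable to noisy connections and adversarial attacks. To address these problems, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and exploits topological information across domains to promote robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology, and by refining the original graphs to remove noise and align their topological structures. To further strengthen knowledge transfer, we introduce an efficient prompt-tuning method. By aligning topologies, MDGFM not only improves multi-domain pre-training but also transfers knowledge robustly to unseen domains. Theoretical analysis provides guarantees for the effectiveness and domain generalization ability of MDGFM. Extensive experiments on homophilic and heterophilic graph datasets validate the robustness and efficacy of our method. The study offers a new perspective on handling graph data across domains, by addressing differences in topological structure and introducing techniques that improve the model's adaptation to new environments.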
https://arxiv.org/abs/2502.02017
Navigating densely vegetated environments poses significant challenges for autonomous ground vehicles. Learning-based systems typically use prior and in-situ data to predict terrain traversability but often degrade in performance when encountering out-of-distribution elements caused by rapid environmental changes or novel conditions. This paper presents a novel, lidar-only, online adaptive traversability estimation (TE) method that trains a model directly on the robot using self-supervised data collected through robot-environment interaction. The proposed approach utilises a probabilistic 3D voxel representation to integrate lidar measurements and robot experience, creating a salient environmental model. To ensure computational efficiency, a sparse graph-based representation is employed to update temporally evolving voxel distributions. Extensive experiments with an unmanned ground vehicle in natural terrain demonstrate that the system adapts to complex environments with as little as 8 minutes of operational data, achieving a Matthews Correlation Coefficient (MCC) score of 0.63 and enabling safe navigation in densely vegetated environments. This work examines different training strategies for voxel-based TE methods and offers recommendations for improving adaptability. The proposed method is validated on a robotic platform with limited computational resources (25W GPU), achieving accuracy comparable to offline-trained models while maintaining reliable performance across varied environments.
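Navigating densely vegetated environments is a major challenge for autonomous ground vehicles. Learning-based methods usually use prior and in-situ data to predict terrain traversability, but their performance tends to drop when they meet out-of-distribution elements caused by rapid environmental change or novel conditions. This paper presents a novel, lidar-only, online adaptive traversability estimation (TE) method that trains a model directly on the robot, using self-supervised data collected through interaction between the robot and its environment. The approach uses a probabilistic 3D voxel representation to combine lidar measurements with the robot's experience and build a salient model of the environment. For computational efficiency, a sparse graph-based representation is used to update the voxel distributions as they evolve over time. Extensive experiments with an unmanned ground vehicle in natural terrain show that the system adapts to complex environments with only 8 minutes of operational data, achieving a Matthews Correlation Coefficient (MCC) of 0.63 and enabling safe navigation in dense vegetation. The work examines different training strategies for voxel-based TE methods and gives recommendations for improving adaptability. The method is validated on a robotic platform with limited computing resources (25W GPU), where its accuracy is comparable to offline-trained models while its performance remains reliable across varied environments.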
https://arxiv.org/abs/2502.01987
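The abstract above reports its result as a Matthews Correlation Coefficient. As a reminder of what that score measures, here is a minimal sketch of the standard binary MCC formula. The labels are toy data invented for illustration, with 1 standing for "traversable" and 0 for "non-traversable" as an assumed convention.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews Correlation Coefficient from two lists of 0/1 labels.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    with 0.0 returned by convention when the denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels: one false negative and one false positive out of eight voxels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, and unlike plain accuracy it stays informative when the two classes are imbalanced.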
The history of art has seen significant shifts in the manner in which artworks are created, making understanding of creative processes a central question in technical art history. In the Renaissance and Early Modern period, paintings were largely produced by master painters directing workshops of apprentices who often contributed to projects. The masters varied significantly in artistic and managerial styles, meaning different combinations of artists and implements might be seen both between masters and within workshops or even individual canvases. Information on how different workshops were managed and the processes by which artworks were created remains elusive. Machine learning methods have potential to unearth new information about artists' creative processes by extending the analysis of brushwork to a microscopic scale. Analysis of workshop paintings, however, presents a challenge in that documentation of the artists and materials involved is sparse, meaning external examples are not available to train networks to recognize their contributions. Here we present a novel machine learning approach we call pairwise assignment training for classifying heterogeneity (PATCH) that is capable of identifying individual artistic practice regimes with no external training data, or "ground truth." The method achieves unsupervised results by supervised means, and outperforms both simple statistical procedures and unsupervised machine learning methods. We apply this method to two historical paintings by the Spanish Renaissance master, El Greco: The Baptism of Christ and Christ on the Cross with Landscape, and our findings regarding the former potentially challenge previous work that has assigned the painting to workshop members. Further, the results of our analyses create a measure of heterogeneity of artistic practice that can be used to characterize artworks across time and space.
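The history of art has seen major shifts in the way artworks are created, which makes understanding creative processes a central question in technical art history. In the Renaissance and Early Modern period, paintings were largely produced by master painters directing workshops of apprentices, who often took part in projects. Masters differed greatly in artistic and managerial style, so different combinations of artists and tools may appear between the works of different masters, within a single workshop, or even on a single canvas. How individual workshops were managed and how artworks were created remains elusive. Machine learning methods may reveal new information about artists' creative processes by extending the analysis of brushwork to the microscopic scale. The analysis of workshop paintings is difficult, however, because records of the artists and materials involved are scarce, so no external examples are available to train networks to recognize their contributions. Here we present a new machine learning method called pairwise assignment training for classifying heterogeneity (PATCH), which can identify individual regimes of artistic practice without external training data or "ground truth". The method achieves unsupervised results by supervised means, and outperforms both simple statistical procedures and unsupervised machine learning methods. We apply it to two historical paintings by the Spanish Renaissance master El Greco, The Baptism of Christ and Christ on the Cross with Landscape, and our findings may challenge earlier work that attributed the former to workshop members. In addition, the results of our analysis provide a measure of the heterogeneity of artistic practice that can be used to characterize artworks across time and space.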
https://arxiv.org/abs/2502.01912
Optimal Transport (OT) theory seeks to determine the map $T:X \to Y$ that transports a source measure $P$ to a target measure $Q$, minimizing the cost $c(\mathbf{x}, T(\mathbf{x}))$ between $\mathbf{x}$ and its image $T(\mathbf{x})$. Building upon the Input Convex Neural Network OT solver and incorporating the concept of displacement-sparse maps, we introduce a sparsity penalty into the minimax Wasserstein formulation, promote sparsity in displacement vectors $\Delta(\mathbf{x}) := T(\mathbf{x}) - \mathbf{x}$, and enhance the interpretability of the resulting map. However, increasing sparsity often reduces feasibility, causing $T_{\#}(P)$ to deviate more significantly from the target measure. In low-dimensional settings, we propose a heuristic framework to balance the trade-off between sparsity and feasibility by dynamically adjusting the sparsity intensity parameter during training. For high-dimensional settings, we directly constrain the dimensionality of displacement vectors by enforcing $\dim(\Delta(\mathbf{x})) \leq l$, where $l < d$ for $X \subseteq \mathbb{R}^d$. Among maps satisfying this constraint, we aim to identify the most feasible one. This goal can be effectively achieved by adapting our low-dimensional heuristic framework without resorting to dimensionality reduction. We validate our method on both synthesized sc-RNA and real 4i cell perturbation datasets, demonstrating improvements over existing methods.
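Optimal Transport (OT) theory aims to determine a map $T:X \to Y$ that transports a source measure $P$ to a target measure $Q$ while minimizing the cost $c(\mathbf{x}, T(\mathbf{x}))$ between a point $\mathbf{x}$ and its image $T(\mathbf{x})$. Building on the Input Convex Neural Network OT solver and the concept of displacement-sparse maps, we introduce a sparsity penalty into the minimax Wasserstein formulation, which promotes sparsity in the displacement vectors $\Delta(\mathbf{x}) := T(\mathbf{x}) - \mathbf{x}$ and improves the interpretability of the resulting map. However, increasing sparsity usually reduces feasibility, so that $T_{\#}(P)$ moves further from the target measure. In low-dimensional settings, we propose a heuristic framework that balances the trade-off between sparsity and feasibility by dynamically adjusting the sparsity intensity parameter during training. In high-dimensional settings, we directly constrain the dimensionality of the displacement vectors by enforcing $\dim(\Delta(\mathbf{x})) \leq l$, where $l < d$ for $X \subseteq \mathbb{R}^d$. Among all maps satisfying this constraint, we aim to find the most feasible one. This goal can be achieved effectively by adapting our low-dimensional heuristic framework, without any dimensionality reduction. We validate the method on synthetic single-cell RNA datasets and real 4i cell perturbation datasets, and show improvements over existing methods.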
https://arxiv.org/abs/2502.01889
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
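Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world use: even on high-performance GPUs, generating a few seconds of video usually takes tens of minutes. This inefficiency mainly comes from the quadratic computational complexity of 3D Full Attention with respect to context length. This paper proposes a training-free framework called Sparse VideoGen (SVG) that exploits the inherent sparsity of 3D Full Attention to improve inference efficiency. We find that attention heads can be dynamically divided into two groups according to their sparsity patterns: (1) Spatial Heads, where spatially related tokens within each frame dominate the attention output, and (2) Temporal Heads, where temporally related tokens across different frames dominate the attention output. Based on this insight, SVG proposes an online profiling strategy that captures these dynamic sparse patterns and predicts the type of each attention head. Combined with a novel, hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves end-to-end speedups of up to 2.28x on CogVideoX-v1.5 and 2.33x on HunyuanVideo while preserving generation quality.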
https://arxiv.org/abs/2502.01776
Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates transfer learning's role in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small-sampling is strongly associated with grokking, it is not its cause. Instead, small-sampling serves as a convenient mechanism for achieving the necessary distribution shift. We also show that when classes form an equivariant map, grokking can be explained by the model's ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking primarily arises from high regularization and sparse data, we demonstrate that it can also occur with dense data and minimal hyper-parameter tuning. Our findings deepen the understanding of grokking and pave the way for developing better stopping criteria in future training processes.
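Grokking, or delayed generalization, is an intriguing machine learning phenomenon in which a model's test set loss drops sharply only after its training set loss has converged. It challenges the conventional understanding of training dynamics in deep learning networks. This paper formalizes and investigates grokking, and stresses that a key factor in its emergence is a distribution shift between training and test data. We introduce two purpose-built datasets for analyzing grokking: one examines the impact of limited sampling, and the other explores the role of transfer learning in grokking. By inducing distribution shifts through controlled, imbalanced sampling of sub-categories, we systematically reproduce the phenomenon and show that although small sample size is strongly associated with grokking, it is not its cause. Instead, small sampling serves as a convenient mechanism for producing the necessary distribution shift. We also find that when classes form an equivariant map, grokking can be explained by the model's ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking arises mainly from high regularization and sparse data, we show that it can also occur with dense data and little hyper-parameter tuning. These findings deepen our understanding of grokking and lay the groundwork for better stopping criteria that could help future training processes avoid overfitting more effectively.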
https://arxiv.org/abs/2502.01774
Sample inefficiency is a long-standing challenge in deep reinforcement learning (DRL). Although dramatic improvements have been made, the problem is far from being solved and is especially challenging in environments with sparse or delayed rewards. In our work, we propose to use Adversarial Estimates as a new, simple and efficient approach to mitigate this problem for a class of feedback-based DRL algorithms. Our approach leverages latent similarity search from a small set of human-collected trajectories to boost learning, using only five minutes of human-recorded experience. The results of our study show that algorithms trained with Adversarial Estimates converge faster than their original versions. Moreover, we discuss how our approach could enable learning in feedback-based algorithms in extreme scenarios with very sparse rewards.
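Sample inefficiency is a long-standing challenge in deep reinforcement learning (DRL). Despite significant improvements in the field, the problem is still far from solved, and it is especially difficult in environments with sparse or delayed rewards. We propose a new method, Adversarial Estimates, to mitigate this problem for feedback-based DRL algorithms; the method is simple and efficient. It uses latent similarity search to extract information from a small set of human-collected trajectories and boost learning, and needs only about five minutes of human-recorded experience. Our results show that algorithms trained with Adversarial Estimates converge faster than their original versions. We also discuss how the approach could help feedback-based algorithms learn in extreme cases, such as those with very sparse rewards.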
https://arxiv.org/abs/2502.01558
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
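Dense process rewards have proven to be a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), especially in tasks that require complex multi-step reasoning. Dense rewards are also an attractive option for reinforcement learning (RL) of LLMs, because their fine-grained signal could address some inherent problems of outcome rewards, such as training efficiency and credit assignment, but this potential has not yet been fully realized. The main reason is the difficulty of training process reward models (PRMs) online: collecting high-quality process labels is expensive and difficult, which makes PRMs particularly vulnerable to reward hacking. To solve these problems, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, by means of implicit process rewards. PRIME combines well with various advantage functions and drops the dedicated reward model training phase that existing methods require, greatly reducing development overhead. We demonstrate the effectiveness of PRIME on competition math and coding tasks. Starting from Qwen2.5-Math-7B-Base, PRIME improves on the SFT model by 15.1% on average across several key reasoning benchmarks. Notably, our final model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks while using one tenth of its training data. The study underlines the large potential of dense process rewards for improving LLM performance on tasks that need complex reasoning, and shows how careful method design can overcome the challenges of online training.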
https://arxiv.org/abs/2502.01456
We study fundamental limitations of Graph Neural Networks (GNNs) for learning sparse matrix preconditioners. While recent works have shown promising results using GNNs to predict incomplete factorizations, we demonstrate that the local nature of message passing creates inherent barriers for capturing non-local dependencies required for optimal preconditioning. We introduce a new benchmark dataset of matrices where good sparse preconditioners exist but require non-local computations, constructed using both synthetic examples and real-world matrices. Our experimental results show that current GNN architectures struggle to approximate these preconditioners, suggesting the need for new architectural approaches beyond traditional message passing networks. We provide theoretical analysis and empirical evidence to explain these limitations, with implications for the broader use of GNNs in numerical linear algebra.
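We study the fundamental limitations of Graph Neural Networks (GNNs) in learning sparse matrix preconditioners. Although recent work has shown encouraging results when GNNs are used to predict incomplete factorizations, we demonstrate that the local nature of message passing creates inherent barriers to capturing the non-local dependencies that optimal preconditioning requires. We introduce a new benchmark dataset of matrices for which good sparse preconditioners exist but require non-local computation, constructed from both synthetic examples and real-world matrices. Our experiments show that current GNN architectures struggle to approximate these preconditioners, which suggests that new architectural approaches beyond traditional message passing networks are needed. We provide theoretical analysis and empirical evidence to explain these limitations, with implications for the broader use of GNNs in numerical linear algebra.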
https://arxiv.org/abs/2502.01397