Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain largely unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than during prefilling, and in the former it correlates with model size. 3) There is no strategy that performs best across all tasks and phases: different scenarios call for different units of sparsification or different degrees of budget adaptivity. Even moderate sparsity levels often cause significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws tailored to sparse attention, providing evidence that our findings are likely to hold beyond our experimental range. Through these insights, we demonstrate that sparse attention is a key tool for enhancing the ability of Transformer LLMs to process longer sequences, but it requires careful evaluation of trade-offs in performance-sensitive applications.
https://arxiv.org/abs/2504.17768
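To make the notion of attention sparsity above concrete, here is a toy numpy sketch of query-aware block-level top-k selection at decode time. The block size, mean-pooled scoring rule, and token budget are illustrative assumptions, not any particular method evaluated in the paper.

```python
import numpy as np

def topk_block_decode(q, K, V, block=16, keep=4):
    """One decoding step of query-aware block-sparse attention: score
    each block of cached keys by its mean key vector, keep the top
    `keep` blocks, and run softmax attention only over those tokens.
    A generic sketch of this family of methods, not a specific one."""
    n, d = K.shape
    nblk = n // block
    blk_means = K[:nblk * block].reshape(nblk, block, d).mean(axis=1)
    top = np.argsort(blk_means @ q)[-keep:]                  # coarse block selection
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    s = K[idx] @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()                    # softmax over kept keys
    return w @ V[idx]

rng = np.random.default_rng(0)
K = rng.normal(size=(256, 32)); V = rng.normal(size=(256, 32))
q = rng.normal(size=32)
out = topk_block_decode(q, K, V)      # attends to 64 of 256 cached tokens
```

The `block`/`keep` pair is exactly the kind of unit-of-sparsification and budget choice the paper's findings say must be tuned per task and phase.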
With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and the framework into geometry and texture parts, thereby reducing the training difficulty of each part in line with the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we use normal maps as the exclusive input to the geometry branch, reducing the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground-truth supervision. For the texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Extensive experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets.
https://arxiv.org/abs/2504.17670
The demand for realistic virtual immersive audio continues to grow, with Head-Related Transfer Functions (HRTFs) playing a key role. HRTFs capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception. It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment. Although machine learning has been shown to reduce the required number of measurement points and, thus, the measurement time, a controlled environment is still necessary. This paper addresses this constraint with a novel technique that can upsample sparse, noisy HRTF measurements. The proposed approach combines an HRTF Denoisy U-Net for denoising with an Autoencoding Generative Adversarial Network (AE-GAN) for upsampling from three measurement points. The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating its effectiveness in HRTF upsampling.
https://arxiv.org/abs/2504.17586
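For reference, the log-spectral distortion reported above is commonly computed as the RMS, over frequency bins, of the dB magnitude ratio between reference and estimated responses. A minimal sketch under that common definition (the paper may additionally average over measurement directions and ears):

```python
import numpy as np

def log_spectral_distortion(h_ref, h_est, eps=1e-12):
    """LSD in dB between two magnitude responses: the RMS over
    frequency bins of 20*log10(|H_ref|/|H_est|). A common textbook
    definition; the exact averaging used in the paper is an assumption."""
    h_ref = np.abs(np.asarray(h_ref, dtype=float)) + eps
    h_est = np.abs(np.asarray(h_est, dtype=float)) + eps
    diff_db = 20.0 * np.log10(h_ref / h_est)
    return float(np.sqrt(np.mean(diff_db ** 2)))

lsd = log_spectral_distortion([10.0, 10.0], [1.0, 1.0])  # a factor-of-10 error is about 20 dB
```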
This paper addresses the critical environmental challenge of estimating ambient Nitrogen Dioxide (NO$_2$) concentrations, a key issue in public health and environmental policy. Existing methods for satellite-based air pollution estimation model the relationship between satellite and in-situ measurements at select point locations. While these approaches have advanced our ability to provide air quality estimations on a global scale, they come with inherent limitations. The most notable limitation is the computational intensity required for generating comprehensive estimates over extensive areas. Motivated by these limitations, this study introduces a novel dense estimation technique. Our approach seeks to balance the accuracy of high-resolution estimates with the practicality of computational constraints, thereby enabling efficient and scalable global environmental assessment. By utilizing a uniformly random offset sampling strategy, our method disperses the ground truth data pixel location evenly across a larger patch. At inference, the dense estimation method can then generate a grid of estimates in a single step, significantly reducing the computational resources required to provide estimates for larger areas. Notably, our approach also surpasses the results of existing point-wise methods by a significant margin of $9.45\%$, achieving a Mean Absolute Error (MAE) of $4.98\ \mu\text{g}/\text{m}^3$. This demonstrates both high accuracy and computational efficiency, highlighting the applicability of our method for global environmental assessment. Furthermore, we showcase the method's adaptability and robustness by applying it to diverse geographic regions. Our method offers a viable solution to the computational challenges of large-scale environmental monitoring.
https://arxiv.org/abs/2504.17039
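The uniformly random offset sampling strategy described above can be sketched in a few lines: each training crop is taken so the ground-truth pixel lands at a uniformly random position inside the patch, which disperses labels evenly across patch positions and lets the model emit a full grid of estimates in one inference step. Border handling via reflect padding is an assumption for illustration.

```python
import numpy as np

def random_offset_crop(image, gt_row, gt_col, patch=128, rng=None):
    """Crop a patch x patch window so the ground-truth station pixel
    lands at a uniformly random offset inside it. Toy sketch of the
    sampling idea; real data is multi-channel satellite imagery."""
    rng = rng if rng is not None else np.random.default_rng()
    off_r = int(rng.integers(0, patch))        # uniform target position
    off_c = int(rng.integers(0, patch))
    top, left = gt_row - off_r, gt_col - off_c
    padded = np.pad(image, patch, mode="reflect")   # guard the borders
    crop = padded[top + patch:top + 2 * patch, left + patch:left + 2 * patch]
    return crop, (off_r, off_c)                # target's position inside the crop

rng = np.random.default_rng(1)
img = rng.normal(size=(200, 200))
crop, (r, c) = random_offset_crop(img, 50, 60, rng=np.random.default_rng(2))
```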
Many sparse attention mechanisms, such as Neighborhood Attention, have typically failed to consistently deliver speedup over the self-attention baseline. This is largely due to the complexity of attention infrastructure and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundation models, particularly in computer vision, are heavily bound by attention and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that GNA can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open-source our simulator and Blackwell kernels directly through the NATTEN project.
https://arxiv.org/abs/2504.16922
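The locality patterns that GNA unifies, and the simulator-style speedup upper bound, can be illustrated with simple 1-D attention masks. This is a toy: the real kernels operate on 2-D/3-D feature maps and account for tiling, which is exactly why mask density alone gives only an optimistic upper bound.

```python
import numpy as np

def gna_mask(n, window, stride=1, blocked=False):
    """1-D attention masks for the locality patterns GNA describes:
    sliding window (stride=1), strided sliding window (stride>1,
    queries in the same stride group share a window), and blocked
    attention. Illustrative only; parameters are assumptions."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    if blocked:                              # attend only within your block
        return (q // window) == (k // window)
    center = (q // stride) * stride          # window snaps to the stride grid
    return np.abs(k - center) <= window // 2

def speedup_upper_bound(mask):
    """Best-case speedup over dense attention: 1 / fraction of kept pairs."""
    return mask.size / mask.sum()

m = gna_mask(256, window=32, blocked=True)
bound = speedup_upper_bound(m)   # each token keeps 32 of 256 keys, so 8x at best
```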
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance from fine-grained token-level information inside each chunk. This approach improves the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval over contexts of up to 64 million tokens despite being pre-trained only on 4K-length contexts, along with significant improvements on various downstream tasks and a nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
https://arxiv.org/abs/2504.16795
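A toy single-query sketch of the hierarchical selection HSA describes: chunk-level scores are derived from fine-grained token-level similarities, the top-k chunks are kept, and their attention outputs are aggregated hierarchically. The hard-max scoring rule here is an illustrative stand-in for the learned token-to-chunk relevance in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hsa_step(q, K, V, chunk=8, topk=2):
    """Toy single-query hierarchical sparse attention: (1) score each
    chunk from token-level similarities, (2) keep the top-k chunks,
    (3) attend within chunks and aggregate chunk outputs with the
    chunk-level weights. Illustrative sketch only."""
    n, d = K.shape
    tok = (K @ q / np.sqrt(d)).reshape(n // chunk, chunk)
    chunk_scores = tok.max(axis=1)            # token-level evidence per chunk
    sel = np.argsort(chunk_scores)[-topk:]    # top-k chunk selection
    w_chunk = softmax(chunk_scores[sel])      # level 1: across chunks
    out = np.zeros(d)
    for w, c in zip(w_chunk, sel):
        w_tok = softmax(tok[c])               # level 2: within the chunk
        out += w * (w_tok @ V[c * chunk:(c + 1) * chunk])
    return out

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 16)); V = rng.normal(size=(64, 16))
q = rng.normal(size=16)
out = hsa_step(q, K, V)
```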
In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor-pose-guided methods are typically confined to synthesizing only simple motion patterns. To generate more controllable and precise human motions, we propose \textbf{ProMoGen (Progressive Motion Generation)}, a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions deliver precise action guidance without displacement. This decoupling enables independent refinement of both aspects, resulting in more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, since direct learning from sparse motions is inherently unstable, we introduce \textbf{SAP-CL (Sparse Anchor Posture Curriculum Learning)}, a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectories and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.
https://arxiv.org/abs/2504.16722
The burgeoning presence of multimodal content-sharing platforms propels the development of personalized recommender systems. Previous works usually suffer from data sparsity and cold-start problems, and may fail to adequately explore semantic user-product associations in multimodal data. To address these issues, we propose a novel Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework for user recommendation. For comprehensive information exploration of user-product relations, we construct two hypergraphs, i.e., a user-to-user (u2u) hypergraph and an item-to-item (i2i) hypergraph, to mine shared preferences among users and intricate multimodal semantic resemblance among items, respectively. This process yields denser second-order semantics that are fused with first-order user-item interactions as a complement to alleviate the data sparsity issue. Then, we design a contrastive feature enhancement paradigm by applying synergistic contrastive learning. By maximizing/minimizing the mutual information between second-order (e.g., shared preference patterns for users) and first-order (information on the items a user selected) embeddings of the same/different users and items, feature distinguishability can be effectively enhanced. Compared with using the sparse primary user-item interactions only, our MMHCL obtains denser second-order hypergraphs and excavates more abundant shared attributes to explore user-product associations, which to a certain extent alleviates the data sparsity and cold-start problems. Extensive experiments comprehensively demonstrate the effectiveness of our method. Our code is publicly available at: this https URL.
https://arxiv.org/abs/2504.16576
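The "maximizing/minimizing mutual information" step above is typically realised with an InfoNCE-style contrastive objective, where same-user first-order and second-order embeddings form positive pairs and other users in the batch act as negatives. A generic numpy sketch of that idea, not the exact MMHCL loss:

```python
import numpy as np

def info_nce(first_order, second_order, tau=0.2):
    """InfoNCE-style contrastive loss: the first-order embedding
    (from user-item interactions) and second-order embedding (from
    the hypergraph) of the SAME user are a positive pair; other rows
    are negatives. Generic sketch; tau and the cosine scoring are
    conventional choices, not taken from the paper."""
    a = first_order / np.linalg.norm(first_order, axis=1, keepdims=True)
    b = second_order / np.linalg.norm(second_order, axis=1, keepdims=True)
    logits = a @ b.T / tau                        # cosine similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # low when positives align

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
aligned = info_nce(x, x)                          # matched pairs: low loss
shuffled = info_nce(x, np.roll(x, 1, axis=0))     # mismatched pairs: high loss
```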
Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to make effective use of very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.
https://arxiv.org/abs/2504.16545
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we extract the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validate CPRAformer's robustness in other image restoration tasks and downstream applications.
https://arxiv.org/abs/2504.16455
In this paper, we introduce a novel data transformation framework based on Opposition-Based Learning (OBL) to boost the performance of traditional classification algorithms. Originally developed to accelerate convergence in optimization tasks, OBL is leveraged here to generate synthetic opposite samples that replace the original training data and improve decision boundary formation. We explore three OBL variants: Global OBL, Class-Wise OBL, and Localized Class-Wise OBL, and integrate them with several widely used classifiers, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression (LR), and Decision Tree (DT). Extensive experiments conducted on 26 heterogeneous and high-dimensional datasets demonstrate that OBL-enhanced classifiers consistently outperform their standard counterparts in terms of accuracy and F1-score, frequently achieving near-perfect or perfect classification. Furthermore, OBL contributes to improved computational efficiency, particularly for SVM and LR. These findings underscore the potential of OBL as a lightweight yet powerful data transformation strategy for enhancing classification performance, especially in complex or sparse learning environments.
https://arxiv.org/abs/2504.16268
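The core OBL transformation is simple enough to state directly: the opposite of a sample x within bounds [a, b] is a + b - x. A minimal sketch of the global and class-wise variants (the localized variant additionally restricts the bounds to a local neighborhood, omitted here):

```python
import numpy as np

def opposite_samples(X, mode="global", y=None):
    """Opposition-Based Learning: the opposite of x within bounds
    [a, b] is a + b - x. 'global' uses feature bounds over the whole
    dataset; 'classwise' uses per-class bounds. Sketch of two of the
    three variants discussed above."""
    X = np.asarray(X, dtype=float)
    if mode == "global":
        lo, hi = X.min(axis=0), X.max(axis=0)
        return lo + hi - X
    assert mode == "classwise" and y is not None
    out = np.empty_like(X)
    for c in np.unique(y):
        m = (y == c)
        lo, hi = X[m].min(axis=0), X[m].max(axis=0)
        out[m] = lo + hi - X[m]
    return out

X = np.array([[0.0, 1.0], [2.0, 3.0]])
opp = opposite_samples(X)   # each row reflected about the per-feature bounds
```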
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
https://arxiv.org/abs/2504.16264
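As a reminder of what the sparse baseline computes, here is a minimal self-contained BM25 scorer over whitespace tokens. It is a textbook formulation with conventional k1 and b defaults, not the exact configuration benchmarked in the paper:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: score each whitespace-tokenised document against
    a query using tf, idf, and document-length normalisation. A
    textbook sketch of the classic sparse retriever."""
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))                 # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["cross lingual retrieval", "cats and dogs", "academic retrieval in french"]
scores = bm25_scores("cross lingual retrieval", docs)
```

In the CLIR setting, the "document translation" variant would simply run this scorer over machine-translated documents so query and documents share a language.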
Myocardial perfusion imaging (MPI) with single-photon emission computed tomography (SPECT) is a widely used and cost-effective diagnostic tool for coronary artery disease. However, the lengthy scanning time in this imaging procedure can cause patient discomfort, motion artifacts, and potentially inaccurate diagnoses due to misalignment between the SPECT scans and the CT scans acquired for attenuation compensation. Reducing the number of projection angles is a potential way to shorten scanning time, but this can adversely impact the quality of the reconstructed images. To address this issue, we propose a detection-task-specific deep-learning method for sparse-view MPI SPECT images. This method integrates an observer loss term that penalizes the loss of anthropomorphic channel features, with the goal of improving performance on the perfusion defect-detection task. We observed that, on the task of detecting myocardial perfusion defects, the proposed method yielded an area under the receiver operating characteristic (ROC) curve (AUC) significantly larger than that of the sparse-view protocol. Further, the proposed method was able to restore the structure of the left ventricle wall, demonstrating an ability to overcome sparse-sampling artifacts. Our preliminary results motivate further evaluations of the method.
https://arxiv.org/abs/2504.16171
The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the prefilling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at this https URL.
https://arxiv.org/abs/2504.16083
This paper investigates the impact of different optimizers on the grokking phenomenon, where models exhibit delayed generalization. We conducted experiments across seven numerical tasks (primarily modular arithmetic) using a modern Transformer architecture. The experimental configuration systematically varied the optimizer (Muon vs. AdamW) and the softmax activation function (standard softmax, stablemax, and sparsemax) to assess their combined effect on learning dynamics. Our empirical evaluation reveals that the Muon optimizer, characterized by its use of spectral norm constraints and second-order information, significantly accelerates the onset of grokking compared to the widely used AdamW optimizer. Specifically, Muon reduced the mean grokking epoch from 153.09 to 102.89 across all configurations, a statistically significant difference (t = 5.0175, p = 6.33e-08). This suggests that the optimizer choice plays a crucial role in facilitating the transition from memorization to generalization.
https://arxiv.org/abs/2504.16041
We introduce ViSMaP: Unsupervised Video Summarisation by Meta Prompting, a system that summarises hour-long videos with no supervision. Most existing video understanding models work well on short videos of pre-segmented events, yet they struggle to summarise longer videos where relevant events are sparsely distributed and not pre-segmented. Moreover, long-form video understanding often relies on supervised hierarchical training that needs extensive annotations, which are costly, slow, and prone to inconsistency. With ViSMaP we bridge the gap between short videos (where annotated data is plentiful) and long ones (where it is not). We rely on LLMs to create optimised pseudo-summaries of long videos using segment descriptions from short ones. These pseudo-summaries are used as training data for a model that generates long-form video summaries, bypassing the need for expensive annotations of long videos. Specifically, we adopt a meta-prompting strategy to iteratively generate and refine pseudo-summaries of long videos. The strategy leverages short clip descriptions obtained from a supervised short-video model to guide the summary. Each iteration uses three LLMs working in sequence: one to generate the pseudo-summary from clip descriptions, another to evaluate it, and a third to optimise the prompt of the generator. This iteration is necessary because the quality of the pseudo-summaries is highly dependent on the generator prompt and varies widely among videos. We evaluate our summaries extensively on multiple datasets; our results show that ViSMaP achieves performance comparable to fully supervised state-of-the-art models while generalising across domains without sacrificing performance. Code will be released upon publication.
https://arxiv.org/abs/2504.15921
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
https://arxiv.org/abs/2504.15888
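The core idea of the Adaptive Fusion module, a per-voxel dynamic balance between the two modalities, can be illustrated with a minimal gate. In the real module the gate scores would be predicted by a learned network from both feature volumes; here they are simply passed in, so this is a sketch of the blending step only.

```python
import numpy as np

# Illustrative per-voxel gate in the spirit of MS-Occ's Adaptive Fusion (AF)
# module; gate_weights are supplied directly rather than learned.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(lidar_voxels, camera_voxels, gate_weights):
    """Convex per-voxel blend of LiDAR and camera voxel features.

    gate_weights: raw (pre-sigmoid) scores, one per voxel. alpha -> 1 trusts
    LiDAR geometry; alpha -> 0 trusts camera semantics.
    """
    alpha = sigmoid(gate_weights)  # squashed into (0, 1)
    return alpha * lidar_voxels + (1.0 - alpha) * camera_voxels
```

With a zero gate the two modalities are averaged; a strongly positive gate recovers the LiDAR features almost exactly.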
Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.
https://arxiv.org/abs/2504.15665
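For reference, the traditional low-rank plus sparse split that the paper improves on (not the proposed INR/ADMM method itself) can be shown in a toy form: a rank-1 background is estimated by truncated SVD and the dim target falls out of the soft-thresholded residual.

```python
import numpy as np

# Toy low-rank + sparse baseline: rank-1 background via truncated SVD,
# sparse target layer via soft-thresholding of the residual. This is the
# classic model the paper's motion-enhanced nonlocal INR replaces.

def soft_threshold(x, tau):
    """Elementwise shrinkage operator used for the sparse (target) layer."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def detect_targets(frame, tau):
    """Split an image matrix into a rank-1 background and a sparse residual."""
    U, s, Vt = np.linalg.svd(frame, full_matrices=False)
    s[1:] = 0.0                       # keep only the top singular component
    background = (U * s) @ Vt
    targets = soft_threshold(frame - background, tau)
    return background, targets
```

On a synthetic rank-1 background with one injected point target, the residual peaks at the target location while the smooth background is absorbed by the SVD.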
Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent-space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that the grouping mechanism enhances representation quality, by showing that grouping reduces the multi-information among latent variables, including a proof of the data-processing inequality for multi-information. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In future work, we hope to extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.
https://arxiv.org/abs/2504.16140
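A penalty that favors shared latents among semantically related features can be illustrated with a simple group-sparsity term (L2 within each group, summed across groups). This is a generic sketch; SparseJEPA's exact penalty is not reproduced here, and the group assignments are assumed given.

```python
import numpy as np

# Illustrative group-sparsity penalty in the spirit of SparseJEPA's
# shared-latent grouping; groups are supplied rather than learned.

def sparsity_penalty(z, groups, lam=0.1):
    """L2-per-group / L1-across-groups penalty on a latent vector z.

    groups: list of index arrays. Variables in the same group are encouraged
    to be jointly active or jointly silent, mimicking shared semantics.
    """
    return lam * sum(np.linalg.norm(z[g]) for g in groups)
```

The penalty is lower when activation is concentrated in one group than when the same total activation is spread over several groups, which is what pushes semantically related features toward shared latents.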
Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from noise through a process called reverse diffusion. Understanding the dynamics of the reverse diffusion process is crucial for steering the generation and achieving high sample quality. However, the inner workings of diffusion models are still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic Interpretability (MI) techniques, such as Sparse Autoencoders (SAEs), aim to uncover the operating principles of models through granular analysis of their internal representations. These MI techniques have been successful in understanding and steering the behavior of large language models at scale. However, SAEs have not yet been applied to gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we show that the discovered concepts have a causal effect on the model output and can be leveraged to steer the generative process. We design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in the early stages of diffusion, image composition can be effectively controlled; (2) in the middle stages, image composition is already finalized, yet stylistic interventions remain effective; and (3) in the final stages, only minor textural details are subject to change.
https://arxiv.org/abs/2504.15473
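The SAE probe at the heart of this approach is structurally simple: a wide ReLU encoder producing nonnegative codes, a linear decoder, and a reconstruction loss with an L1 penalty that makes the codes sparse. The minimal sketch below uses random, untrained weights purely to show the forward pass and loss; the paper's trained SAE, layer choice, and hyperparameters are not reproduced.

```python
import numpy as np

# Minimal sparse autoencoder of the kind used to probe model activations.
# Weights are random here; in practice they are trained on activation data
# with the reconstruction + L1 objective below.

rng = np.random.default_rng(0)

class SparseAutoencoder:
    def __init__(self, d_act, d_code):
        # overcomplete code: d_code > d_act so concepts can disentangle
        self.W_enc = rng.normal(0.0, 0.1, (d_act, d_code))
        self.b_enc = np.zeros(d_code)
        self.W_dec = rng.normal(0.0, 0.1, (d_code, d_act))
        self.b_dec = np.zeros(d_act)

    def encode(self, x):
        # ReLU keeps codes nonnegative, so each unit reads as a "concept"
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def forward(self, x):
        z = self.encode(x)
        return z @ self.W_dec + self.b_dec, z

    def loss(self, x, l1=1e-3):
        # reconstruction error plus L1 sparsity pressure on the codes
        x_hat, z = self.forward(x)
        return np.mean((x - x_hat) ** 2) + l1 * np.abs(z).sum()
```

Probing then amounts to collecting activations from a chosen diffusion-model layer, encoding them, and inspecting which code units fire where, e.g. per spatial location, to read off the emerging scene composition.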