To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (this https URL).
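A minimal sketch of what a ReLU-plus-RMSNorm router could look like, assuming a standard PyTorch setup; the module and dimension names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ReLURMSNormRouter(nn.Module):
    """Hypothetical router: ReLU makes expert selection differentiable and
    naturally sparse (no fixed top-k), RMSNorm rescales the surviving
    activations. The learnable RMSNorm gain is omitted for brevity."""
    def __init__(self, d_model: int, n_experts: int, eps: float = 1e-6):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = torch.relu(self.proj(x))                # zeros = inactive experts
        rms = scores.pow(2).mean(dim=-1, keepdim=True)   # RMS over the expert dim
        return scores * torch.rsqrt(rms + self.eps)      # normalized gate weights

# weights = ReLURMSNormRouter(512, 16)(torch.randn(2, 8, 512))  # -> [2, 8, 16]
```

Because ReLU zeroes experts directly, the number of active experts per token is data-dependent rather than a fixed top-k, which is the flexibility the abstract refers to.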
https://arxiv.org/abs/2507.08771
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
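To make the hierarchical-consistency idea concrete, here is a hedged sketch of a bottom-up constraint in which a coarse class's probability must agree with the sum of its children's; the toy hierarchy and loss form are assumptions for exposition, not the published BHCCM:

```python
import torch
import torch.nn.functional as F

# Toy tree: coarse class 0 -> fine {0, 1}; coarse class 1 -> fine {2, 3, 4}.
CHILDREN = {0: [0, 1], 1: [2, 3, 4]}

def parents_from_children(fine_probs: torch.Tensor) -> torch.Tensor:
    """Bottom-up aggregation: a coarse class's probability is the sum of its
    children's, so multi-granularity outputs stay consistent with the tree."""
    cols = [fine_probs[..., idx].sum(dim=-1, keepdim=True)
            for idx in CHILDREN.values()]
    return torch.cat(cols, dim=-1)

def consistency_loss(coarse_logits: torch.Tensor,
                     fine_probs: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the coarse head and the aggregated
    fine-level prediction (one plausible form of a bidirectional constraint)."""
    return F.mse_loss(torch.softmax(coarse_logits, dim=-1),
                      parents_from_children(fine_probs))
```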
https://arxiv.org/abs/2507.08741
Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without access to instance-level annotations. This is usually the case in digital pathology, which involves gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing \textbf{SGPMIL}, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.
https://arxiv.org/abs/2507.08711
Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. The inward pathway complements it by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward pathway selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.
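As a rough illustration of the two pathways, the sketch below appends retrieved KG triple embeddings to the attention key/value set (outward) and filters triples by similarity to the input (inward); the function names, shapes, and scoring rule are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def outward_kg_attention(h, kg_emb, W_q, W_k, W_v):
    """Outward pathway, sketched: input tokens attend over themselves plus
    retrieved KG triple embeddings, so external knowledge is fused at
    test time with no parameter updates. h: [seq, d], kg_emb: [n, d]."""
    kv = torch.cat([h, kg_emb], dim=0)                 # extend keys/values with facts
    q, k, v = h @ W_q, kv @ W_k, kv @ W_v
    att = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return att @ v

def inward_triple_filter(h, kg_emb, top_k: int = 4):
    """Inward pathway, sketched: keep only the triples most similar to the
    input summary and feed them back into the fusion step (closed loop)."""
    sim = F.normalize(kg_emb, dim=-1) @ F.normalize(h.mean(0, keepdim=True), dim=-1).T
    idx = sim.squeeze(-1).topk(min(top_k, kg_emb.shape[0])).indices
    return kg_emb[idx]
```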
https://arxiv.org/abs/2507.08704
Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear $O(n)$ time complexity that enables successful long-sequence processing without a performance trade-off. WERSA merges content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of the data while preserving linear efficiency. Large-scale comparisons \textbf{on a single GPU} across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer) reveal consistent advantages of WERSA: it achieves the best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2\% (86.2\% vs 85.0\%) while cutting training time by 81\% (296s vs 1554s) and FLOPS by 73.4\% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla attention and FlashAttention-2 fail: on ArXiv-128k's extremely long sequences, it achieves the best accuracy (79.1\%) and AUC (0.979) among viable methods, operating on data that causes Out-Of-Memory errors for quadratic methods while being \textbf{twice as fast} as Waveformer, its next-best competitor. By significantly reducing computational load without compromising accuracy, WERSA makes practical, affordable long-context models possible, particularly on low-resource hardware, for more sustainable and scalable AI development.
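The random-spectral half of the mechanism can be illustrated with a standard positive-random-feature (Performer-style) linear attention sketch; the wavelet branch and the learnable scale gating are omitted here, so this is a simplification under stated assumptions, not WERSA itself:

```python
import torch

def random_feature_attention(q, k, v, n_features: int = 64):
    """Positive random features approximate the softmax kernel, turning
    attention into two matrix products that cost O(n) in sequence length.
    q, k, v: [seq, d]."""
    d = q.shape[-1]
    q, k = q / d ** 0.25, k / d ** 0.25                  # standard kernel scaling
    w = torch.randn(d, n_features)                       # random spectral projection
    phi_q = torch.exp(q @ w - q.pow(2).sum(-1, keepdim=True) / 2)
    phi_k = torch.exp(k @ w - k.pow(2).sum(-1, keepdim=True) / 2)
    out = phi_q @ (phi_k.T @ v)                          # [seq, d]; never forms n x n
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).T         # normalizer, [seq, 1]
    return out / (z + 1e-6)
```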
https://arxiv.org/abs/2507.08637
This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.
https://arxiv.org/abs/2507.08636
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
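A small sketch of the coordinate change, under the assumption that each text block is reduced to its center point:

```python
import math

def relative_polar(block_a, block_b):
    """Relative position of block b as seen from block a, in polar form:
    distance r and angle theta, replacing absolute (x, y) embeddings.
    Blocks are (x_center, y_center) tuples; purely illustrative."""
    dx, dy = block_b[0] - block_a[0], block_b[1] - block_a[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)  # theta in (-pi, pi]

# e.g. a label block 28 units above a value block:
# r, theta = relative_polar((10.0, 40.0), (10.0, 12.0))
```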
https://arxiv.org/abs/2507.08606
This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. The framework was evaluated on BraTS 2020 dataset comprising 369 multi-institutional MRI scans. Results: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.
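One plausible reading of the bidirectional visual-semantic attention in PyTorch, with illustrative dimensions; this sketches the general pattern, not the published BIVA module:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Vision features attend to text and text attends back to vision,
    each stream updated residually; iterating the block exchanges
    information in both directions. Dimensions are illustrative."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vis, txt):
        vis_upd, _ = self.v2t(query=vis, key=txt, value=txt)  # vision reads text
        txt_upd, _ = self.t2v(query=txt, key=vis, value=vis)  # text reads vision
        return vis + vis_upd, txt + txt_upd
```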
https://arxiv.org/abs/2507.08574
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: this https URL
https://arxiv.org/abs/2507.08557
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose \textbf{D}isentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9\% and 11.9\%, respectively, on the SemanticKITTI hidden test. The code is available at this https URL.
https://arxiv.org/abs/2507.08555
The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI model scaling. Utilizing an innovative architecture that integrates Mamba layers, linear self-attention, and a Mixture of Experts framework, White-Basilisk achieves state-of-the-art results in vulnerability detection tasks with a parameter count of only 200M. The model's capacity to process sequences of unprecedented length enables comprehensive analysis of extensive codebases in a single pass, surpassing the context limitations of current Large Language Models (LLMs). White-Basilisk exhibits robust performance on imbalanced, real-world datasets, while maintaining computational efficiency that facilitates deployment across diverse organizational scales. This research not only establishes new benchmarks in code security but also provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks, potentially redefining optimization strategies in AI development for domain-specific applications.
https://arxiv.org/abs/2507.08540
Transformers excel when dealing with sequential data. Generalizing transformer models to geometric domains, such as manifolds, we encounter the problem of not having a well-defined global order. We propose a solution with attention heads following a space-filling curve. As a first experimental example, we present the Spiroformer, a transformer that follows a polar spiral on the $2$-sphere.
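A short sketch of the ordering device, assuming unit-sphere inputs: sampling a polar spiral gives every point on the sphere a position along a single 1D curve, which restores a global order for attention:

```python
import numpy as np

def spherical_spiral(n_points: int, n_turns: int = 8) -> np.ndarray:
    """Points along a polar spiral on the unit 2-sphere: the polar angle
    sweeps pole to pole while the azimuth winds n_turns times, so every
    sample gets a single well-defined position along one curve."""
    t = np.linspace(0.0, 1.0, n_points)
    phi = np.pi * t                       # polar angle: north pole -> south pole
    lam = 2.0 * np.pi * n_turns * t       # winding azimuth
    return np.stack([np.sin(phi) * np.cos(lam),
                     np.sin(phi) * np.sin(lam),
                     np.cos(phi)], axis=-1)

# Tokens visited in the order of spherical_spiral(1024) inherit a 1D sequence.
```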
https://arxiv.org/abs/2507.08456
Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation engines often provide terse reports in English that are difficult for non-technical users to interpret and act upon. This paper presents xpSHACL, an explainable SHACL validation system that addresses this issue by combining rule-based justification trees with retrieval-augmented generation (RAG) and large language models (LLMs) to produce detailed, multilingual, human-readable explanations for constraint violations. A key feature of xpSHACL is its usage of a Violation KG to cache and reuse explanations, improving efficiency and consistency.
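The caching idea can be sketched as a keyed store of generated explanations; the key scheme and interfaces below are hypothetical, not the xpSHACL API:

```python
import hashlib

class ViolationKG:
    """Keyed store of generated explanations: identical violations hit the
    cache instead of re-invoking the RAG + LLM pipeline, which keeps
    explanations cheap and consistent across runs and languages."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(constraint_id: str, violation_kind: str) -> str:
        return hashlib.sha256(f"{constraint_id}|{violation_kind}".encode()).hexdigest()

    def explain(self, constraint_id, violation_kind, lang, generate_fn):
        per_lang = self._store.setdefault(self._key(constraint_id, violation_kind), {})
        if lang not in per_lang:            # cache miss: generate once, reuse after
            per_lang[lang] = generate_fn(constraint_id, violation_kind, lang)
        return per_lang[lang]
```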
https://arxiv.org/abs/2507.08432
Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level semantic knowledge for image instances. To ensure effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs), we introduce the Attention Mutual-Guidance (AMG) module, which facilitates interactions between visual and semantic information. Through mutual guidance, the AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that integrates SCP and VCP with contextual prompts, ensuring seamless coordination among the different prompts and enhancing the modeling of class embeddings and instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art methods on 14 different datasets. The code will be made available after publication.
https://arxiv.org/abs/2507.08410
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
https://arxiv.org/abs/2507.08380
Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, few works systematically explore which methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
https://arxiv.org/abs/2507.08339
Watermarking technology has gained significant attention due to the increasing importance of intellectual property (IP) rights, particularly with the growing deployment of large language models (LLMs) on billions of resource-constrained edge devices. To counter the potential threats of IP theft by malicious users, this paper introduces a robust watermarking scheme for transformer models that requires no retraining or fine-tuning. The scheme generates a unique key for each user and derives a stable watermark value by solving linear constraints constructed from model invariants. Moreover, it utilizes a noise mechanism to hide watermark locations in multi-user scenarios against collusion attacks. This paper evaluates the approach on three popular models (Llama3, Phi3, Gemma), and the experimental results confirm strong robustness across a range of attack methods (fine-tuning, pruning, quantization, permutation, scaling, reversible matrix, and collusion attacks).
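As a loose illustration of key-dependent watermark derivation, the sketch below substitutes a key-seeded random projection for the paper's linear-constraint solving; every name and choice here is an assumption for exposition:

```python
import numpy as np

def derive_watermark(weight_mats, user_key: int, n_bits: int = 64) -> np.ndarray:
    """Collect weight statistics that survive benign transforms (singular
    values are invariant to permutation and orthogonal changes of basis),
    project them with a key-seeded random matrix, and threshold to bits.
    A loose stand-in for the paper's linear-constraint construction."""
    inv = np.concatenate([np.linalg.svd(w, compute_uv=False)[:8] for w in weight_mats])
    rng = np.random.default_rng(user_key)         # one key, hence one matrix, per user
    A = rng.standard_normal((n_bits, inv.shape[0]))
    return (A @ inv > 0).astype(np.uint8)         # stable per-user bit string
```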
https://arxiv.org/abs/2507.08288
In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can afford fewer layers than 2D models, with reduced receptive fields and levels of abstraction. For transformers, although long-range correlations can be captured by multi-head attention, its quadratic complexity with respect to input size is computationally demanding. Therefore, either model may require input size reduction to allow more filters and layers for better segmentation. Nevertheless, given their discrete nature, models trained with patch-wise training or image downsampling may produce suboptimal results when applied to higher resolutions. To address this issue, here we propose the resolution-robust HNOSeg-XS architecture. We model image segmentation by learnable partial differential equations through the Fourier neural operator, which has the zero-shot super-resolution property. By replacing the Fourier transform with the Hartley transform and reformulating the problem in the frequency domain, we created the HNOSeg-XS model, which is resolution-robust, fast, memory-efficient, and extremely parameter-efficient. When tested on the BraTS'23, KiTS'23, and MVSeg'23 datasets with a Tesla V100 GPU, HNOSeg-XS showed superior resolution robustness with fewer than 34.7k model parameters. It also achieved the best overall inference time (< 0.24 s) and memory efficiency (< 1.8 GiB) compared to the tested CNN and transformer models.
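The Fourier-to-Hartley substitution is easy to make concrete: the discrete Hartley transform is real-to-real and can be computed from an FFT as Re(F) - Im(F). A minimal sketch:

```python
import numpy as np

def dht(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Discrete Hartley transform via the FFT: H = Re(F) - Im(F).
    It is real-to-real, so a Hartley-domain operator needs no complex
    weights, and it is its own inverse up to a factor of N."""
    f = np.fft.fft(x, axis=axis)
    return f.real - f.imag

# x = np.random.rand(8); np.allclose(x, dht(dht(x)) / x.size)  # -> True
```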
https://arxiv.org/abs/2507.08205
Red-blood-cell lysis (HC50) is the principal safety barrier for antimicrobial-peptide (AMP) therapeutics, yet existing models only say "toxic" or "non-toxic." AmpLyze closes this gap by predicting the actual HC50 value from sequence alone and explaining the residues that drive toxicity. The model couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches, aligned by a cross-attention module and trained with log-cosh loss for robustness to assay noise. The optimal AmpLyze model reaches a PCC of 0.756 and an MSE of 0.987, outperforming classical regressors and the state-of-the-art. Ablations confirm that both branches are essential, and cross-attention adds a further 1% PCC and 3% MSE improvement. Expected-Gradients attributions reveal known toxicity hotspots and suggest safer substitutions. By turning hemolysis assessment into a quantitative, sequence-based, and interpretable prediction, AmpLyze facilitates AMP design and offers a practical tool for early-stage toxicity screening.
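For reference, the log-cosh loss mentioned above can be written in a numerically stable form using the identity log cosh(e) = e + softplus(-2e) - log 2; a minimal PyTorch sketch:

```python
import math
import torch
import torch.nn.functional as F

def log_cosh_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """log cosh(e) behaves like e^2/2 for small errors and |e| - log 2 for
    large ones, so outlier assay readings pull the fit less than under MSE.
    The softplus form avoids overflow for large |e|."""
    e = pred - target
    return (e + F.softplus(-2.0 * e) - math.log(2.0)).mean()
```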
https://arxiv.org/abs/2507.08162
Traffic accidents are rare, yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data and encodes and fuses these multimodal data via shallow cross attention. Following a local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence, the model yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: this https URL
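A hedged sketch of how a volatility pre-score might gate the context window; the score definition and thresholds are illustrative assumptions, not the paper's:

```python
import numpy as np

def select_context_window(series: np.ndarray,
                          windows=(6, 12, 24, 48),
                          thresholds=(0.2, 0.5, 1.0)) -> int:
    """Coefficient of variation of the recent signal as a volatility
    pre-score: calm inputs get a short context window, volatile inputs a
    long one, so compute is spent only where history matters."""
    recent = series[-windows[0]:]
    score = float(np.std(recent) / (abs(float(np.mean(recent))) + 1e-6))
    for thr, win in zip(thresholds, windows):
        if score < thr:
            return win
    return windows[-1]
```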
https://arxiv.org/abs/2507.08153