This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class membership. The model's design enables robust handling of intra-class variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
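As a rough illustration of the pipeline described above (CNN features, per-class k-means prototypes, energy-based assignment), the following Python sketch replaces the Hopfield relaxation dynamics with a direct nearest-well energy rule; the choice of four wells per class and the squared-distance energy are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_wells(features, labels, n_classes=10, wells_per_class=4):
    # Cluster each class's CNN features into prototype "wells" (attractors).
    wells = {}
    for c in range(n_classes):
        km = KMeans(n_clusters=wells_per_class, n_init=10).fit(features[labels == c])
        wells[c] = km.cluster_centers_
    return wells

def classify(x, wells):
    # Energy of class c: squared distance from x to its nearest well; the
    # prediction is the class whose basin the feature vector falls into.
    energies = {c: np.min(np.sum((P - x) ** 2, axis=1)) for c, P in wells.items()}
    return min(energies, key=energies.get)
```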
https://arxiv.org/abs/2507.08766
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
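The abstract does not spell out BHCCM's formulation; as a hedged sketch of one direction of such a consistency constraint, a parent class's probability can be tied to the sum of its children's probabilities, with the mismatch penalized as a loss term (the `child_to_parent` mapping and the squared penalty below are assumptions):

```python
import torch

def hierarchy_consistency_loss(child_logits, parent_logits, child_to_parent):
    # child_to_parent: LongTensor where entry i is the parent index of fine class i.
    child_p = child_logits.softmax(dim=1)        # (B, n_children)
    parent_p = parent_logits.softmax(dim=1)      # (B, n_parents)
    agg = torch.zeros_like(parent_p)
    agg.index_add_(1, child_to_parent, child_p)  # sum child mass per parent class
    return ((agg - parent_p) ** 2).mean()
```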
https://arxiv.org/abs/2507.08741
Constructing image datasets typically depends on time-intensive and inefficient manual collection and annotation. Large models offer a solution via data generation; nonetheless, real-world data remain clearly more valuable than AI-generated data, particularly for constructing image datasets. For this reason, we propose DatasetAgent, a novel method for automatically constructing datasets from real-world images using a multi-agent collaborative system. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), together with a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. We conduct two types of experiments on a variety of open-source datasets: expanding existing datasets and creating new ones from scratch. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
https://arxiv.org/abs/2507.08648
Transformer models are computationally costly on long sequences because standard attention has quadratic O(n^2) time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear O(n) time complexity that enables long-sequence processing without a performance trade-off. WERSA merges content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of the data while preserving linear efficiency. Large-scale comparisons on a single GPU, across various benchmarks (vision, NLP, hierarchical reasoning) and attention mechanisms (Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal consistent advantages for WERSA, which achieves the best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPS by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla attention and FlashAttention-2 fail: on ArXiv-128k's extremely long sequences, it achieves the best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that causes Out-Of-Memory errors for quadratic methods, while being twice as fast as Waveformer, its next-best competitor. By significantly reducing computational load without compromising accuracy, WERSA makes practical, affordable long-context models possible, particularly on low-resource hardware, for more sustainable and scalable AI development.
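WERSA's exact mechanism is not given in the abstract, but the random-spectral-feature trick it builds on, which avoids ever forming the n x n attention matrix, can be sketched as below (Performer-style positive features; the wavelet and content-adaptive components are omitted, and m=64 random features is an arbitrary choice):

```python
import numpy as np

def random_feature_attention(Q, K, V, m=64, seed=0):
    # Approximates softmax attention with positive random features so the
    # (n x n) attention matrix is never formed: cost is O(n*m*d), linear in n.
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W = rng.normal(size=(d, m))

    def phi(X):  # positive random features for the softmax kernel
        return np.exp(X @ W - np.sum(X ** 2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)               # associativity is what gives linearity
    den = Qf @ Kf.sum(axis=0)[:, None]  # per-query normalizer
    return num / den
```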
https://arxiv.org/abs/2507.08637
Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which analyze and extract argument semantics more efficiently than traditional methods and other deep learning models. Many benchmarks exist for testing and verifying the quality of LLMs, but research and results on how these models operate over publicly available argument classification datasets are still lacking. This paper presents a study of a selection of LLMs, using diverse datasets such as this http URL and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thought algorithm. The results indicate that ChatGPT-4o outperforms the others on the argument classification benchmarks, while among the reasoning-enhanced models, DeepSeek-R1 is superior. Despite their strong performance, GPT-4o and DeepSeek-R1 still make errors; the most common errors are discussed for all models. To our knowledge, this is the first broad analysis of the mentioned datasets using LLMs and prompting algorithms. The work also exposes some weaknesses of known prompting algorithms in argument analysis and indicates directions for their improvement. An added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
https://arxiv.org/abs/2507.08621
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit the analysis, sharing, and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach reverses waveform segments to distort speech content; this process is enhanced by a voice activity detection and speech separation pipeline, which allows more precise targeting of speech. To demonstrate the effectiveness of the proposed approach, we use a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using the Sound source Classification Accuracy-Drop (SCAD) of a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with a reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves satisfactory speech intelligibility reduction (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can enhance the algorithm's robustness against attempts to recover the clean speech, at a slight cost in audio quality.
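A minimal sketch of the core distortion step, assuming a sample-level speech mask from an upstream VAD/separation stage; the 500 ms segment length and the 0.5 speech-coverage threshold are illustrative, not the paper's settings:

```python
import numpy as np

def reverse_speech_segments(audio, sr, speech_mask, segment_ms=500):
    # audio: 1-D waveform; speech_mask: boolean per-sample flags from a VAD.
    out = audio.copy()
    seg = int(sr * segment_ms / 1000)
    for start in range(0, len(audio), seg):
        end = min(start + seg, len(audio))
        if speech_mask[start:end].mean() > 0.5:    # segment is mostly speech
            out[start:end] = out[start:end][::-1]  # reverse to break intelligibility
    return out
```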
https://arxiv.org/abs/2507.08412
Deep hashing is an effective approach for large-scale image retrieval. Current methods are typically classified by their supervision types: point-wise, pair-wise, and list-wise. Recent point-wise techniques (e.g., CSQ, MDS) have improved retrieval performance by pre-assigning a hash center to each class, enhancing the discriminability of hash codes across various datasets. However, these methods rely on data-independent algorithms to generate hash centers, which neglect the semantic relationships between classes and may degrade retrieval performance. This paper introduces the concept of semantic hash centers, building on the idea of traditional hash centers. We hypothesize that hash centers of semantically related classes should have closer Hamming distances, while those of unrelated classes should be more distant. To this end, we propose a three-stage framework, SHC, to generate hash codes that preserve semantic structure. First, we develop a classification network to identify semantic similarities between classes using a data-dependent similarity calculation that adapts to varying data distributions. Second, we introduce an optimization algorithm to generate semantic hash centers, preserving semantic relatedness while enforcing a minimum distance between centers to avoid excessively similar hash codes. Finally, a deep hashing network is trained using these semantic centers to convert images into binary hash codes. Experimental results on large-scale retrieval tasks across several public datasets show that SHC significantly improves retrieval performance. Specifically, SHC achieves average improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics, respectively, over state-of-the-art methods.
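The center-optimization objective is only described at a high level; the sketch below shows just the Hamming-distance bookkeeping behind the minimum-separation constraint, with 10 classes and 64-bit codes as illustrative sizes (real SHC centers are additionally optimized to mirror a data-dependent class-similarity matrix):

```python
import numpy as np

def min_pairwise_hamming(centers):
    # Smallest Hamming distance over all pairs of class centers; a center set
    # is acceptable only if this stays above the chosen minimum separation.
    n = len(centers)
    return min(
        np.count_nonzero(centers[i] != centers[j])
        for i in range(n) for j in range(i + 1, n)
    )

rng = np.random.default_rng(0)
centers = rng.integers(0, 2, size=(10, 64))   # 10 classes, 64-bit binary codes
print(min_pairwise_hamming(centers))
```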
https://arxiv.org/abs/2507.08404
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
https://arxiv.org/abs/2507.08380
In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%.
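The modality-weighted ensemble reduces to a weighted sum of per-modality class probabilities; a hedged sketch (the weights themselves are tuned in the paper and are not reported in the abstract):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def modality_weighted_ensemble(logits_by_modality, weights):
    # logits_by_modality: {name: (n_clips, n_classes)}; weights: {name: scalar}.
    fused = sum(w * softmax(logits_by_modality[m]) for m, w in weights.items())
    return fused.argmax(axis=1)   # predicted class per clip
```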
https://arxiv.org/abs/2507.08344
Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.
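A minimal sketch of relevance-guided channel pruning, assuming per-output-channel relevance scores have already been aggregated from an attribution method such as LRP over a validation set; the keep ratio is an illustrative parameter:

```python
import numpy as np

def prune_channels_by_relevance(weight, channel_relevance, keep_ratio=0.5):
    # weight: (out_channels, ...) conv kernel; channel_relevance: one score per
    # output channel. Retain only the most relevant channels.
    k = max(1, int(keep_ratio * len(channel_relevance)))
    keep = np.sort(np.argsort(channel_relevance)[-k:])
    return weight[keep], keep   # pruned kernel and surviving channel indices
```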
https://arxiv.org/abs/2507.08330
Missing data presents a critical challenge in real-world datasets, significantly degrading the performance of machine learning models. While Large Language Models (LLMs) have recently demonstrated remarkable capabilities in tabular data imputation, exemplified by frameworks like UnIMP, their reliance on classical embedding methods often limits their ability to capture complex, non-linear correlations, particularly in mixed-type data scenarios encompassing numerical, categorical, and textual features. This paper introduces Quantum-UnIMP, a novel framework that integrates shallow quantum circuits into an LLM-based imputation architecture. Our core innovation lies in replacing conventional classical input embeddings with quantum feature maps generated by an Instantaneous Quantum Polynomial (IQP) circuit. This approach enables the model to leverage quantum phenomena such as superposition and entanglement, thereby learning richer, more expressive representations of data and enhancing the recovery of intricate missingness patterns. Our experiments on benchmark mixed-type datasets demonstrate that Quantum-UnIMP reduces imputation error by up to 15.2% for numerical features (RMSE) and improves classification accuracy by 8.7% for categorical features (F1-Score) compared to state-of-the-art classical and LLM-based methods. These compelling results underscore the profound potential of quantum-enhanced representations for complex data imputation tasks, even with near-term quantum hardware.
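For concreteness, an IQP-style feature map can be simulated classically at small qubit counts: Hadamards, a diagonal layer of phases parameterized by the input (single-qubit terms x_j Z_j plus pairwise x_j x_k Z_j Z_k), then Hadamards again, with the measurement distribution used as the embedding. This is an illustrative stand-in, not the paper's exact circuit:

```python
import numpy as np
from functools import reduce

def iqp_feature_map(x):
    # State-vector simulation of a shallow IQP-style circuit on len(x) qubits.
    n = len(x)
    dim = 2 ** n
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    Hn = reduce(np.kron, [H] * n)
    state = Hn[:, 0]                               # H^n |0...0>
    bits = (np.arange(dim)[:, None] >> np.arange(n)) & 1
    z = 1.0 - 2.0 * bits                           # Z eigenvalues, (dim, n)
    s = z @ x                                      # sum_j x_j z_j per basis state
    phase = s + 0.5 * (s ** 2 - np.sum(x ** 2))    # adds sum_{j<k} x_j x_k z_j z_k
    state = np.exp(1j * phase) * state
    return np.abs(Hn @ state) ** 2                 # (2^n,) measurement probabilities

features = iqp_feature_map(np.array([0.3, 1.1, -0.7]))
```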
https://arxiv.org/abs/2507.08255
Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting, but found them to significantly underperform vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-competition evaluation, suggesting that further work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at this https URL.
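For the balanced-sampling ingredient, an inverse-class-frequency sampler is the usual recipe; a sketch (whether this matches the team's exact weighting is an assumption):

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    # labels: LongTensor of class ids; rarer classes get proportionally higher
    # sampling weight so each batch is roughly class-balanced.
    counts = torch.bincount(labels)
    weights = 1.0 / counts[labels].float()
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```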
https://arxiv.org/abs/2507.08248
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram Token Skip-Gram (STSG) that treats bioacoustics as a sequence modeling task. This method converts audio into discrete "spectrogram tokens" by clustering Mel-spectrograms using Faiss K-means and then learns high-quality contextual embeddings for these tokens in an unsupervised manner with a Word2Vec skip-gram model. For classification, embeddings within a 5-second window are averaged and passed to a linear model. With a projected inference time of 6 minutes for a 700-minute test set, the STSG approach achieved a final ROC-AUC public score of 0.559 and a private score of 0.520, demonstrating the viability of fast tokenization approaches with static embeddings for bioacoustic classification. Supporting code for this paper can be found at this https URL.
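The STSG pipeline maps cleanly onto off-the-shelf tools; the sketch below uses librosa, Faiss, and gensim, with the Mel dimension, codebook size, and embedding size as illustrative values rather than the paper's configuration:

```python
import numpy as np
import librosa
import faiss
from gensim.models import Word2Vec

def spectrogram_tokens(y, sr, kmeans):
    # Frame-level log-Mel features -> nearest K-means centroid id per frame.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    frames = np.log1p(mel).T.astype("float32")     # (n_frames, 64)
    _, ids = kmeans.index.search(frames, 1)
    return [str(t) for t in ids.ravel()]

# Illustrative flow: train the codebook, then skip-gram embeddings over tokens.
# kmeans = faiss.Kmeans(64, 1024, niter=25); kmeans.train(all_frames)
# w2v = Word2Vec([spectrogram_tokens(y, sr, kmeans) for y in clips],
#                vector_size=256, window=5, sg=1, min_count=1)
# A 5-second window is then classified from the mean of its token embeddings:
# emb = np.mean([w2v.wv[t] for t in window_tokens], axis=0)
```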
https://arxiv.org/abs/2507.08236
While total intracranial carotid artery calcification (ICAC) volume is an established stroke biomarker, growing evidence shows this aggregate metric ignores the critical influence of plaque location, since calcification in different segments carries distinct prognostic and procedural risks. However, a finer-grained, segment-specific quantification has remained technically infeasible. Conventional 3D models are forced to process downsampled volumes or isolated patches, sacrificing the global context required to resolve anatomical ambiguity and render reliable landmark localization. To overcome this, we reformulate the 3D challenge as a Parallel Probabilistic Landmark Localization task along the 1D axial dimension. We propose the Depth-Sequence Transformer (DST), a framework that processes full-resolution CT volumes as sequences of 2D slices, learning to predict N = 6 independent probability distributions that pinpoint key anatomical landmarks. Our DST framework demonstrates exceptional accuracy and robustness. Evaluated on a 100-patient clinical cohort with rigorous 5-fold cross-validation, it achieves a Mean Absolute Error (MAE) of 0.1 slices, with 96% of predictions falling within a ±1 slice tolerance. Furthermore, to validate its architectural power, the DST backbone establishes the best result on the public Clean-CC-CCII classification benchmark under an end-to-end evaluation protocol. Our work delivers the first practical tool for automated segment-specific ICAC analysis. The proposed framework provides a foundation for further studies on the role of location-specific biomarkers in diagnosis, prognosis, and procedural planning. Our code will be made publicly available.
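The abstract specifies six independent per-slice probability distributions; how they are decoded into slice indices is not stated, so the soft-argmax readout below is an assumption:

```python
import torch

def localize_landmarks(slice_logits):
    # slice_logits: (batch, 6, n_slices) scores over the axial dimension, one
    # row per landmark; softmax yields six independent distributions, and the
    # expected slice index gives a differentiable localization per landmark.
    probs = slice_logits.softmax(dim=-1)
    idx = torch.arange(slice_logits.shape[-1], dtype=probs.dtype)
    return (probs * idx).sum(dim=-1)   # (batch, 6) predicted slice positions
```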
https://arxiv.org/abs/2507.08214
While multiple instance learning (MIL) has been shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis: learning to restore the order of instances from their randomly shuffled arrangement. We term this task cracking an instance jigsaw puzzle, through which semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where it outperforms recent state-of-the-art MIL competitors. The code is available at this https URL.
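The self-supervised task itself is easy to materialize: shuffle a bag's instances and ask the network to recover the original order. A sketch of the example construction (the Siamese architecture and optimal-transport training objective go beyond the abstract and are omitted here):

```python
import numpy as np

def make_jigsaw_example(instance_feats, rng):
    # instance_feats: (n, d) spatially ordered patch features from one WSI.
    perm = rng.permutation(len(instance_feats))
    shuffled = instance_feats[perm]
    target = np.argsort(perm)   # position each shuffled instance must return to
    return shuffled, target
```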
https://arxiv.org/abs/2507.08178
Contrastive vision-language models like CLIP are used for a wide variety of applications, such as zero-shot classification or as vision encoders for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image shows "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights for solving CLIP's binding problem are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding using a synthetic dataset. We find that common properties of natural data, such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is "most salient" to them), have a detrimental effect on binding performance. Contrary to common belief, we find that neither scaling the batch size (i.e., implicitly adding more hard negatives) nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified properties does CLIP learn almost perfect binding.
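Two of the three data properties are easy to mimic when generating synthetic captions; the toy generator below exposes them as knobs (the paper's exact generation protocol is not specified in the abstract, so this is only an illustration):

```python
import random

def make_caption(objects, attr_density=1.0, drop_rate=0.0):
    # objects: (attribute, noun) pairs; attr_density controls how often an
    # attribute is mentioned (low density weakens binding supervision) and
    # drop_rate yields incomplete captions that omit objects entirely.
    parts = []
    for attr, noun in objects:
        if random.random() < drop_rate:   # incomplete caption: object omitted
            continue
        parts.append(f"{attr} {noun}" if random.random() < attr_density else noun)
    return "a photo of " + " and ".join(parts)

print(make_caption([("yellow", "submarine"), ("blue", "bus")]))
```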
https://arxiv.org/abs/2507.07985
Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society's most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations do those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
https://arxiv.org/abs/2507.07935
The rapid development of artificial intelligence has positioned large language models as fundamental components of intelligent legal systems. However, these models face significant limitations in legal dispute analysis, including insufficient legal knowledge representation, limited concept understanding, and reasoning deficiencies. This research proposes an enhanced framework integrating prompt engineering with multidimensional knowledge graphs. The framework introduces a three-stage hierarchical prompt structure comprising task definition, knowledge background, and reasoning guidance, supplemented by legal-specific reasoning templates and dynamic optimization mechanisms. A three-layer knowledge graph architecture is constructed with legal classification ontology, representation, and instance layers. Four complementary methods enable precise legal concept retrieval: direct legal norm code matching, domain-specific semantic vector similarity, ontology-based path reasoning, and specialized lexical segmentation. These components integrate with web search technology to establish a knowledge-enhanced framework for legal decision-making. Experimental results demonstrate significant performance improvements in legal dispute analysis, enabling accurate legal application analysis for complex cases while exhibiting nuanced understanding of judicial decision-making logic, providing a novel technical approach for implementing intelligent legal assistance systems.
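The exact wording of the hierarchical prompt is not given in the abstract; the following template is an illustrative rendering of the three stages (task definition, knowledge background, reasoning guidance), with `{retrieved_norms}` assumed to be filled from the knowledge-graph retrieval step:

```python
THREE_STAGE_PROMPT = """\
[Task Definition]
You are a legal dispute analyst. Identify the dispute type and applicable norms.

[Knowledge Background]
{retrieved_norms}

[Reasoning Guidance]
1. Characterize the legal relationship between the parties.
2. Match the facts to the elements of each candidate norm.
3. Conclude, citing the norm codes that support the conclusion.

Facts of the case:
{case_text}
"""
```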
https://arxiv.org/abs/2507.07893
The safety validation of automatic emergency braking system (AEBS) requires accurately distinguishing between false positive (FP) and true positive (TP) system activations. While simulations allow straightforward differentiation by comparing scenarios with and without interventions, analyzing activations from open-loop resimulations - such as those from field operational testing (FOT) - is more complex. This complexity arises from scenario parameter uncertainty and the influence of driver interventions in the recorded data. Human labeling is frequently used to address these challenges, relying on subjective assessments of intervention necessity or situational criticality, potentially introducing biases and limitations. This work proposes a rule-based classification approach leveraging the Prediction Divergence Principle (PDP) to address those issues. Applied to a simplified AEBS, the proposed method reveals key strengths, limitations, and system requirements for effective implementation. The findings suggest that combining this approach with human labeling may enhance the transparency and consistency of classification, thereby improving the overall validation process. While the rule set for classification derived in this work adopts a conservative approach, the paper outlines future directions for refinement and broader applicability. Finally, this work highlights the potential of such methods to complement existing practices, paving the way for more reliable and reproducible AEBS validation frameworks.
https://arxiv.org/abs/2507.07872
Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model's ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.
https://arxiv.org/abs/2507.07808