The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in only the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at this https URL.
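To make the evaluation protocol concrete, below is a minimal sketch of execution-based, version-conditioned scoring in the spirit of what the abstract describes. The `Problem` fields, the per-version virtualenv layout under `envs/`, and `run_candidate` are illustrative assumptions, not GitChameleon's actual API.

```python
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str       # completion prompt shown to the model
    library: str      # e.g. "numpy"
    version: str      # library version the solution must target
    unit_tests: str   # executable asserts against the completed code

def run_candidate(problem: Problem, completion: str, timeout: int = 30) -> bool:
    """Return True iff the model's completion passes the version-pinned unit tests."""
    program = f"{completion}\n\n{problem.unit_tests}\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    # Assumes one interpreter per (library, version) pair, e.g. a virtualenv
    # created ahead of time with `{library}=={version}` installed.
    env_python = f"envs/{problem.library}-{problem.version}/bin/python"
    try:
        result = subprocess.run([env_python, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```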
https://arxiv.org/abs/2507.12367
Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects associate different levels of classes and subclasses, they face challenges in factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. This model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with the class-subclass relation, overcoming limitations of existing HDC models such as the "superposition catastrophe" and "the problem of 2". Evaluations show that FactorHD achieves approximately 5667x speedup over existing HDC models at a representation size of 10^9. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the CIFAR-10 dataset.
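For readers unfamiliar with the HDC vocabulary here, the following is a generic sketch of binding, bundling, and similarity-based factorization with bipolar hypervectors; FactorHD's actual encoding (the extra memorization clause and its factorization algorithm) is more elaborate, so the class/subclass names below are purely illustrative.

```python
import numpy as np

D = 10_000                     # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv():               # random bipolar {-1, +1} hypervector
    return rng.choice([-1, 1], size=D)

def bind(a, b):                # elementwise product: associates two symbols
    return a * b

def bundle(*hvs):              # majority vote: superposes several symbols
    return np.sign(np.sum(hvs, axis=0))

def similarity(a, b):          # normalized dot product, used for cleanup/factorization
    return float(a @ b) / D

# Encode "class=animal, subclass=dog" as a bound pair, superpose two objects,
# then factorize by unbinding with a candidate class and checking similarity.
cls_animal, sub_dog, cls_vehicle, sub_car = (random_hv() for _ in range(4))
scene = bundle(bind(cls_animal, sub_dog), bind(cls_vehicle, sub_car))
print(similarity(bind(scene, cls_animal), sub_dog))   # high (around 0.5 for 2 objects)
print(similarity(bind(scene, cls_animal), sub_car))   # near zero
```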
https://arxiv.org/abs/2507.12366
We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
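A minimal PyTorch sketch of the two mechanisms the abstract names: a key network updated as a slow-moving average of the query network, and a loss that scatters dissimilar features while pulling same-cluster features together. The centroid-based pull term and all hyperparameters are assumptions for illustration, not CueCo's exact objective.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(key_net, query_net, m=0.999):
    # Key parameters trail the query network: theta_k <- m*theta_k + (1-m)*theta_q
    for pk, pq in zip(key_net.parameters(), query_net.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

def cueco_style_loss(q, k, centroids, assignments, temperature=0.2, w=1.0):
    q = F.normalize(q, dim=1)                    # queries: (B, D)
    k = F.normalize(k, dim=1)                    # keys from the slow encoder: (B, D)
    logits = q @ k.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    contrastive = F.cross_entropy(logits, labels)              # push dissimilar apart
    cluster_pull = F.mse_loss(q, F.normalize(centroids[assignments], dim=1))
    return contrastive + w * cluster_pull        # inter-class separation + intra-class compactness
```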
https://arxiv.org/abs/2507.12359
Weed detection is a critical component of precision agriculture, facilitating targeted herbicide application and reducing environmental impact. However, deploying accurate object detection models on resource-limited platforms remains challenging, particularly when differentiating visually similar weed species commonly encountered in plant phenotyping applications. In this work, we investigate Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) to enhance the performance of lightweight models for real-time smart spraying systems. Using YOLO11x as the teacher model and YOLO11n as both reference and student, both CWD and MGD effectively transfer knowledge from the teacher to the student model. Our experiments, conducted on a real-world dataset comprising sugar beet crops and four weed types (Cirsium, Convolvulus, Fallopia, and Echinochloa), consistently show increased AP50 across all classes. The CWD-distilled student model achieves a notable improvement of 2.5% in mAP50 over the baseline, and the MGD-distilled student 1.9%, without increasing model complexity. Additionally, we validate real-time deployment feasibility by evaluating the student YOLO11n model on Jetson Orin Nano and Raspberry Pi 5 embedded devices, performing five independent runs to evaluate performance stability across random seeds. These findings confirm CWD and MGD as effective, efficient, and practical approaches for improving deep learning-based weed detection accuracy in precision agriculture and plant phenotyping scenarios.
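As a reference point, here is a hedged sketch of a channel-wise distillation loss of the kind CWD popularized: each channel's activation map is softmax-normalized over spatial locations, and the student matches the teacher per channel via KL divergence. It assumes teacher and student feature maps already have matching shapes (in practice a 1x1 conv adapter would align YOLO11x and YOLO11n channels); the temperature is illustrative.

```python
import torch.nn.functional as F

def cwd_loss(student_feat, teacher_feat, tau=4.0):
    # Channel-wise distillation: treat each channel's H*W activations as a
    # spatial distribution and align student to teacher per channel.
    B, C, H, W = student_feat.shape
    s = F.log_softmax(student_feat.view(B, C, -1) / tau, dim=2)
    t = F.softmax(teacher_feat.view(B, C, -1) / tau, dim=2)
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2)
```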
https://arxiv.org/abs/2507.12344
This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoint estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoint estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal that provides 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations, transforming the implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results across diverse settings and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
https://arxiv.org/abs/2507.12336
This paper introduces a neural polar decoder (NPD) for deletion channels with a constant deletion rate. Existing polar decoders for deletion channels exhibit high computational complexity of $O(N^4)$, where $N$ is the block length, which limits the application of polar codes for deletion channels to short-to-moderate block lengths. In this work, we demonstrate that employing NPDs for deletion channels can reduce this computational complexity. First, we extend the architecture of the NPD to support deletion channels. Specifically, the NPD architecture consists of four neural networks (NNs), each replicating a fundamental successive cancellation (SC) decoder operation; to support deletion channels, we change the architecture of only one of them. The computational complexity of the NPD is $O(AN\log N)$, where the parameter $A$ represents a computational budget determined by the user and is independent of the channel. We evaluate the extended NPD on deletion channels with deletion rates $\delta\in\{0.01, 0.1\}$ and verify it against the ground truth given by the trellis decoder of Tal et al. We further show that, due to the reduced complexity of the NPD, we are able to incorporate list decoding and further improve performance. We believe the extended NPD presented here could have applications in future technologies like DNA storage.
https://arxiv.org/abs/2507.12329
We argue that diffusion models' success in modeling complex distributions comes, for the most part, from their input conditioning. This paper investigates the representations used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow generating samples outside the training distribution. We introduce the Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator's training distribution.
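To illustrate how a Simplicial-Embedding-style representation yields discrete tokens, here is a small sketch: the encoder output is split into groups, each softmax-normalized onto a simplex, and the per-group argmax gives one token of the DLC sequence. Group count, group size, and temperature are assumptions, not the paper's configuration.

```python
import torch

def to_discrete_latent_code(h, num_groups=256, group_size=16, tau=1.0):
    """h: (B, num_groups * group_size) encoder output."""
    B = h.size(0)
    logits = h.view(B, num_groups, group_size) / tau
    simplicial = logits.softmax(dim=-1)      # each group lies on a simplex (train-time relaxation)
    tokens = simplicial.argmax(dim=-1)       # (B, num_groups): the discrete token sequence
    return simplicial, tokens
```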
https://arxiv.org/abs/2507.12318
While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., DeepSeek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Model (LLM) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model's core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability by exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense paradigm that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline, (2) reinforcement learning-enhanced rule constraints, and (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.
https://arxiv.org/abs/2507.12314
Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there's a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets -- VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs' understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.
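The two-stage recipe can be sketched directly from the abstract's description; `llm` below is an assumed text-in/text-out callable, and the prompt wording is illustrative rather than the paper's exact templates.

```python
def codes_generate(llm, problem_statement: str) -> str:
    """Chain-of-Descriptions for code generation: describe first, then generate."""
    steps = llm(
        "List the intermediate design steps, as short descriptions, for the "
        f"following VHDL coding problem:\n{problem_statement}"
    )
    return llm(
        f"{problem_statement}\n\nIntermediate descriptive steps:\n{steps}\n\n"
        "Now write the final VHDL code."
    )

def codes_summarize(llm, vhdl_code: str) -> str:
    """Chain-of-Descriptions for summarization: explain step by step, then condense."""
    steps = llm(
        f"Describe, step by step, what the following VHDL code does:\n{vhdl_code}"
    )
    return llm(
        f"{vhdl_code}\n\nIntermediate descriptions:\n{steps}\n\n"
        "Now write a concise summary of this code."
    )
```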
https://arxiv.org/abs/2507.12308
The data privacy constraint in online continual learning (OCL), where data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach among current SOTA methods in OCL is to save exemplars or features from previous classes in memory and replay them in the current task. On the other hand, the prompt-based approach performs excellently in continual learning, but at the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policies, while the second has throughput issues with streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes four main components: (1) a single lightweight prompt generator as general knowledge, (2) a trainable scaler-and-shifter as specific knowledge, (3) pre-trained model (PTM) generalization preserving, and (4) a hard-soft updates mechanism. Our proposed method achieves significantly higher performance than the current SOTAs on the CIFAR100, ImageNet-R, ImageNet-A, and CUB datasets. Our complexity analysis shows that our method requires a relatively small number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available at this https URL.
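A hedged sketch of components (1)-(3): a frozen pre-trained backbone, a single lightweight prompt generator, and a trainable scaler-and-shifter applied to the features. How prompts actually condition the PTM and the hard-soft update rule are simplified away here; shapes and module sizes are assumptions.

```python
import torch
import torch.nn as nn

class ScaleShiftAdapter(nn.Module):
    """Trainable per-feature affine modulation: the 'specific knowledge'."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, feats):
        return feats * self.scale + self.shift

class PromptedModel(nn.Module):
    def __init__(self, backbone, dim, prompt_len=8):
        super().__init__()
        self.backbone = backbone                    # frozen PTM preserves generalization
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.prompt_gen = nn.Linear(dim, prompt_len * dim)  # single lightweight generator
        self.adapter = ScaleShiftAdapter(dim)

    def forward(self, x):
        feats = self.backbone(x)                    # (B, dim)
        prompts = self.prompt_gen(feats).view(x.size(0), -1, feats.size(-1))
        # In the actual method, prompts condition the PTM's attention layers;
        # here we pool them into the representation to keep the sketch short.
        feats = feats + prompts.mean(dim=1)
        return self.adapter(feats)
```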
https://arxiv.org/abs/2507.12305
Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating existing anomaly detection methods on text data limits rigorous comparison and the development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMA-2, LLaMA-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); and (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strong low-rank characteristics in cross-model performance matrices, which enables an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit, which includes all embeddings from different models and code, at this https URL, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
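The shallow-detector finding is easy to reproduce in outline: score anomalies directly on precomputed embeddings with KNN distance and Isolation Forest, then report AUROC/AUPRC. The sketch below assumes `X_train`/`X_test` are embedding matrices from any of the benchmarked models and that anomalies are labeled 1.

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score, average_precision_score

def knn_scores(X_train, X_test, k=5):
    """Mean distance to the k nearest training embeddings; larger = more anomalous."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_test)
    return dist.mean(axis=1)

def iforest_scores(X_train, X_test, seed=0):
    """Isolation Forest scores, sign-flipped so larger = more anomalous."""
    clf = IsolationForest(random_state=seed).fit(X_train)
    return -clf.score_samples(X_test)

def evaluate(scores, y_true):
    """y_true: 1 = anomaly, 0 = normal. Returns (AUROC, AUPRC)."""
    return roc_auc_score(y_true, scores), average_precision_score(y_true, scores)
```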
https://arxiv.org/abs/2507.12295
Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition rely on pose estimation to extract skeletal data from images, which is then fed to a classification algorithm to infer the performed skill. Despite progress in human pose estimation, these algorithms still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or on mobile devices. This work proposes a direct approach to calisthenics skill recognition that leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.
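The pose-free pipeline can be sketched as: localize the athlete, crop the patch, optionally replace it with a depth patch, then classify. The model identifiers below are plausible public checkpoints, not necessarily the ones used in the paper, and the downstream classifier is omitted.

```python
from ultralytics import YOLO
from transformers import pipeline
from PIL import Image

detector = YOLO("yolov10n.pt")  # athlete localization (assumed checkpoint)
depth = pipeline("depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Small-hf")  # assumed checkpoint

def skill_patch(image_path, use_depth=True):
    """Return the cropped athlete patch (RGB or depth) for the skill classifier."""
    img = Image.open(image_path)
    box = detector(image_path)[0].boxes.xyxy[0].tolist()   # first detected subject
    patch = img.crop(tuple(int(v) for v in box))
    if use_depth:
        patch = depth(patch)["depth"]                      # PIL depth map of the patch
    return patch  # fed to a lightweight CNN classifier downstream
```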
https://arxiv.org/abs/2507.12292
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
https://arxiv.org/abs/2507.12284
Diffusion models have demonstrated remarkable image generation capabilities, but also pose risks in privacy and fairness by memorizing sensitive concepts or perpetuating biases. We propose a novel concept erasure method for text-to-image diffusion models, designed to remove specified concepts (e.g., a private individual or a harmful stereotype) from the model's generative repertoire. Our method, termed FADE (Fair Adversarial Diffusion Erasure), combines a trajectory-aware fine-tuning strategy with an adversarial objective to ensure the concept is reliably removed while preserving overall model fidelity. Theoretically, we prove a formal guarantee that our approach minimizes the mutual information between the erased concept and the model's outputs, ensuring privacy and fairness. Empirically, we evaluate FADE on Stable Diffusion and FLUX, using benchmarks from prior work (e.g., object, celebrity, explicit content, and style erasure tasks from MACE). FADE achieves state-of-the-art concept removal performance, surpassing recent baselines like ESD, UCE, MACE, and ANT in terms of removal efficacy and image quality. Notably, FADE improves the harmonic mean of concept removal and fidelity by 5-10% over the best prior method. We also conduct an ablation study to validate each component of FADE, confirming that our adversarial and trajectory-preserving objectives each contribute to its superior performance. Our work sets a new standard for safe and fair generative modeling by unlearning specified concepts without retraining from scratch.
https://arxiv.org/abs/2507.12283
Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome are crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best-performing model, with progressive freezing, linear probing, and CutMix, achieved an AUROC of 0.78 $\pm$ 0.10, a balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), confirming that domain-specific pretraining is important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
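A sketch of progressive layer freezing with discriminative learning rates on a ResNet-50, as the abstract describes; the stage split, learning rates, and unfreezing schedule are illustrative assumptions.

```python
import torch
from torchvision.models import resnet50

model = resnet50()  # in practice: weights pretrained on adult chest radiographs
stages = [model.layer1, model.layer2, model.layer3, model.layer4]

def freeze_through(stage_idx):
    """Freeze the stem and the first `stage_idx` stages; leave the rest trainable."""
    for p in model.parameters():
        p.requires_grad_(True)
    for m in [model.conv1, model.bn1] + stages[:stage_idx]:
        for p in m.parameters():
            p.requires_grad_(False)

# Discriminative learning rates: earlier (more generic) stages get smaller steps.
optimizer = torch.optim.AdamW([
    {"params": stages[2].parameters(), "lr": 1e-5},
    {"params": stages[3].parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),  "lr": 1e-3},
])
freeze_through(2)   # early epochs: train only the deepest stages and the head
```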
https://arxiv.org/abs/2507.12269
Gaussian processes have become a popular tool for nonparametric regression because of their flexibility and uncertainty quantification. However, they often use stationary kernels, which limit the expressiveness of the model and may be unsuitable for many datasets. We propose a framework that uses nonstationary kernels whose parameters vary across the feature space, modeling these parameters as the output of a neural network that takes the features as input. The neural network and Gaussian process are trained jointly using the chain rule to calculate derivatives. Our method clearly describes the behavior of the nonstationary parameters and is compatible with approximation methods for scaling to large datasets. It is flexible and easily adapts to different nonstationary kernels without needing to redesign the optimization procedure. Our methods are implemented with the GPyTorch library and can be readily modified. We test a nonstationary variance and noise variant of our method on several machine learning datasets and find that it achieves better accuracy and log-score than both a stationary model and a hierarchical model approximated with variational inference. Similar results are observed for a model with only nonstationary variance. We also demonstrate our approach's ability to recover the nonstationary parameters of a spatial dataset.
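The core construction can be shown in a few lines: a neural network maps inputs to an input-dependent lengthscale inside a Gibbs (nonstationary RBF) kernel, so autograd trains the network and the GP jointly through the marginal likelihood. This is a bare 1D mechanism sketch, not the paper's GPyTorch implementation; the network size and jitter are assumptions.

```python
import torch
import torch.nn as nn

lengthscale_net = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Softplus()  # ell(x) > 0
)

def gibbs_kernel(x1, x2):
    """Nonstationary RBF (Gibbs) kernel with NN-parameterized lengthscales."""
    l1 = lengthscale_net(x1)                 # (N, 1)
    l2 = lengthscale_net(x2)                 # (M, 1)
    denom = l1.pow(2) + l2.pow(2).t()        # (N, M) via broadcasting
    prefactor = torch.sqrt(2 * (l1 @ l2.t()) / denom)
    sqdist = (x1 - x2.t()).pow(2)
    return prefactor * torch.exp(-sqdist / denom)

x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = torch.sin(3 * x).squeeze()
K = gibbs_kernel(x, x) + 1e-4 * torch.eye(50)                     # jitter for stability
loss = 0.5 * (y @ torch.linalg.solve(K, y) + torch.logdet(K))     # GP NLL (up to const)
loss.backward()   # gradients flow through the kernel into lengthscale_net
```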
https://arxiv.org/abs/2507.12262
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or on LLMs with instruction tuning and constrained decoding. Since these frequently suffer from limited generalizability and structural nonconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and supports both local and proprietary models, facilitating clinical data integration processes and interoperability across institutions.
https://arxiv.org/abs/2507.12261
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at this https URL.
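In outline, the two granularities can be combined at decode time as follows: token-level shallow fusion interpolates ASR and LLM log-probabilities, while a phrase-level bonus rewards hypotheses that complete a dictionary keyword. The weights and scoring interface below are assumptions, not the paper's exact formulation.

```python
def fused_score(asr_logp: float, llm_logp: float, hyp_text: str,
                keywords: set[str], lam: float = 0.3, bonus: float = 2.0) -> float:
    """Score one beam-search hypothesis extension with multi-grained fusion."""
    score = asr_logp + lam * llm_logp        # token-level late fusion
    for kw in keywords:                      # phrase-level fusion: reward a copied keyword
        if hyp_text.endswith(kw):
            score += bonus
    return score

# Usage inside beam search: rank each candidate extension by fused_score(...)
# instead of the raw ASR log-probability alone.
```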
https://arxiv.org/abs/2507.12252
Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.
https://arxiv.org/abs/2507.12248
Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools that can recognize isometric skills in a video and segment them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of temporal video segmentation of calisthenics skills. This study aims to provide an initial step towards the implementation of automated tools within the field of calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation that determines the extent of each skill. We then report the results of a baseline approach to skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
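A natural baseline of the kind the abstract reports can be sketched as per-frame classification followed by run-length grouping into temporal segments; the per-frame classifier, the `background` label, and the frame rate are assumptions.

```python
from itertools import groupby

def segments_from_frame_labels(frame_labels, fps=30.0):
    """Group consecutive identical per-frame labels into (skill, start, end, hold) segments."""
    out, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        if label != "background":            # keep only actual skill intervals
            out.append({"skill": label, "start_s": t / fps,
                        "end_s": (t + n) / fps, "hold_s": n / fps})
        t += n
    return out

# e.g. ["background", "planche", "planche", "planche", "background"] at 30 fps
# -> one planche segment with a 0.1 s hold
```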
https://arxiv.org/abs/2507.12245