Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees' infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.
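To make the pest situation analysis concrete, here is a minimal sketch of the three spatial metrics, assuming FID-Net has already produced (x, y) centroids for infected and healthy trees; the radius, eps, and min_samples values are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
infected = rng.uniform(0, 100, size=(60, 2))   # placeholder detections (metres)
healthy = rng.uniform(0, 100, size=(140, 2))

# (1) Kernel Density Estimation over infected trees -> infection hotspot map
kde = gaussian_kde(infected.T)
hotspot_density = kde(healthy.T)               # density evaluated at healthy trees

# (2) Neighborhood risk: infected trees within a radius of each healthy tree
RADIUS = 10.0                                  # assumed analysis parameter
dists = np.linalg.norm(healthy[:, None, :] - infected[None, :, :], axis=-1)
risk = (dists < RADIUS).sum(axis=1)

# (3) DBSCAN on healthy trees: dense healthy clusters = priority protection zones
labels = DBSCAN(eps=8.0, min_samples=5).fit_predict(healthy)
zones = [healthy[labels == k] for k in set(labels) if k != -1]
print(len(zones), "candidate priority protection zones")
```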
https://arxiv.org/abs/2512.13104
Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models. However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experimental evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.
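A rough sketch of how the two decoupled branches could be combined during fine-tuning, written as a negative-guidance-style target; the composition, scale factors, and masking below are assumptions for illustration, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def bi_erasing_target(eps_uncond, eps_neg, eps_pos, mask, s_neg=1.0, s_pos=1.0):
    # steer away from the harmful (negative-branch) direction and toward the
    # safe (positive-branch) guidance; the mask confines edits to relevant regions
    target = eps_uncond - s_neg * (eps_neg - eps_uncond) + s_pos * (eps_pos - eps_uncond)
    return mask * target + (1 - mask) * eps_uncond

# toy shapes only; in practice these are frozen-model noise predictions
eps_theta, eps_u, eps_n, eps_p = (torch.randn(1, 4, 8, 8) for _ in range(4))
mask = (torch.rand(1, 1, 8, 8) > 0.5).float()
loss = F.mse_loss(eps_theta, bi_erasing_target(eps_u, eps_n, eps_p, mask))
```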
https://arxiv.org/abs/2512.13039
Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure-aware feature encoding, and three-dimensional wavelet/inverse-wavelet transforms. Myocardial priors and residual-attention-based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet/inverse-wavelet-based downsampling and upsampling enable joint spatial-frequency modeling and preserve multi-scale structural consistency, while a multi-scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping-patch-based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, Sensitivity of 0.7946, Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.
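As one concrete piece of the training setup, a sketch of the 3D overlapping-patch extraction is shown below; the patch and stride sizes are assumptions, not the paper's values:

```python
import numpy as np

def extract_patches_3d(volume, patch=(64, 64, 64), stride=(32, 32, 32)):
    # slide a 3D window with overlap (stride < patch) and stack the crops
    D, H, W = volume.shape
    out = []
    for z in range(0, D - patch[0] + 1, stride[0]):
        for y in range(0, H - patch[1] + 1, stride[1]):
            for x in range(0, W - patch[2] + 1, stride[2]):
                out.append(volume[z:z+patch[0], y:y+patch[1], x:x+patch[2]])
    return np.stack(out)

vol = np.zeros((128, 128, 128), dtype=np.float32)   # toy CCTA volume
print(extract_patches_3d(vol).shape)                # (27, 64, 64, 64)
```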
https://arxiv.org/abs/2512.12539
Olympic Taekwondo has faced challenges in spectator engagement due to static, defensive gameplay and contentious scoring. Current Protector and Scoring Systems (PSS) rely on impact sensors and simplistic logic, encouraging safe strategies that diminish the sport's dynamism. This paper proposes an AI-powered scoring system that integrates existing PSS sensors with additional accelerometers, gyroscopes, magnetic/RFID, and impact force sensors in a sensor fusion framework. The system classifies kicks in real-time to identify technique type, contact location, impact force, and even the part of the foot used. A machine learning pipeline employing sensor fusion and Support Vector Machines (SVMs) is detailed, enabling automatic kick technique recognition for scoring. We present a novel kick scoring rubric that awards points based on specific kick techniques (e.g., turning and spinning kicks) to incentivize dynamic attacks. Drawing on a 2024 study achieving 96-98% accuracy, we validate the feasibility of real-time kick classification and further propose enhancements to this methodology, such as ensemble SVM classifiers and expanded datasets, to achieve the high-stakes accuracy required by the sport. We analyze how the proposed system can improve scoring fairness, reduce rule exploitation and illegitimate tactics, encourage more dynamic techniques, and enhance spectator understanding and excitement. The paper includes system design illustrations, a kick scoring table from an AI-augmented rule set, and discusses anticipated impacts on Olympic Taekwondo.
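A minimal sketch of the sensor-fusion + SVM stage described above; the feature layout and class set are assumptions standing in for real fused PSS/IMU/force features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))    # fused features: accel/gyro stats, peak force, RFID zone
y = rng.integers(0, 4, size=300)  # kick classes, e.g. cut / roundhouse / turning / spinning

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X[:240], y[:240])
print("held-out accuracy:", clf.score(X[240:], y[240:]))
```

An ensemble variant, as the paper proposes, could wrap several such SVMs (e.g. via sklearn's VotingClassifier) trained on different sensor subsets.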
https://arxiv.org/abs/2512.12474
Graph Spectral Clustering (GSC) methods allow representing clusters of diverse shapes, densities, etc. However, the results of such algorithms, when applied, e.g., to text documents, are hard to explain to the user, especially because the embedding in the spectral space has no obvious relation to document contents. Furthermore, the presence of documents without clear content meaning and the stochastic nature of the clustering algorithms deteriorate explainability. This paper proposes an enhancement to the explanation methodology proposed in earlier research by our team. Taking inspiration from rough set theory, it allows us to overcome the latter problems.
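One way to read the rough-set idea against the stochasticity problem, as a hedged sketch (this is an illustrative interpretation, not the paper's algorithm): documents that co-cluster in every run of the stochastic algorithm form a lower approximation of a cluster, while documents that co-cluster in at least one run form its upper approximation:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(4, 1, (30, 5))])  # toy documents

runs = [SpectralClustering(n_clusters=2, random_state=s).fit_predict(X)
        for s in range(5)]                       # repeated stochastic clusterings
co = sum((r[:, None] == r[None, :]).astype(int) for r in runs)
always_together = co == len(runs)                # lower-approximation relation
sometimes_together = co > 0                      # upper-approximation relation
print(always_together.sum(), sometimes_together.sum())
```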
https://arxiv.org/abs/2512.12436
Knowledge Graphs (KGs), thanks to their concise and efficient triple-based structure, have been widely applied in intelligent question answering, recommender systems, and other domains. However, the heterogeneous and multifaceted nature of real-world data inevitably renders the distribution of relations long-tailed, making it crucial to complete missing facts with limited samples. Previous studies are mainly based on metric matching or meta-learning, yet they either fail to fully exploit neighborhood information in the graph or overlook the distributional characteristics of contrastive signals. In this paper, we re-examine the problem from the perspective of generative representation and propose a few-shot knowledge graph completion framework that integrates a two-stage attention triple enhancer with a U-KAN-based diffusion model. Extensive experiments on two public datasets show that our method achieves new state-of-the-art results.
https://arxiv.org/abs/2512.12182
A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.
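A minimal sketch of the verification stage: each candidate sequence is embedded in shape-latent space and the one closest to the target is kept; `p2s_encode` is a stand-in callable for the paper's P2S model:

```python
import torch

def verify(candidates, target_latent, p2s_encode):
    # embed every candidate parametric sequence and pick the nearest to the target
    dists = torch.stack([torch.norm(p2s_encode(c) - target_latent) for c in candidates])
    return candidates[int(dists.argmin())]

# toy usage: the encoder stand-in maps a sequence (list of floats) to a 2-vector
p2s_encode = lambda seq: torch.tensor([sum(seq), len(seq)], dtype=torch.float32)
print(verify([[0.1, 0.2], [1.0, 2.0]], torch.tensor([0.3, 2.0]), p2s_encode))
```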
https://arxiv.org/abs/2512.11480
Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: this https URL.
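An illustrative stand-in for cross-channel interaction within same-layer features (a channel shuffle followed by a pointwise mixing convolution); the paper's actual horizontal-vertical integration mechanism is not reproduced here:

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)  # cross-channel mixing

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        # shuffle channels across groups so the 1x1 conv sees reorganized channels
        x = x.view(b, self.groups, c // self.groups, h, w).transpose(1, 2)
        x = x.reshape(b, c, h, w)
        return self.mix(x)

y = ChannelInteraction(16)(torch.randn(2, 16, 32, 32))
print(y.shape)
```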
https://arxiv.org/abs/2512.11369
Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large Language Models (LLMs), which primarily examine errors in the written program. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After the prompts are determined, instances of G-code and HMI output that either contain errors or are error-free are used as few-shot examples to guide the VLM. The model was then evaluated against a zero-shot VLM across multiple scenarios of incorrect G-code and HMI errors, measured by per-slot accuracy. Few-shot prompting yielded an overall improvement in detecting HMI errors and discrepancies with the G-code, enabling more comprehensive debugging. The proposed framework is therefore well suited to verifying the manually generated G-code typically developed in CNC training.
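A sketch of how the few-shot request might be assembled; the JSON field names and the chat/message format below are assumptions, not the paper's exact schema or API:

```python
import json

SCHEMA = {"gcode_errors": "list of strings",
          "hmi_alarms": "list of strings",
          "safe_to_run": "boolean"}

def build_messages(examples, query_gcode, query_screenshot_b64):
    msgs = [{"role": "system",
             "content": "Check the G-code and the HMI screenshot. "
                        f"Reply with JSON matching: {json.dumps(SCHEMA)}"}]
    for gcode, image_b64, verdict in examples:        # few-shot (input, answer) pairs
        msgs.append({"role": "user", "content": [
            {"type": "text", "text": gcode},
            {"type": "image", "data": image_b64}]})
        msgs.append({"role": "assistant", "content": json.dumps(verdict)})
    msgs.append({"role": "user", "content": [        # the case to verify
        {"type": "text", "text": query_gcode},
        {"type": "image", "data": query_screenshot_b64}]})
    return msgs
```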
https://arxiv.org/abs/2512.11296
Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell painting: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell painting channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell painting video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.
https://arxiv.org/abs/2512.11928
Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbances in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is exploited in parallel, rather than a conventional single-critic architecture, to stabilize Q-value estimation for both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, which outperforms state-of-the-art methods in terms of overall performance, stability, and efficiency.
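A compact sketch of the variance-derived aggregation with a time-decaying uncertainty weight; the linear decay schedule and the lower-bound form are assumptions about the TDU mechanism:

```python
import torch

def aggregate_q(q_values, step, total_steps, beta0=1.0):
    # q_values: (K, batch) estimates from the K-critic ensemble
    mean, std = q_values.mean(dim=0), q_values.std(dim=0)
    beta = beta0 * (1.0 - step / total_steps)   # uncertainty weight decays over training
    return mean - beta * std                    # penalize epistemically uncertain estimates

q = torch.randn(5, 32)                          # K=5 critics, batch of 32
target_q = aggregate_q(q, step=1000, total_steps=100000)
```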
https://arxiv.org/abs/2512.10492
The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
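For intuition, a common form of MIL objective under video-level labels is top-k pooling of frame scores into a video score; the pooling choice and k are assumptions, not MultiHateLoc's exact modality-aware objective:

```python
import torch
import torch.nn.functional as F

def mil_loss(frame_logits, video_labels, k=8):
    # frame_logits: (batch, T); video_labels: (batch,) in {0, 1}
    video_logit = frame_logits.topk(k, dim=1).values.mean(dim=1)
    return F.binary_cross_entropy_with_logits(video_logit, video_labels.float())

logits = torch.randn(4, 32)              # 4 videos, 32 frames each
labels = torch.tensor([1, 0, 1, 0])
print(mil_loss(logits, labels).item())
```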
https://arxiv.org/abs/2512.10408
Speech enhancement (SE) aims to recover clean speech from noisy recordings. Although generative approaches such as score matching and the Schrödinger bridge have shown strong effectiveness, they are often computationally expensive. Flow matching offers a more efficient alternative by directly learning a velocity field that maps noise to data. In this work, we present a systematic study of flow matching for SE under three training objectives: velocity prediction, $x_1$ prediction, and preconditioned $x_1$ prediction. We analyze their impact on training dynamics and overall performance. Moreover, by introducing perceptual (PESQ) and signal-based (SI-SDR) objectives, we further enhance convergence efficiency and speech quality, yielding substantial improvements across evaluation metrics.
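The objectives compared above can be summarized in a few lines for a linear interpolation path x_t = (1 - t) x0 + t x1, with x0 the noise and x1 the clean speech (a standard flow-matching setup, sketched here for orientation):

```python
import torch

def flow_matching_targets(x0, x1, t):
    # t broadcastable to x0/x1, values in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0        # velocity-prediction target
    x1_target = x1            # x1-prediction target; recoverable velocity:
                              #   v = (x1_hat - x_t) / (1 - t)
    return x_t, v_target, x1_target

x0, x1 = torch.randn(2, 1, 16000), torch.randn(2, 1, 16000)
t = torch.rand(2, 1, 1)
x_t, v_tgt, x1_tgt = flow_matching_targets(x0, x1, t)
```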
https://arxiv.org/abs/2512.10382
Radiance field representations have recently been explored in the latent space of the VAEs commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from the input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state of the art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.
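A minimal sketch of the 2D multi-view idea: latent tokens of a rendered novel view attend to tokens from the input views to pull back fine details; the dimensions and token counts are placeholders, not Splatent's architecture:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
render_tokens = torch.randn(1, 256, 64)      # latent tokens of a rendered 3DGS view
input_tokens = torch.randn(1, 3 * 256, 64)   # tokens gathered from three input views
refined, _ = attn(render_tokens, input_tokens, input_tokens)  # cross-view attention
print(refined.shape)
```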
https://arxiv.org/abs/2512.09923
Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled-data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
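As a point of reference, channel-wise attention of the squeeze-and-excitation kind (sketched below for 3D volumes) is one way to realize the MEM's strengthening of modality-specific semantic cues; the paper's exact design may differ:

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))              # global average pool over the volume
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                           # re-weight modality-specific channels

out = ChannelAttention3D(16)(torch.randn(2, 16, 8, 32, 32))
print(out.shape)
```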
https://arxiv.org/abs/2512.09801
This study presents a lightweight dual-domain super-resolution network (DDSRNet) that combines Spatial-Net with the discrete wavelet transform (DWT). Specifically, our proposed model comprises three main components: (1) a shallow feature extraction module, termed Spatial-Net, which performs residual learning and bilinear interpolation; (2) a low-frequency enhancement branch based on the DWT that refines coarse image structures; and (3) a shared high-frequency refinement branch that simultaneously enhances the LH (horizontal), HL (vertical), and HH (diagonal) wavelet subbands using a single CNN with shared weights. As a result, the DWT enables subband decomposition, while the inverse DWT reconstructs the final high-resolution output. By doing so, the integration of spatial- and frequency-domain learning enables DDSRNet to achieve highly competitive performance with low computational cost on three hyperspectral image datasets, demonstrating its effectiveness for hyperspectral image super-resolution.
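The subband path can be sketched with pywt: decompose, run the same function over LH, HL, and HH (one CNN with shared weights in the real model), and invert; the identity `refine` below is only a placeholder for that CNN:

```python
import numpy as np
import pywt

def refine(subband):
    return subband * 1.0      # placeholder for the shared-weight refinement CNN

img = np.random.rand(64, 64).astype(np.float32)
LL, (LH, HL, HH) = pywt.dwt2(img, "haar")            # subband decomposition
LH, HL, HH = refine(LH), refine(HL), refine(HH)      # same function = shared weights
out = pywt.idwt2((LL, (LH, HL, HH)), "haar")         # reconstruct the output
print(out.shape)
```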
https://arxiv.org/abs/2512.09546
This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions. Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
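A compact sketch of the aggregation and classification stages described above: attention scores over token embeddings from a pretrained LM, a weighted text-level vector, then a linear head trained with cross-entropy; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AttnPoolClassifier(nn.Module):
    def __init__(self, hidden=768, n_classes=4):
        super().__init__()
        self.score = nn.Linear(hidden, 1)      # per-token attention score
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_embs):             # (B, T, H) from a pretrained LM
        w = torch.softmax(self.score(token_embs), dim=1)
        doc = (w * token_embs).sum(dim=1)      # weighted text-level vector
        return self.head(doc)                  # logits; optimize with cross-entropy

logits = AttnPoolClassifier()(torch.randn(2, 128, 768))
print(logits.shape)
```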
https://arxiv.org/abs/2512.09444
This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.
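The retrieval and weighted-fusion steps reduce to a few lines in their simplest form; the fixed gate alpha and mean-pooled retrieval below are simplifying assumptions (the paper's fusion weights would be learned):

```python
import torch
import torch.nn.functional as F

def retrieve(query, kb_embs, k=3):
    # similarity-based retrieval over an embedded external knowledge base
    sims = F.cosine_similarity(query.unsqueeze(0), kb_embs, dim=1)
    return kb_embs[sims.topk(k).indices].mean(dim=0)

def fuse(h_internal, h_retrieved, alpha=0.6):
    # weighted fusion of internal representation and retrieved knowledge
    return alpha * h_internal + (1 - alpha) * h_retrieved

kb = torch.randn(100, 256)                 # embedded knowledge-base entries
h_int = torch.randn(256)
h = fuse(h_int, retrieve(h_int, kb))
```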
https://arxiv.org/abs/2512.09440
Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on the ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
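The layout step pairs LLM-proposed placements with a depth-first search that backtracks on collisions; a toy 2D version with axis-aligned boxes and a fixed candidate grid (both simplifications) looks like this:

```python
def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place(objects, placed=None, positions=((0, 0), (2, 0), (0, 2), (2, 2))):
    placed = placed if placed is not None else []
    if not objects:
        return placed
    w, h = objects[0]
    for x, y in positions:
        box = (x, y, w, h)
        if not any(overlaps(box, p) for p in placed):
            result = place(objects[1:], placed + [box], positions)
            if result is not None:
                return result          # depth-first success
    return None                        # dead end: backtrack

print(place([(2, 2), (2, 2), (1, 1)]))
```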
https://arxiv.org/abs/2512.09364
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question "who speaks what". By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to 0.99 with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
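For intuition, the frequency-domain separation can be sketched as clustering spectral summaries of short windows, so windows dominated by differently seated speakers land in different clusters; the windowing and k-means choice are assumptions, not the paper's noise-robust method:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_windows(signal, win=256, n_speakers=2):
    hops = range(0, len(signal) - win, win // 2)
    feats = np.stack([np.abs(np.fft.rfft(signal[i:i + win])) for i in hops])
    feats /= feats.sum(axis=1, keepdims=True) + 1e-9   # normalized magnitude spectra
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(feats)

sig = np.sin(2 * np.pi * 50 * np.arange(4000) / 1000)  # toy vibration trace
print(cluster_windows(sig)[:10])
```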
https://arxiv.org/abs/2512.09285