Creating responsible artificial intelligence (AI) systems is an important issue in contemporary AI research and development. One of the characteristics of responsible AI systems is their explainability. In this paper, we are interested in explainable deep learning (XDL) systems. Drawing on the creation of digital twins of physical objects, we introduce the idea of creating readable twins (in the form of imprecise information flow models) for unreadable deep learning models. The complete procedure for switching from the deep learning model (DLM) to the imprecise information flow model (IIFM) is presented. The proposed approach is illustrated with an example of a deep learning classification model for image recognition of handwritten digits from the MNIST data set.
https://arxiv.org/abs/2504.13150
Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
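PCBEAR's concept-discovery step clusters pose sequences into recurring prototypes without manual annotation. The toy sketch below illustrates that step with a minimal k-means over hand-crafted pose feature vectors; the two-dimensional features, the two groups, and the cluster count are all hypothetical stand-ins, since the paper clusters real skeleton sequences:

```python
def kmeans(points, k, iters=20):
    # deterministic init: seed centroids with the first and last point
    centroids = [points[0], points[-1]][:k]
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assign

# hypothetical pose features: (mean joint height, limb extension) per frame window
crouched = [(0.10 + 0.01 * i, 0.15) for i in range(5)]   # one "static pose concept"
extended = [(0.90 - 0.01 * i, 0.85) for i in range(5)]   # a second "static pose concept"
centroids, labels = kmeans(crouched + extended, k=2)
```

Each resulting cluster plays the role of one pose concept; dynamic pose concepts would cluster windows of several frames instead of single-frame features.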
https://arxiv.org/abs/2504.13140
Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and a 95% F1-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models, as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
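The abstract does not specify the exact form of the channel attention block, so the sketch below uses a generic squeeze-and-excitation-style gate as a plausible stand-in: global average pooling per channel, two small dense layers, and a sigmoid re-weighting. The toy feature map and identity weights are illustrative only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat, w1, b1, w2, b2):
    """feat: C x T feature map (list of channels). Squeeze-excite-style gate."""
    z = [sum(ch) / len(ch) for ch in feat]                   # squeeze: per-channel mean
    h = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)    # excitation layer 1, ReLU
         for row, b in zip(w1, b1)]
    s = [sigmoid(sum(w * v for w, v in zip(row, h)) + b)     # excitation layer 2, gate in (0, 1)
         for row, b in zip(w2, b2)]
    return [[v * g for v in ch] for ch, g in zip(feat, s)]   # re-weight channels

# toy 2-channel map: channel 0 carries a strong harmonic-like response, channel 1 is weak
feat = [[1.0, 1.0, 1.0], [0.1, 0.1, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = channel_attention(feat, identity, [0.0, 0.0], identity, [0.0, 0.0])
```

With these toy weights the stronger channel receives a larger gate (sigmoid of its pooled activation), which is the mechanism the abstract describes for enhancing harmonic structures while suppressing noise.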
https://arxiv.org/abs/2504.13102
Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. Vision-based tactile sensors (VBTSs) have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors' complex physical characteristics and visual signal processing requirements present unique challenges for robotic applications. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. Here we present Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development. By enabling large-scale simulation and experimentation with tactile sensing, Taccel accelerates the development of more capable robotic systems, potentially transforming how robots interact with and understand their physical environment.
https://arxiv.org/abs/2504.12908
This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art results on the BreakHis dataset and comparable accuracy levels elsewhere, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.
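The parameter savings from depth-wise separable convolutions are straightforward bookkeeping: a standard k x k convolution needs k·k·C_in·C_out weights, while a depth-wise filter plus a 1x1 point-wise mix needs only k·k·C_in + C_in·C_out. A quick check (bias terms omitted; the layer sizes are arbitrary examples, not AdaptoVision's actual configuration):

```python
def conv2d_params(k, c_in, c_out):
    # standard convolution: one k x k x c_in kernel per output channel (no bias)
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel; pointwise: 1x1 channel mixing
    return k * k * c_in + c_in * c_out

std = conv2d_params(3, 128, 256)               # 3*3*128*256 = 294912
sep = depthwise_separable_params(3, 128, 256)  # 9*128 + 128*256 = 33920
```

For a 3x3 layer from 128 to 256 channels this is roughly an 8.7x reduction, which is where much of the parameter-count saving in such architectures comes from.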
https://arxiv.org/abs/2504.12652
Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like the Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-based network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at this https URL.
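The core primitive behind a Fourier adapter is re-weighting features in the frequency domain. The sketch below shows only that generic primitive on a 1-D signal with a hand-rolled DFT; the hard cut-off `keep` and the test signals are hypothetical, whereas FAEWNet's adapter learns its frequency weighting and conditions it on the building distribution:

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def fourier_adapt(feat, keep):
    # crude frequency-domain re-weighting: keep only the `keep` lowest frequencies
    # (plus their conjugate mirror), zero the rest
    X = dft(feat)
    n = len(X)
    Y = [X[k] if k < keep or k > n - keep else 0 for k in range(n)]
    return idft(Y)

smooth = fourier_adapt([1.0] * 8, keep=1)          # flat signal passes unchanged
noisy = fourier_adapt([1.0, -1.0] * 4, keep=1)     # pure high-frequency signal is removed
```

A learned version would replace the binary keep/zero mask with trainable per-frequency weights.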
https://arxiv.org/abs/2504.12619
Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, which makes privacy-preserving approaches needed for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to OR event detection achieves performance on par with, and sometimes even better than, raw RGB video-based models in detecting OR events. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitating the sharing of de-identified data across institutions, and they can potentially enhance model generalizability by mitigating domain-specific appearance differences.
https://arxiv.org/abs/2504.12552
Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities vital for legal and investigative purposes, such as offenders, victims, locations, and criminal instruments. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. It is also observed that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
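The micro-averaged metrics used here pool true positives, false positives, and false negatives over all documents before computing precision, recall, and F1 (so frequent entity types dominate). A minimal sketch, with entity tuples and the single toy document invented for illustration:

```python
def micro_prf(gold, pred):
    """gold, pred: per-document sets of (entity_type, surface_form) tuples."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold entities
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [{("OFFENDER", "John Doe"), ("LOCATION", "Dallas")}]
pred = [{("OFFENDER", "John Doe"), ("WEAPON", "rifle")}]
p, r, f = micro_prf(gold, pred)  # one hit, one spurious, one miss
```

Here one correct entity, one spurious prediction, and one miss give precision, recall, and F1 of 0.5 each.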
https://arxiv.org/abs/2504.12545
Regardless of past learning, an agent in an open world will face unfamiliar situations and events outside of prior experience, existing models, or policies. Further, the agent will sometimes lack relevant knowledge and/or sufficient time to assess the situation, generate and evaluate options, and pursue a robustly considered course of action. How can an agent respond reasonably to situations that are outside of its original design scope? How can it recognize such situations sufficiently quickly and reliably to determine reasonable, adaptive courses of action? We identify key characteristics needed for solutions, evaluate the state-of-the-art by these requirements, and outline a proposed, novel approach that combines domain-general meta-knowledge (in the form of appraisals inspired by human cognition) and metareasoning. It has the potential to provide fast, adaptive responses to unfamiliar situations, more fully meeting the performance characteristics required for open-world, general agents.
https://arxiv.org/abs/2504.12497
Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level 'Bird' to mid-level 'Hummingbird' to fine-level 'Green hermit', allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues: Distinguishing 'Bird' from 'Plant' relies on global features like feathers or leaves, while separating 'Anna's hummingbird' from 'Green hermit' requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria fail to ensure consistent visual grounding at test time, resulting in incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method outperforms zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions. It also improves internal image segmentation without requiring pixel-level annotations.
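At the label level, the consistency being enforced reduces to checking that each coarser prediction lies on the fine label's taxonomy path. The sketch below shows only that check with a toy excerpt of the taxonomy from the abstract; the paper achieves the alignment through intra-image segmentation, not a lookup table:

```python
PARENT = {  # toy taxonomy excerpt (fine -> parent)
    "Green hermit": "Hummingbird",
    "Anna's hummingbird": "Hummingbird",
    "Hummingbird": "Bird",
}

def ancestors(label):
    # walk the taxonomy upward: fine -> mid -> ... -> coarse
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def is_consistent(fine, mid, coarse):
    # predictions agree with the taxonomy iff mid/coarse lie on the fine label's path
    return ancestors(fine)[1:] == [mid, coarse]

def align_to_fine(fine):
    # enforce consistency by deriving the coarser labels from the fine prediction
    fine, mid, coarse = ancestors(fine)
    return fine, mid, coarse
```

Statistical supervision can leave `is_consistent` false at test time (e.g. fine "Green hermit" but coarse "Plant"); deriving coarse labels from fine ones, as in `align_to_fine`, rules that out by construction.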
https://arxiv.org/abs/2406.11608
Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls, and due to perceptual aliasing caused by apartments with similar layouts both across and within floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building, using LiDAR data alone. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution over potential global positions is multi-modal. We evaluate our approach across five real-world datasets and show an average place-recognition accuracy of 77% within +/-2 m, while outperforming baselines by a factor of 2 in mean error.
https://arxiv.org/abs/2504.12412
We present a geometry-driven method for normalizing dysarthric speech using local Lie group transformations of spectrograms. Time, frequency, and amplitude distortions are modeled as smooth, invertible deformations, parameterized by scalar fields and applied via exponential maps. A neural network is trained to infer these fields from synthetic distortions of typical speech, without using any pathological data. At test time, the model applies an approximate inverse to real dysarthric inputs. Despite this zero-shot generalization, we observe substantial ASR gains, including up to 16 percentage points of WER reduction on challenging TORGO samples, with no degradation on clean speech. This work introduces a principled, interpretable approach to robust speech recognition under motor speech disorders.
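A one-dimensional version of the exponential-map idea can be made concrete: a scalar field defines a smooth time warp by integrating the flow dt/ds = v(t), and negating the field yields the (approximate, once discretized) inverse warp. The sinusoidal field and step counts below are illustrative, not the learned fields of the paper:

```python
import math

def exp_warp(ts, v, steps=1000):
    # exponential map: integrate dt/ds = v(t) from s = 0 to 1 with Euler steps
    out = []
    for t in ts:
        x = t
        for _ in range(steps):
            x += v(x) / steps
        out.append(x)
    return out

v = lambda t: 0.2 * math.sin(math.pi * t)        # smooth, illustrative scalar field
ts = [i / 10 for i in range(11)]
warped = exp_warp(ts, v)                          # forward distortion
recovered = exp_warp(warped, lambda t: -v(t))     # approximate inverse: negate the field
```

For an autonomous flow, running the negated field for the same duration is the exact inverse; the only residual error here comes from Euler discretization, which is the sense in which the model's learned inverse is "approximate".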
https://arxiv.org/abs/2504.12279
We propose CAL (Complete Anything in Lidar) for Lidar-based shape completion in the wild. This is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. In contrast, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies. Our project page is this https URL
https://arxiv.org/abs/2504.12264
Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance on the standard benchmarks, exceeding all previous efforts in the field of Arabic ASR. These results demonstrate the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, paving the way for improved ASR systems in low-resource settings.
https://arxiv.org/abs/2504.12254
Radar-based HAR has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as Vision Transformers (ViTs) and state space models (SSMs), offer improved modeling capabilities, and efforts have been made toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: this https URL.
https://arxiv.org/abs/2504.12039
Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., the left hand region and the right hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction, i.e., the Local Sign Graph (LSG) module, the Temporal Sign Graph (TSG) module and the Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, i.e., focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on five current public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
https://arxiv.org/abs/2504.12020
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive set of experiments on the public Drive&Act dataset, at all granularity levels, demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at this https URL.
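The abstract does not spell out the two selection criteria, so the sketch below uses two plausible hyperparameter-free stand-ins: below-average loss, and agreement between the classifier output and a cluster-refined label. The smallest-class cap stands in for the self-adaptive class balancing; all of these choices are hypothetical, not the paper's exact strategy:

```python
def select_clean(samples):
    """samples: list of (idx, label, loss, agrees); `agrees` means the classifier
    output matches the cluster-refined label (hypothetical second criterion)."""
    mean_loss = sum(s[2] for s in samples) / len(samples)
    # criterion 1: below-average loss (no tuned threshold); criterion 2: agreement
    clean = [s for s in samples if s[2] < mean_loss and s[3]]
    # balancing: cap every class at the size of the smallest surviving class
    by_cls = {}
    for s in clean:
        by_cls.setdefault(s[1], []).append(s)
    cap = min(len(v) for v in by_cls.values())
    balanced = []
    for members in by_cls.values():
        balanced.extend(sorted(members, key=lambda s: s[2])[:cap])  # keep lowest-loss
    return [s[0] for s in balanced]

samples = [(0, "A", 0.10, True), (1, "A", 0.20, True), (2, "A", 2.00, True),
           (3, "B", 0.15, True), (4, "B", 1.80, False), (5, "B", 0.30, True)]
clean_ids = select_clean(samples)
```

In the toy batch, the two high-loss samples (one of which also disagrees with its cluster) are filtered out, leaving a class-balanced clean subset.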
https://arxiv.org/abs/2504.11966
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degradation. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs. The code and data are available at: this https URL
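Pairing samples along the two key attributes can be sketched directly: same action across different performers (capturing performer variability) and same performer across different actions (isolating what is common to each action). The grouping rule below is a plausible reading of the abstract, not the paper's exact construction strategy:

```python
def build_pairs(samples):
    """samples: list of (performer_id, action_label, features)."""
    same_action, same_performer = [], []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            pi, ai, _ = samples[i]
            pj, aj, _ = samples[j]
            if ai == aj and pi != pj:
                same_action.append((i, j))      # performer variability pairs
            elif pi == pj and ai != aj:
                same_performer.append((i, j))   # action commonality pairs
    return same_action, same_performer

# toy batch: two performers, two actions (features omitted)
batch = [(1, "wave", None), (2, "wave", None), (1, "jump", None), (2, "jump", None)]
same_action, same_performer = build_pairs(batch)
```

A feature aggregation module would then consume these index pairs to mix or contrast the paired samples during training.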
https://arxiv.org/abs/2504.11749
Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework have been proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis of the ventral stream, as well as the spatial relationship analysis of the dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture the temporal and motion information inherent in videos. Furthermore, existing feature fusion strategies in no-reference video quality assessment (NR-VQA) often rely on fixed weighting schemes, which fail to adaptively adjust feature importance. To address these limitations, this paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components and integrates them into different stages of the NR-VQA pipeline.
https://arxiv.org/abs/2504.11733
Inertial measurement units (IMUs) have been widely used in a range of mobile perception applications such as activity recognition and user authentication, where a large amount of labelled data is normally required to train a satisfactory model. However, it is difficult to label micro-activities in massive IMU data due to the difficulty of interpreting raw IMU data and the lack of ground truth. In this paper, we propose a novel fine-grained user perception approach, called Saga, which needs only a small amount of labelled IMU data to achieve high user perception accuracy. The core idea of Saga is to first pre-train a backbone feature extraction model, utilizing the rich semantic information at different levels embedded in the massive unlabelled IMU data. Meanwhile, for a specific downstream user perception application, Bayesian Optimization is employed to determine the optimal weights for pre-training tasks involving different semantic levels. We implement Saga on five typical mobile phones and evaluate it on three typical tasks over three IMU datasets. Results show that when using only about 100 training samples per class, Saga can achieve over 90% of the accuracy of a full-fledged model trained on over ten thousand training samples, with no additional system overhead.
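The weight-tuning step combines the per-level pre-training losses with a learned weighting and searches for the weights that minimize downstream error. The paper uses Bayesian Optimization; since a full BO loop needs a surrogate model, the sketch below substitutes random search over the weight simplex purely to show the interface (the quadratic toy objective is also invented):

```python
import random

def multi_task_loss(task_losses, weights):
    # weighted sum of pre-training task losses, one task per semantic level
    return sum(w * l for w, l in zip(weights, task_losses))

def search_weights(evaluate, n_tasks, trials=200, seed=0):
    """Hypothetical stand-in for Bayesian Optimization: random search over the
    weight simplex for the weights that minimize the downstream error."""
    rng = random.Random(seed)
    best_w, best_err = None, float("inf")
    for _ in range(trials):
        raw = [rng.random() for _ in range(n_tasks)]
        w = [x / sum(raw) for x in raw]     # normalize onto the simplex
        err = evaluate(w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# toy downstream objective: error is the squared distance to a hidden optimum
target = [0.7, 0.2, 0.1]
best_w, best_err = search_weights(
    lambda w: sum((a - b) ** 2 for a, b in zip(w, target)), n_tasks=3)
```

A real BO implementation would replace the uniform sampler with a surrogate-guided proposal, needing far fewer evaluations of the (expensive) downstream training run.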
https://arxiv.org/abs/2504.11726