The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as heralding a golden age of information retrieval and computer-assisted learning. Others worry that the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe -- as expected based on related public discourse -- changes in the prevalence of key content words related to AI and LLMs, but not necessarily in the general themes or topics discussed in the student essays, as identified through (dynamic) topic modeling.
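To make the word-prevalence analysis concrete, here is a minimal, hypothetical sketch (not the authors' pipeline) of how the relative frequency of AI-related content words could be compared across essays submitted before and after the release date; the keyword list and tokenization are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's pipeline): compare how often
# AI-related content words appear in essays submitted before vs. after a
# cutoff date. The keyword list below is a hypothetical example.
import re

KEYWORDS = {"chatgpt", "llm", "llms", "language", "model"}

def keyword_rate(essays):
    """Fraction of tokens matching the keyword list, pooled over all essays."""
    total, hits = 0, 0
    for text in essays:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        hits += sum(1 for tok in tokens if tok in KEYWORDS)
    return hits / max(total, 1)

pre_release = ["AI ethics concerns fairness, accountability and transparency."]
post_release = ["ChatGPT is a large language model that raises new concerns."]
print(keyword_rate(pre_release), keyword_rate(post_release))
```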
https://arxiv.org/abs/2504.13038
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this trade-off, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
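As a rough illustration of the orthogonal objective mentioned above, the sketch below penalizes pairwise cosine similarity between a video's prototype vectors so that they spread over diverse content; the exact loss form and weighting used in the paper may differ.

```python
# A minimal sketch, assuming the orthogonal objective penalizes pairwise
# similarity between L2-normalized prototype vectors; where the loss is
# applied and how it is weighted are not specified by the abstract.
import torch
import torch.nn.functional as F

def orthogonality_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """prototypes: (num_prototypes, dim) vectors learned for one video."""
    p = F.normalize(prototypes, dim=-1)
    gram = p @ p.t()                        # pairwise cosine similarities
    off_diag = gram - torch.eye(p.size(0))  # ignore self-similarity
    return (off_diag ** 2).mean()           # push prototypes toward orthogonality

loss = orthogonality_loss(torch.randn(8, 256, requires_grad=True))
loss.backward()
```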
https://arxiv.org/abs/2504.13035
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application to line art colorization remains limited, facing challenges in handling extensive reference images, time-consuming inference, and flexible control. We investigate how much extensive contextual image guidance matters for the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and a Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our code and models on our project page: this https URL.
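The key-value caching idea can be sketched as follows: the long reference context is encoded once, and its keys and values are reused whenever line-art tokens attend to it. This is a simplified single-head illustration under assumed shapes, not the actual Causal Sparse DiT attention pattern.

```python
# A hedged sketch of key-value caching for long reference contexts: encode
# the reference images once, then reuse the cached K/V when attending from
# line-art tokens. Shapes and projections are illustrative assumptions.
import torch
import torch.nn.functional as F

d = 64
Wk, Wv, Wq = (torch.randn(d, d) for _ in range(3))

ref_tokens = torch.randn(200 * 16, d)                 # tokens from ~200 reference images
K_cache, V_cache = ref_tokens @ Wk, ref_tokens @ Wv   # computed once, reused

def colorize_step(lineart_tokens: torch.Tensor) -> torch.Tensor:
    q = lineart_tokens @ Wq
    attn = F.softmax(q @ K_cache.t() / d ** 0.5, dim=-1)
    return attn @ V_cache                              # reference-conditioned features

out = colorize_step(torch.randn(1024, d))
```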
https://arxiv.org/abs/2504.12240
Detection of spatial areas where biodiversity is at risk is of paramount importance for the conservation and monitoring of ecosystems. Large terrestrial mammalian herbivores are keystone species as their activity not only has deep effects on soils, plants, and animals, but also shapes landscapes, as large herbivores act as allogenic ecosystem engineers. One key landscape feature that indicates intense herbivore activity and potentially impacts biodiversity is the formation of grazing trails. Grazing trails are formed by the continuous trampling activity of large herbivores, which can produce complex networks of tracks of bare soil. Here, we evaluated different machine learning algorithms to identify grazing trails. Our goal is to automatically detect potential areas with intense herbivory activity, which might be beneficial for conservation and management plans. We applied five semantic segmentation methods combined with fourteen encoders aimed at mapping grazing trails on aerial images. Our results indicate that in most cases the chosen methodology successfully mapped the trails, although there were a few instances where the actual trail structure was underestimated. The UNet architecture with the MambaOut encoder performed best for mapping trails. The proposed approach could be applied to develop tools for mapping and monitoring temporal changes in these landscape structures to support habitat conservation and land management programs. To the best of our knowledge, this is the first time that competitive image segmentation results have been obtained for the detection and delineation of trails of large herbivorous mammals.
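A minimal training sketch for this kind of binary trail segmentation is shown below, using the segmentation_models_pytorch library with a standard ResNet encoder as a stand-in; the paper's best configuration (UNet with a MambaOut encoder) and its actual data pipeline are not reproduced here.

```python
# A hedged sketch of a UNet-style binary segmentation setup for trail masks.
# The encoder, loss, and data below are illustrative assumptions, not the
# paper's configuration.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(encoder_name="resnet34", encoder_weights=None,
                 in_channels=3, classes=1)
criterion = smp.losses.DiceLoss(mode="binary")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 256, 256)                      # aerial image tiles
masks = torch.randint(0, 2, (2, 1, 256, 256)).float()     # trail / background

logits = model(images)
loss = criterion(logits, masks)
loss.backward()
optimizer.step()
```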
https://arxiv.org/abs/2504.12121
Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy-preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures such as Vision Transformers (ViT) and state space models (SSMs) offer improved modeling capabilities, and efforts have been made toward lightweight designs; however, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on the DIAT dataset with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on the CI4R dataset with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on the UoG2020 dataset, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: this https URL.
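For context, radar micro-Doppler inputs of the kind RadMamba consumes are typically time-frequency maps obtained from a short-time Fourier transform of the complex radar return; the toy sketch below illustrates only this preprocessing step, with an assumed pulse repetition frequency and a synthetic signal, not the model itself.

```python
# Illustrative preprocessing only (not the RadMamba model): a micro-Doppler
# spectrogram is commonly obtained via an STFT of the complex radar return;
# the classifier then operates on such time-frequency maps.
import numpy as np
from scipy.signal import stft

fs = 1000                                           # assumed pulse repetition frequency (Hz)
t = np.arange(0, 2, 1 / fs)
iq = np.exp(1j * 2 * np.pi * 50 * np.cos(2 * np.pi * 1.5 * t) * t)  # toy return

f, frames, Z = stft(iq, fs=fs, nperseg=128, return_onesided=False)
micro_doppler = 20 * np.log10(np.abs(np.fft.fftshift(Z, axes=0)) + 1e-6)
print(micro_doppler.shape)                          # (freq bins, time frames)
```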
https://arxiv.org/abs/2504.12039
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. Comprehensive experiments on the public Drive&Act dataset, covering all granularity levels, demonstrate the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at this https URL.
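A hedged sketch of the per-cluster co-refinement step is given below: classifier probabilities are smoothed toward the mean prediction of the cluster a sample belongs to. The smoothing coefficient, and the paper's hyperparameter-free selection criteria and class balancing, are not reproduced here.

```python
# A hedged sketch of co-refinement within clusters: smooth each sample's
# predicted distribution toward its cluster's mean prediction. The mixing
# coefficient alpha is an illustrative assumption.
import torch

def co_refine(probs: torch.Tensor, cluster_ids: torch.Tensor, alpha: float = 0.5):
    """probs: (N, C) softmax outputs; cluster_ids: (N,) cluster assignments."""
    refined = probs.clone()
    for c in cluster_ids.unique():
        idx = (cluster_ids == c).nonzero(as_tuple=True)[0]
        cluster_mean = probs[idx].mean(dim=0, keepdim=True)
        refined[idx] = alpha * probs[idx] + (1 - alpha) * cluster_mean
    return refined

refined = co_refine(torch.rand(16, 5).softmax(-1), torch.randint(0, 3, (16,)))
```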
https://arxiv.org/abs/2504.11966
Inertial measurement units (IMUs) have been widely used in a range of mobile perception applications such as activity recognition and user authentication, where a large amount of labelled data is normally required to train a satisfactory model. However, it is difficult to label micro-activities in massive IMU data due to the difficulty of interpreting raw IMU data and the lack of ground truth. In this paper, we propose a novel fine-grained user perception approach, called Saga, which needs only a small amount of labelled IMU data to achieve high user perception accuracy. The core idea of Saga is to first pre-train a backbone feature extraction model, utilizing the rich semantic information at different levels embedded in the massive unlabelled IMU data. Meanwhile, for a specific downstream user perception application, Bayesian Optimization is employed to determine the optimal weights for the pre-training tasks involving different semantic levels. We implement Saga on five typical mobile phones and evaluate it on three typical tasks over three IMU datasets. Results show that when using only about 100 training samples per class, Saga can achieve over 90% of the accuracy of a full-fledged model trained on over ten thousand training samples, with no additional system overhead.
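The weighting of pre-training tasks could be tuned as sketched below with Bayesian Optimization via scikit-optimize; `pretrain_and_evaluate` is a hypothetical stand-in for pre-training the backbone with the given task weights and reporting validation error on the small labelled set.

```python
# A minimal sketch of tuning per-task pre-training weights with Bayesian
# Optimization (scikit-optimize). The objective below is a toy placeholder;
# in practice it would pre-train with the weighted task losses, fine-tune,
# and return the downstream validation error.
from skopt import gp_minimize

def pretrain_and_evaluate(weights):
    w1, w2, w3 = weights                    # weights for three semantic levels (assumed)
    return (w1 - 0.5) ** 2 + (w2 - 0.3) ** 2 + (w3 - 0.2) ** 2  # toy objective

result = gp_minimize(
    pretrain_and_evaluate,
    dimensions=[(0.0, 1.0)] * 3,            # search range for each task weight
    n_calls=20,
    random_state=0,
)
print(result.x)                             # best-found task weights
```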
https://arxiv.org/abs/2504.11726
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.
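Finding (1) suggests a timestep-dependent training objective. The sketch below shows one simple way to weight the denoising loss more heavily at earlier (noisier) timesteps; the specific weighting function used in the paper is an assumption here.

```python
# A hedged sketch of a timestep-weighted denoising objective: noisier (larger t)
# steps receive larger weight, reflecting the finding that early denoising steps
# contribute disproportionately to perception quality. The linear weight is an
# illustrative assumption.
import torch

def weighted_denoising_loss(pred, target, t, T=1000):
    """pred/target: (B, ...) predictions and ground truth; t: (B,) timesteps."""
    per_sample = ((pred - target) ** 2).flatten(1).mean(dim=1)
    weight = t.float() / T                  # assumed: larger t (more noise) -> larger weight
    return (weight * per_sample).mean()

loss = weighted_denoising_loss(torch.randn(4, 3, 32, 32),
                               torch.randn(4, 3, 32, 32),
                               torch.randint(0, 1000, (4,)))
```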
https://arxiv.org/abs/2504.11457
This study introduces a novel method for revealing human covert attention patterns using gameplay data alone, utilizing offline attention techniques from reinforcement learning (RL). We propose the contextualized, task-relevant (CTR) attention network, which generates attention maps from both human and RL agent gameplay in Atari environments. These maps are sparse yet retain the information necessary for the current player's decision making. We compare the CTR-derived attention maps with a temporally integrated overt attention (TIOA) model based on eye-tracking data, serving as a point of comparison and discussion. Visual inspection reveals distinct attention patterns: human CTR maps focus on the player and nearby opponents, occasionally shifting between stronger focus and broader views -- sometimes even attending to empty space ahead. In contrast, agent maps maintain a consistent broad focus on most objects, including distant ones and the player. Quantitative analysis further demonstrates that human CTR maps align more closely with TIOA than agent maps do. Our findings indicate that the CTR attention network can effectively reveal human covert attention patterns from gameplay alone, without the need for additional data such as brain activity recordings. This work contributes to understanding human-agent attention differences and enables the development of RL agents augmented with human covert attention.
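As an illustration of the quantitative comparison, one simple alignment measure between a CTR map and a TIOA map is the Pearson correlation of their flattened values, as sketched below; the paper's exact alignment metric may differ.

```python
# Illustrative only: Pearson correlation of flattened attention maps as one
# simple way to quantify agreement between two maps of the same frame.
import numpy as np

def map_similarity(map_a: np.ndarray, map_b: np.ndarray) -> float:
    a, b = map_a.ravel(), map_b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

human_ctr = np.random.rand(84, 84)   # placeholder maps
tioa = np.random.rand(84, 84)
print(map_similarity(human_ctr, tioa))
```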
https://arxiv.org/abs/2504.11118
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The goal is to find solutions that are both effective and efficient at capturing the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments to cover distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the recent state-of-the-art method GMMFormer on TVR.
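The span-anchor mechanism can be sketched as follows: each learnable (center, width) pair induces a soft temporal mask over the frame sequence, and masked pooling yields one moment representation per anchor. This Gaussian-mask, average-pooling version is an illustrative stand-in for the paper's masked multi-moment attention.

```python
# A hedged sketch of span anchors: each (center, width) pair in [0, 1] defines
# a soft temporal mask; pooling frame features under each mask gives one moment
# representation per anchor. The Gaussian mask shape is an assumption.
import torch

def moments_from_anchors(frames: torch.Tensor, anchors: torch.Tensor):
    """frames: (T, D) frame features; anchors: (M, 2) with (center, width) in [0, 1]."""
    T = frames.size(0)
    pos = torch.linspace(0, 1, T).unsqueeze(0)                        # (1, T)
    center, width = anchors[:, :1], anchors[:, 1:]                    # (M, 1) each
    mask = torch.exp(-((pos - center) ** 2) / (2 * width ** 2 + 1e-6))  # (M, T)
    mask = mask / mask.sum(dim=1, keepdim=True)
    return mask @ frames                                               # (M, D) moment features

moments = moments_from_anchors(torch.randn(128, 256), torch.rand(4, 2))
```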
https://arxiv.org/abs/2504.10920
Explanations for computer vision models are important tools for interpreting how the underlying models work. However, they are often presented in static formats, which pose challenges for users, including information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. We investigate interactivity as a mechanism for tackling these issues in three common explanation types: heatmap-based, concept-based, and prototype-based explanations. We conducted a study (N=24), using a bird identification task, involving participants with diverse technical and domain expertise. We found that while interactivity enhances user control, facilitates rapid convergence to relevant information, and allows users to expand their understanding of the model and explanation, it also introduces new challenges. To address these, we provide design recommendations for interactive computer vision explanations, including carefully selected default views, independent input controls, and constrained output spaces.
https://arxiv.org/abs/2504.10745
Controllable scene generation could substantially reduce the cost of diverse data collection for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full-sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios because open datasets consist largely of safe, ordinary driving behaviors. To overcome these issues, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-ins, sudden braking, and collisions. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
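A minimal sketch of the partial noise-masking idea, under assumptions about the noise schedule, is shown below: each token carries its own diffusion timestep, and tokens flagged as known context (e.g., observed frames or goal states) are kept clean.

```python
# A hedged sketch of partial noise masking with independent noise states:
# each token gets its own timestep, and tokens marked as known context stay
# noise-free. The linear noise schedule is an illustrative assumption.
import torch

def partially_noised(tokens, timesteps, keep_mask, T=1000):
    """tokens: (N, D); timesteps: (N,); keep_mask: (N,) True = keep clean."""
    alpha = 1.0 - timesteps.float().unsqueeze(1) / T        # toy noise schedule
    noised = alpha.sqrt() * tokens + (1 - alpha).sqrt() * torch.randn_like(tokens)
    return torch.where(keep_mask.unsqueeze(1), tokens, noised)

x = partially_noised(torch.randn(6, 32),
                     torch.randint(0, 1000, (6,)),
                     torch.tensor([True, False, False, False, False, True]))
```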
https://arxiv.org/abs/2504.10485
Background: Lower limb exoskeletons can enhance quality of life, but widespread adoption is limited by the lack of frameworks to assess their biomechanical and human-robot interaction (HRI) effects, which are essential for developing adaptive and personalized control strategies. Understanding impacts on kinematics, muscle activity, and HRI dynamics is key to achieving improved usability of wearable robots. Objectives: We propose a systematic methodology to evaluate an ankle exoskeleton's effects on human movement during walking and load-carrying (10 kg front pack), focusing on joint kinematics, muscle activity, and HRI torque signals. Materials and Methods: Using Xsens MVN (inertial motion capture), Delsys EMG, and a unilateral exoskeleton, three experiments were conducted: (1) isolated dorsiflexion/plantarflexion; (2) gait analysis (two subjects, passive/active modes); and (3) load-carrying under assistance. Results and Conclusions: The first experiment confirmed that the HRI sensor captured both voluntary and involuntary torques, providing directional torque insights. The second experiment showed that the device slightly restricted ankle range of motion (RoM) but supported normal gait patterns across all assistance modes. The exoskeleton reduced muscle activity, particularly in active mode. HRI torque varied according to gait phases and highlighted reduced synchronization, suggesting a need for improved support. The third experiment revealed that load-carrying increased GM and TA muscle activity, but the device partially mitigated user effort by reducing muscle activity compared to unassisted walking. HRI torque increased during load-carrying, providing insights into user-device dynamics. These results demonstrate the importance of tailoring exoskeleton evaluation methods to specific devices and users, while offering a framework for future studies on exoskeleton biomechanics and HRI.
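For readers unfamiliar with the muscle-activity comparison, a common (illustrative) metric is a moving-RMS envelope of the EMG signal, sketched below; the paper's actual filtering and normalization pipeline is not reproduced.

```python
# Illustrative only: a moving-RMS envelope is one common way to compare EMG
# amplitude across conditions (e.g., assisted vs. unassisted walking). Window
# length and sampling rate below are assumptions.
import numpy as np

def emg_rms_envelope(emg: np.ndarray, fs: float, window_s: float = 0.1):
    win = max(int(window_s * fs), 1)
    padded = np.pad(emg ** 2, (win // 2, win - win // 2 - 1), mode="edge")
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(padded, kernel, mode="valid"))

envelope = emg_rms_envelope(np.random.randn(2000), fs=2000.0)
print(envelope.shape)   # same length as the input signal
```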
https://arxiv.org/abs/2504.10294
Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) method by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer, for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances the feasibility of privacy-preserving deployment on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate significant performance enhancement, with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ($\epsilon=0.2$), compared to 63.85% for standard DP-SGD. This work lays the foundation for privacy-preserving human pose estimation in real-world, sensitive applications.
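The projection step of PDP-SGD can be sketched as follows: per-sample gradients are clipped, aggregated with Gaussian noise as in DP-SGD, and then projected onto a low-dimensional subspace. The random orthonormal basis below is a stand-in for however the subspace is actually estimated.

```python
# A hedged sketch of projected DP-SGD: clip per-sample gradients, add noise to
# the aggregate, then project the noisy update onto a low-dimensional subspace.
# The fixed random basis here is an illustrative stand-in.
import torch

def pdp_sgd_step(per_sample_grads, basis, clip=1.0, sigma=1.0):
    """per_sample_grads: (B, D); basis: (D, k) with orthonormal columns."""
    norms = per_sample_grads.norm(dim=1, keepdim=True).clamp(min=1e-12)
    clipped = per_sample_grads * (clip / norms).clamp(max=1.0)       # per-sample clipping
    noisy = clipped.sum(0) + sigma * clip * torch.randn(clipped.size(1))
    projected = basis @ (basis.t() @ noisy)                          # keep low-dim component
    return projected / per_sample_grads.size(0)

basis, _ = torch.linalg.qr(torch.randn(512, 16))
update = pdp_sgd_step(torch.randn(8, 512), basis)
```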
https://arxiv.org/abs/2504.10190
Neuro-developmental disorders manifest as dysfunctions in cognition, communication, behaviour, and adaptability, and deep learning-based computer-aided diagnosis (CAD) can alleviate the increasing strain on healthcare resources for neuroimaging. However, neuroimaging such as fMRI contains complex spatio-temporal features, which makes the corresponding representations susceptible to a variety of distractions and thus less effective for CAD. For the first time, we present a Comorbidity-Informed Transfer Learning (CITL) framework for diagnosing neuro-developmental disorders using fMRI. In CITL, a new reinforced representation generation network is proposed, which first combines transfer learning with pseudo-labelling to remove interfering patterns from the temporal domain of fMRI and then generates new representations using an encoder-decoder architecture. The new representations are then used to train an architecturally simple classification network to obtain the CAD model. In particular, the framework fully considers the comorbidity mechanisms of neuro-developmental disorders and effectively integrates them with semi-supervised learning and transfer learning, providing new perspectives for interdisciplinary research. Experimental results demonstrate that CITL achieves competitive accuracies of 76.32% and 73.15% for detecting autism spectrum disorder and attention deficit hyperactivity disorder, respectively, outperforming existing related transfer learning work by 7.2% and 0.5%, respectively.
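One common way to realize the pseudo-labelling component is a confidence threshold, as in the hedged sketch below; the paper's specific criterion for filtering interfering temporal patterns is not reproduced.

```python
# A hedged sketch of confidence-based pseudo-labelling, one standard
# semi-supervised ingredient; the threshold value is an assumption.
import torch

def pseudo_label(logits: torch.Tensor, threshold: float = 0.9):
    """Keep only predictions whose maximum class probability exceeds the threshold."""
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return labels[keep], keep

labels, keep = pseudo_label(torch.randn(32, 2))
```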
https://arxiv.org/abs/2504.09463
In modern society, Attention-Deficit/Hyperactivity Disorder (ADHD) is one of the common mental disorders found not only in children but also in adults. In this context, we propose an ADHD diagnosis transformer model that can effectively and simultaneously find important spatiotemporal brain biomarkers from resting-state functional magnetic resonance imaging (rs-fMRI). This model not only learns spatiotemporal individual features but also learns their correlations with full attention structures specialized for ADHD diagnosis. In particular, it focuses on learning local blood-oxygenation-level-dependent (BOLD) signals and distinguishing important regions of interest (ROIs) in the brain. Specifically, the three proposed components of the ADHD diagnosis transformer are as follows. First, we design a CNN-based embedding block to obtain more expressive embedding features for brain-region attention; it is reconstructed from previous CNN-based ADHD diagnosis models for use in the transformer. Next, for individual spatiotemporal feature attention, we change the attention method to local temporal attention and ROI-rank-based masking. For the temporal features of fMRI, local temporal attention learns local BOLD signal features with only simple window masking. For the spatial features of fMRI, ROI-rank-based masking distinguishes highly correlated ROI relationships based on attention scores, thereby providing a more specific biomarker for ADHD diagnosis. Experiments were conducted with various types of transformer models. To evaluate them, we collected data from 939 individuals across all sites provided by the ADHD-200 competition. The spatiotemporally enhanced transformer for ADHD diagnosis outperforms the other transformer variants (accuracy 77.78%, specificity 76.60%, sensitivity 79.22%, AUC 79.30%).
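Local temporal attention with simple window masking can be sketched as an additive attention mask that only allows each time step to attend to its neighbours, as below; the window size and the accompanying ROI-rank-based masking are design choices of the paper not shown here.

```python
# A minimal sketch of local temporal attention via window masking: positions
# outside a fixed window receive -inf in the additive attention mask. The
# sequence length and window size below are illustrative assumptions.
import torch

def local_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    keep = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[keep] = 0.0                                   # allowed positions contribute normally
    return mask

mask = local_window_mask(seq_len=176, window=8)
scores = torch.randn(176, 176) + mask                  # masked attention logits
attn = scores.softmax(dim=-1)
```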
https://arxiv.org/abs/2504.11474
Detection of sound events representing intestinal activity is a diagnostic tool with the potential to identify gastrointestinal conditions. This article introduces BowelRCNN, a novel bowel sound detection system that uses audio recording, spectrogram analysis, and a region-based convolutional neural network (RCNN) architecture. The system was trained and validated on a real recording dataset gathered from 19 patients, comprising 60 minutes of prepared and annotated audio data. BowelRCNN achieved a classification accuracy of 96% and an F1 score of 71%. This research highlights the feasibility of using CNN architectures for bowel sound auscultation, achieving results comparable to those of recurrent-convolutional methods.
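For context, the spectrogram front end can be sketched as below (with an assumed sampling rate and synthetic audio); the RCNN stage that proposes time-frequency regions containing bowel sounds is not shown.

```python
# Illustrative preprocessing only (not the BowelRCNN detector): the audio is
# converted to a log-spectrogram, which serves as the input "image" for the
# region-based detection stage.
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                      # assumed sampling rate (Hz)
audio = np.random.randn(fs * 10)               # 10 s of toy abdominal audio
f, t, Sxx = spectrogram(audio, fs=fs, nperseg=256, noverlap=128)
log_spec = 10 * np.log10(Sxx + 1e-10)          # input for the RCNN stage
print(log_spec.shape)                          # (freq bins, time frames)
```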
https://arxiv.org/abs/2504.08659
Animal-robot interaction (ARI) remains a largely unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing the perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at this https URL, inviting further exploration and development in this critical area.
https://arxiv.org/abs/2504.08646
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the same visual decision-making task. In comparison to other large-scale models, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS's learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
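A hedged sketch of the multi-task-masking strategy is given below: each training step samples one masking scheme and the model reconstructs whatever is hidden. The mask ratios and the exact semantics of the within- and cross-modality schemes are assumptions for illustration.

```python
# A hedged sketch of multi-task masking that alternates between neural,
# behavioral, within-modality, and cross-modality masking. True = token is
# hidden and must be reconstructed. Ratios and scheme details are assumptions.
import random
import torch

def sample_masks(n_neural: int, n_behavior: int, ratio: float = 0.3):
    scheme = random.choice(["neural", "behavioral", "within", "cross"])
    m_neu = torch.zeros(n_neural, dtype=torch.bool)
    m_beh = torch.zeros(n_behavior, dtype=torch.bool)
    if scheme == "neural":          # hide neural tokens -> predict activity from behavior (encoding)
        m_neu[:] = True
    elif scheme == "behavioral":    # hide behavioral tokens -> predict behavior from activity (decoding)
        m_beh[:] = True
    elif scheme == "within":        # hide random tokens inside each modality
        m_neu = torch.rand(n_neural) < ratio
        m_beh = torch.rand(n_behavior) < ratio
    else:                           # "cross": hide tokens in one modality, reconstruct from the other
        m_neu = torch.rand(n_neural) < ratio
    return scheme, m_neu, m_beh

print(sample_masks(n_neural=200, n_behavior=40)[0])
```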
https://arxiv.org/abs/2504.08201
Facial expressiveness plays a crucial role in a robot's ability to engage and interact with children. Prior research has shown that expressive robots can enhance child engagement during human-robot interactions. However, many robots used in therapy settings feature non-personalized, static faces designed with traditional facial-feature considerations, which can limit the depth of interactions and emotional connections. Digital faces offer opportunities for personalization, yet the current landscape of robot face design lacks a dynamic, user-centered approach. Specifically, there is a significant research gap in designing robot faces based on child preferences; instead, most robots in child-focused therapy spaces are developed from an adult-centric perspective. We present a novel study investigating the influence of child-drawn digital faces in child-robot interactions. This approach centers on a design activity in which children are instructed to draw their own custom robot faces. We compare the perceived social intelligence (PSI) of two implementations: a generic digital face and a robot face personalized using the child's own drawn robot face. The results of this study show that the perceived social intelligence of a child-drawn robot face was significantly higher than that of a generic face.
https://arxiv.org/abs/2504.08117