Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode mental health with state-of-the-art biomarker precision. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings collected from 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, identifies individuals living with mental illness, and achieves near-ceiling accuracy when predicting group-level mental health. By extracting non-verbal signatures of psychological function that have so far remained untapped, MAILA represents a key step toward foundation models for mental health. The ability to decode mental states at zero marginal cost creates new opportunities in neuroscience, medicine, and public health, while raising urgent questions about privacy, agency, and autonomy online.
https://arxiv.org/abs/2511.20179
As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided into two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at this https URL.
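The MIL aggregation step can be illustrated with a toy softmax-attention pool that turns per-tile scores into one slide-level score. This is a simplified stand-in, not IMILIA's model: real MIL modules learn the attention weights from tile embeddings, and all names here are hypothetical.

```python
import math

def mil_attention_pool(tile_scores):
    """Softmax-attention pooling: the slide-level inflammation score is a
    weighted average of per-tile scores, with weights given by a softmax
    over the scores themselves (a simplified stand-in for a learned MIL
    attention module)."""
    mx = max(tile_scores)  # subtract max for numerical stability
    ws = [math.exp(s - mx) for s in tile_scores]
    z = sum(ws)
    weights = [w / z for w in ws]
    slide_score = sum(w * s for w, s in zip(weights, tile_scores))
    return slide_score, weights
```

High-scoring tiles dominate the pooled score, which is what makes the attention weights usable for the interpretability analysis described above.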
https://arxiv.org/abs/2512.13440
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Test-time scaling with Warm-K enhances behavioral consistency and reactivity without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
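The "group relative advantage" at the core of GRBO resembles normalizing each rollout's reward against the statistics of its sampling group; a minimal sketch under that assumption, not the paper's implementation (the human-regularization term is omitted):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale each rollout's reward by its group's mean and
    standard deviation, so rollouts are ranked relative to their peers
    rather than by absolute reward (a GRPO-style group-relative advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

By construction the advantages sum to zero within a group, so the policy update pushes probability toward the better-than-average rollouts only.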
https://arxiv.org/abs/2512.13262
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing work under-utilizes the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
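A reward that balances asking with acting, in the spirit of the principled formulation described above, might look like the following toy shaping; the coefficients and terms are hypothetical, not SpeakRL's actual reward:

```python
def proactivity_reward(task_completed, clarification_turns, ambiguity_resolved,
                       turn_cost=0.1, clarify_bonus=0.3):
    """Toy reward shaping for conversational proactivity: a completion
    reward, a small per-turn cost that discourages gratuitous questions,
    and a bonus when a clarification actually resolved an ambiguity.
    All coefficients are illustrative."""
    r = 1.0 if task_completed else 0.0
    r -= turn_cost * clarification_turns
    if ambiguity_resolved:
        r += clarify_bonus
    return r
```

Under this shaping, a necessary clarification pays for itself (bonus exceeds the turn cost), while asking without need strictly lowers the return, which is the asking-vs-acting trade-off the analysis targets.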
https://arxiv.org/abs/2512.13159
Real-time decoding of target variables from multiple simultaneously recorded neural time-series modalities, such as discrete spiking activity and continuous field potentials, is important across various neuroscience applications. However, a major challenge for doing so is that different neural modalities can have different timescales (i.e., sampling rates) and different probabilistic distributions, or can even be missing at some time-steps. Existing nonlinear models of multimodal neural activity do not address different timescales or missing samples across modalities. Further, some of these models do not allow for real-time decoding. Here, we develop a learning framework that can enable real-time recursive decoding while nonlinearly aggregating information across multiple modalities with different timescales and distributions and with missing samples. This framework consists of 1) a multiscale encoder that nonlinearly aggregates information after learning within-modality dynamics to handle different timescales and missing samples in real time, 2) a multiscale dynamical backbone that extracts multimodal temporal dynamics and enables real-time recursive decoding, and 3) modality-specific decoders to account for different probabilistic distributions across modalities. In both simulations and three distinct multiscale brain datasets, we show that our model can aggregate information across modalities with different timescales and distributions and missing samples to improve real-time target decoding. Further, our method outperforms various linear and nonlinear multimodal benchmarks in doing so.
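The different-timescales and missing-samples problem at a single time step can be illustrated with a toy fusion rule that simply skips absent modalities; the actual framework learns this aggregation nonlinearly with per-modality encoders, so this is only a stand-in:

```python
def fuse_modalities(obs_streams):
    """Aggregate per-time-step modality features while tolerating different
    sampling rates: a modality that has no sample at a step contributes
    None, and the fused estimate averages whatever is present (a linear
    stand-in for the learned nonlinear multiscale encoder)."""
    fused = []
    for step in zip(*obs_streams):
        present = [x for x in step if x is not None]
        fused.append(sum(present) / len(present) if present else None)
    return fused
```

A slower modality (e.g., field potentials sampled less often than spikes) is simply represented by None at the steps where it has no sample, and decoding proceeds from whatever is available.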
https://arxiv.org/abs/2512.12462
Local field potentials (LFPs) can be routinely recorded alongside spiking activity in intracortical neural experiments, measure a larger complementary spatiotemporal scale of brain activity for scientific inquiry, and can offer practical advantages over spikes, including greater long-term stability, robustness to electrode degradation, and lower power requirements. Despite these advantages, recent neural modeling frameworks have largely focused on spiking activity since LFP signals pose inherent modeling challenges due to their aggregate, population-level nature, often leading to lower predictive power for downstream task variables such as motor behavior. To address this challenge, we introduce a cross-modal knowledge distillation framework that transfers high-fidelity representational knowledge from pretrained multi-session spike transformer models to LFP transformer models. Specifically, we first train a teacher spike model across multiple recording sessions using a masked autoencoding objective with a session-specific neural tokenization strategy. We then align the latent representations of the student LFP model to those of the teacher spike model. Our results show that the Distilled LFP models consistently outperform single- and multi-session LFP baselines in both fully unsupervised and supervised settings, and can generalize to other sessions without additional distillation while maintaining superior performance. These findings demonstrate that cross-modal knowledge distillation is a powerful and scalable approach for leveraging high-performing spike models to develop more accurate LFP models.
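The latent-alignment objective can be sketched as a mean-squared error between the student's and the frozen teacher's time-aligned latents, with one gradient step shown for illustration. The real models are multi-session transformers; the plain vectors and step size here are assumptions.

```python
import numpy as np

def latent_alignment_loss(student, teacher):
    """MSE between time-aligned student (LFP) latents and frozen teacher
    (spike) latents: the cross-modal distillation objective in miniature."""
    s, t = np.asarray(student, float), np.asarray(teacher, float)
    return float(np.mean((s - t) ** 2))

def alignment_step(student, teacher, lr=0.1):
    """One explicit gradient-descent step on the MSE, moving the student
    latents toward the teacher's (in practice the student network's
    weights, not the latents, are updated)."""
    s, t = np.asarray(student, float), np.asarray(teacher, float)
    return s - lr * 2.0 * (s - t) / s.size
```

Each step provably lowers the alignment loss, which is the mechanism by which the student inherits the teacher's representational structure.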
https://arxiv.org/abs/2512.12461
Many aerial tasks involving quadrotors demand both instant reactivity and long-horizon planning. High-fidelity models enable accurate control but are too slow for long horizons; low-fidelity planners scale but degrade closed-loop performance. We present Unique, a unified MPC that cascades models of different fidelity within a single optimization: a short-horizon, high-fidelity model for accurate control, and a long-horizon, low-fidelity model for planning. We align costs across horizons, derive feasibility-preserving thrust and body-rate constraints for the point-mass model, and introduce transition constraints that match the different states, thrust-induced acceleration, and jerk-body-rate relations. To prevent local minima emerging from nonsmooth clutter, we propose a 3D progressive smoothing schedule that morphs norm-based obstacles along the horizon. In addition, we deploy parallel randomly initialized MPC solvers to discover lower-cost local minima on the long, low-fidelity horizon. In simulation and real flights, under equal computational budgets, Unique improves closed-loop position or velocity tracking by up to 75% compared with standard MPC and hierarchical planner-tracker baselines. Ablations and Pareto analyses confirm robust gains across horizon variations, constraint approximations, and smoothing schedules.
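The progressive-smoothing idea, morphing a non-smooth obstacle constraint along the horizon, can be illustrated with a softplus-smoothed violation whose sharpness depends on the stage. The softplus choice and the linear sharpness schedule are illustrative, not the paper's exact formulation.

```python
import math

def smoothed_violation(pos, center, radius, stage_fraction, beta_max=20.0):
    """Softplus-smoothed obstacle-violation measure. Near stages
    (stage_fraction ~ 0) use a sharp, near-exact hinge max(0, -d); far
    stages (stage_fraction ~ 1) use a heavily smoothed surrogate, which
    removes the nonsmooth kinks that create local minima on the long
    low-fidelity horizon."""
    d = math.dist(pos, center) - radius               # signed distance margin
    beta = 1.0 + (1.0 - stage_fraction) * beta_max    # sharpness schedule
    return math.log1p(math.exp(-beta * d)) / beta     # -> max(0, -d) as beta grows
```

Outside the obstacle the smooth far-stage surrogate still "feels" the obstacle from a distance (nonzero gradient), while the near-stage hinge enforces the constraint almost exactly.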
https://arxiv.org/abs/2512.12427
Intelligent systems are widely assumed to improve through learning, coordination, and optimization. However, across domains -- from artificial intelligence to economic institutions and biological evolution -- increasing intelligence often precipitates paradoxical degradation: systems become rigid, lose adaptability, and fail unexpectedly. We identify entropy collapse as a universal dynamical failure mode arising when feedback amplification outpaces bounded novelty regeneration. Under minimal domain-agnostic assumptions, we show that intelligent systems undergo a sharp transition from high-entropy adaptive regimes to low-entropy collapsed regimes. Collapse is formalized as convergence toward a stable low-entropy manifold, not a zero-entropy state, implying a contraction of effective adaptive dimensionality rather than loss of activity or scale. We analytically establish critical thresholds, dynamical irreversibility, and attractor structure and demonstrate universality across update mechanisms through minimal simulations. This framework unifies diverse phenomena -- model collapse in AI, institutional sclerosis in economics, and genetic bottlenecks in evolution -- as manifestations of the same underlying process. By reframing collapse as a structural cost of intelligence, our results clarify why late-stage interventions systematically fail and motivate entropy-aware design principles for sustaining long-term adaptability in intelligent systems. Keywords: entropy collapse; intelligent systems; feedback amplification; phase transitions; effective dimensionality; complex systems; model collapse; institutional sclerosis
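A "minimal simulation" of the claimed dynamics fits in a few lines: exponentiating a distribution models feedback amplification, uniform mixing models bounded novelty regeneration, and Shannon entropy settles onto a low but nonzero plateau. Parameters and update rule are illustrative, not the paper's exact model.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def simulate_entropy_collapse(steps=200, k=50, amplify=1.5, novelty=0.01):
    """Toy dynamics: each step sharpens the distribution (p -> p**amplify,
    feedback amplification) and mixes in a small uniform component
    (bounded novelty regeneration). When amplification outpaces novelty,
    entropy collapses to a low, nonzero plateau rather than to zero."""
    p = [1.0 / k + i * 1e-6 for i in range(k)]  # near-uniform, tiny asymmetry
    z = sum(p)
    p = [x / z for x in p]
    traj = [shannon_entropy(p)]
    for _ in range(steps):
        q = [x ** amplify for x in p]
        z = sum(q)
        p = [(1.0 - novelty) * x / z + novelty / k for x in q]
        traj.append(shannon_entropy(p))
    return traj
```

The final state keeps a small probability floor on every component (the stable low-entropy manifold), so activity never vanishes even though effective diversity has contracted.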
https://arxiv.org/abs/2512.12381
Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment.
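Adapting pretrained positional embedding vectors to a non-standard patch grid is commonly done by resampling them over the 2D grid; the sketch below shows that generic ViT trick (bilinear resampling via per-axis linear interpolation), not the exact ISA-ViT procedure.

```python
import numpy as np

def resize_positional_embeddings(pev, old_grid, new_grid):
    """Resample pretrained positional embeddings (old_h*old_w, dim) onto a
    new patch grid, returning (new_h*new_w, dim). Bilinear resampling is
    done channel-wise: interpolate along rows, then along columns."""
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pev.shape[1]
    grid = pev.reshape(old_h, old_w, dim)
    ys = np.linspace(0, old_h - 1, new_h)
    xs = np.linspace(0, old_w - 1, new_w)
    rows = np.stack([
        np.stack([np.interp(ys, np.arange(old_h), grid[:, c, d])
                  for d in range(dim)], axis=-1)
        for c in range(old_w)
    ], axis=1)                                   # (new_h, old_w, dim)
    out = np.stack([
        np.stack([np.interp(xs, np.arange(old_w), rows[r, :, d])
                  for d in range(dim)], axis=-1)
        for r in range(new_h)
    ], axis=0)                                   # (new_h, new_w, dim)
    return out.reshape(new_h * new_w, dim)
```

This preserves the spatial layout encoded in the PEVs instead of discarding it, which is the limitation of naive resizing that the abstract points to.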
https://arxiv.org/abs/2512.12206
Intracranial recordings have opened a unique opportunity to simultaneously measure activity across multiregional networks in the human brain. Recent works have focused on developing transformer-based neurofoundation models of such recordings that can generalize across subjects and datasets. However, these recordings exhibit highly complex spatiotemporal interactions across diverse spatial scales, from the single-channel scale to the scale of brain regions. As such, there remain critical open questions regarding how best to encode spatial information and how to design self-supervision tasks that enable the learning of brain network patterns and enhance downstream decoding performance using such high-dimensional, multiregional recordings. To allow for exploring these questions, we propose a new spatiotemporal transformer model of multiregional neural activity and a corresponding self-supervised masked latent reconstruction task, designed to enable flexibility in the spatial scale used for token encoding and masking. Applying this model on publicly available multiregional intracranial electrophysiology (iEEG) data, we demonstrate that adjusting the spatial scale for both token encoding and masked reconstruction significantly impacts downstream decoding. Further, we find that spatial encoding at larger scales than channel-level encoding, which is commonly used in existing iEEG transformer models, improves downstream decoding performance. Finally, we demonstrate that our method allows for region-level token encoding while also maintaining accurate channel-level neural reconstruction. Taken together, our modeling framework enables exploration of the spatial scales used for token encoding and masking, reveals their importance towards self-supervised pretraining of neurofoundation models of multiregional human brain activity, and enhances downstream decoding performance.
https://arxiv.org/abs/2512.12135
Human activity recognition (HAR) requires extracting accurate spatial-temporal features from human movements. A mmWave radar point-cloud-based HAR system suffers from sparsity and variable-size problems due to the physical characteristics of the mmWave signal. Existing works usually borrow preprocessing algorithms from vision-based systems with dense point clouds, which may not be optimal for mmWave radar systems. In this work, we propose a graph representation with a discrete dynamic graph neural network (DDGNN) to explore the spatial-temporal representation of human movement-related features. Specifically, we designed a star graph to describe the high-dimensional relative relationship between a manually added static center point and the dynamic mmWave radar points in the same and consecutive frames. We then adopted DDGNN to learn the features residing in star graphs of variable sizes. Experimental results demonstrated that our approach outperformed other baseline methods on real-world HAR datasets. Our system achieved an overall classification accuracy of 94.27%, approaching the near-optimal 97.25% accuracy obtained with vision-based skeleton data. We also conducted an inference test on Raspberry Pi 4 to demonstrate its effectiveness on resource-constrained platforms. We provide a comprehensive ablation study over DDGNN structural variants to validate our model design. Our system also outperformed three recent radar-specific methods without requiring resampling or frame aggregators.
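The star-graph construction can be sketched directly from the description above: node 0 is the static center, every radar point becomes a node linked to it, and the relative offset serves as the edge feature. The exact feature set in the paper is richer; this is a hypothetical minimal encoding.

```python
def build_star_graph(frames):
    """Build a star graph over variable-size radar frames: a static center
    node (index 0) connects to every dynamic mmWave point across the given
    consecutive frames, storing the point's offset from the center as the
    edge feature. Handles any number of points per frame."""
    center = (0.0, 0.0, 0.0)
    nodes = [center]
    edges = []       # (center_index, point_index) pairs
    edge_feats = []  # relative position of each point w.r.t. the center
    for frame in frames:
        for (x, y, z) in frame:
            idx = len(nodes)
            nodes.append((x, y, z))
            edges.append((0, idx))
            edge_feats.append((x - center[0], y - center[1], z - center[2]))
    return nodes, edges, edge_feats
```

Because every point attaches only to the center, the graph's size tracks the (variable) number of radar returns without any resampling, which is what lets the downstream GNN consume frames of different sizes.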
https://arxiv.org/abs/2512.12013
The rise of generative and autonomous agents marks a fundamental shift in computing, demanding a rethinking of how humans collaborate with probabilistic, partially autonomous systems. We present the Human-AI-Experience (HAX) framework, a comprehensive, three-phase approach that establishes design foundations for trustworthy, transparent, and collaborative agentic interaction. HAX integrates behavioral heuristics, a schema-driven SDK enforcing structured and safe outputs, and a behavioral proxy concept that orchestrates agent activity to reduce cognitive load. A validated catalog of mixed-initiative design patterns further enables intent preview, iterative alignment, trust repair, and multi-agent narrative coherence. Grounded in Time, Interaction, and Performance (TIP) theory, HAX reframes multi-agent systems as colleagues, offering the first end-to-end framework that bridges trust theory, interface design, and infrastructure for the emerging Internet of Agents.
https://arxiv.org/abs/2512.11979
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.
https://arxiv.org/abs/2511.12306
The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (this https URL).
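The kinetic-mining step reduces to ranking source captions by embedding similarity to an HAR label. In the paper the embeddings come from CLIP's text encoder; in this sketch they are placeholder vectors, so only the ranking logic is shown.

```python
import numpy as np

def mine_source_samples(label_emb, caption_embs, top_k=2):
    """Rank T2M source captions by cosine similarity to a sparse HAR label
    embedding and return the indices (and similarities) of the top matches,
    the soft-supervision signal used to guide fine-tuning."""
    label = label_emb / np.linalg.norm(label_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ label
    order = np.argsort(-sims)          # descending similarity
    return order[:top_k], sims[order[:top_k]]
```

The mined correspondences then steer which source motions inform the few-shot fine-tuning of the T2M backbone.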
https://arxiv.org/abs/2512.11654
Energy demand prediction is critical for grid operators, industrial energy consumers, and service providers. Energy demand is influenced by multiple factors, including weather conditions (e.g. temperature, humidity, wind speed, solar radiation), and calendar information (e.g. hour of day and month of year), which further affect daily work and life schedules. These factors are causally interdependent, making the problem more complex than simple correlation-based learning techniques can satisfactorily capture. We propose a structural causal model that explains the causal relationship between these variables. A full analysis is performed to validate our causal beliefs, also revealing important insights consistent with prior studies. For example, our causal model reveals that energy demand responds to temperature fluctuations with season-dependent sensitivity. Additionally, we find that energy demand exhibits lower variance in winter due to the decoupling effect between temperature changes and daily activity patterns. We then build a Bayesian model, which takes advantage of the causal insights we learned as prior knowledge. The model is trained and tested on unseen data and yields state-of-the-art performance in the form of a 3.84 percent MAPE on the test set. The model also demonstrates strong robustness, as the cross-validation across two years of data yields an average MAPE of 3.88 percent.
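The reported figures use the mean absolute percentage error; for reference, MAPE over a held-out series is computed as:

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent: the average of
    |actual - predicted| / |actual| over all points, scaled by 100.
    Requires nonzero actual values."""
    assert len(actual) == len(predicted) and all(a != 0 for a in actual)
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)
```

A 3.84 percent MAPE therefore means predictions deviate from observed demand by under 4 percent on average, relative to the demand level at each point.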
https://arxiv.org/abs/2512.11653
Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.
https://arxiv.org/abs/2512.11061
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at this https URL.
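The causal-attention treatment of asynchronous vision and force tokens can be sketched as a timestamp-based mask: a token may attend to any token whose timestamp is not later than its own, regardless of modality. This is a simplification of the structural slow-fast mechanism, shown only to make the interleaving concrete.

```python
def asynchronous_causal_mask(timestamps):
    """Build a causal attention mask over interleaved tokens from streams
    with different rates (e.g., slow vision frames and fast force samples):
    entry [i][j] is True iff token i may attend to token j, i.e. iff
    token j's timestamp does not exceed token i's."""
    n = len(timestamps)
    return [[timestamps[j] <= timestamps[i] for j in range(n)]
            for i in range(n)]
```

Because the mask depends only on timestamps, fast force tokens can attend to the latest available (slower) visual context at every high-frequency step, which is what enables closed-loop adjustment at the force rate.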
https://arxiv.org/abs/2512.10946
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal in environments that require policy reactivity, and which becomes hard to model as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: this http URL.
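The optimistic backup above can be illustrated with a toy tabular example: the distilled critic values a partial chunk as the maximum the original chunked critic assigns to any completion of it. All values, action names, and function names below are made up for illustration.

```python
# Toy, tabular sketch of the optimistic backup: the distilled critic for a
# partial action chunk takes the max of the original chunked critic over all
# ways of completing the chunk.

ACTIONS = ["left", "right"]

# chunked critic: Q(s, (a1, a2)) for full chunks of length 2, state fixed
Q_chunk = {
    ("left", "left"): 1.0,
    ("left", "right"): 3.0,
    ("right", "left"): 2.0,
    ("right", "right"): 0.5,
}

def distill_partial(a1: str) -> float:
    """Optimistic value of the partial chunk (a1,): max over completions a2."""
    return max(Q_chunk[(a1, a2)] for a2 in ACTIONS)

# a policy over short (length-1) chunks can act greedily on the distilled critic
best_a1 = max(ACTIONS, key=distill_partial)
```

The policy thus commits to only one action at a time (closed-loop) while its learning signal still reflects the two-step chunked backup.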
https://arxiv.org/abs/2512.10926
Sensor-based human activity recognition (HAR) mines activity patterns from time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution-shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, are OOD methods necessary for HAR? Which OOD algorithm performs best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model-selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend with new datasets, algorithms, comparisons, and analyses, in the hope of facilitating research in OOD-based HAR. Our implementation is released and can be found at this https URL.
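A minimal sketch of one of the four scenarios, the cross-person split, where entire subjects are held out so test activities come from people never seen in training. The data layout and function name are assumptions for illustration, not HAROOD's API.

```python
# Illustrative cross-person OOD split: hold out whole subjects so that the
# test set contains no sensor windows from people seen during training.

def cross_person_split(samples, test_persons):
    """samples: list of dicts with 'person' and 'window' keys."""
    train = [s for s in samples if s["person"] not in test_persons]
    test = [s for s in samples if s["person"] in test_persons]
    return train, test
```

Cross-position, cross-dataset, and cross-time splits follow the same pattern, keyed on the relevant metadata field instead of the subject ID.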
https://arxiv.org/abs/2512.10807
In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI GRAND CHALLENGE, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization, a critical component for multilingual and code-mixed scenarios. The main intent of this work was to study the real-world applicability of our in-house speaker diarization (SD) systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our own recently proposed multi-kernel consensus spectral clustering framework, which substantially improved the diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated in the pipeline. Post-processing refinements further improved system robustness.
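The consensus idea behind the clustering stage can be sketched generically: average affinity matrices built from several kernel bandwidths over speaker embeddings, then estimate the number of speakers from the eigengap of the normalized graph Laplacian. This is a standard spectral-clustering recipe, hedged as an illustration; it is not the authors' exact multi-kernel consensus framework, and all names and bandwidth values are assumptions.

```python
import numpy as np

# Generic sketch: consensus affinity from several RBF bandwidths, then the
# eigengap heuristic on the normalized Laplacian to estimate speaker count.

def consensus_affinity(emb, bandwidths=(0.5, 1.0, 2.0)):
    """Average RBF affinity matrices over multiple kernel bandwidths."""
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    return np.mean([np.exp(-d2 / (2 * b * b)) for b in bandwidths], axis=0)

def estimate_num_speakers(emb, max_k=5):
    A = consensus_affinity(emb)
    np.fill_diagonal(A, 0.0)
    d = A.sum(1)
    # symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(A)) - A / np.sqrt(d[:, None] * d[None, :])
    evals = np.sort(np.linalg.eigvalsh(L))[: max_k + 1]
    gaps = np.diff(evals)  # eigengap heuristic: largest gap marks cluster count
    return int(np.argmax(gaps)) + 1
```

In a diarization pipeline, `emb` would hold per-segment speaker embeddings from the fine-tuned embedding model, and the estimated count would feed the final clustering step.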
https://arxiv.org/abs/2512.11009