Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain "important" parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.
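To make the core idea concrete, here is a minimal sketch, assuming an EWC-style quadratic penalty; the update-variance statistic and the inverse-activity weighting are illustrative readings of the abstract, not the authors' released code.

```python
import torch

def plateau_activity(model, dataloader, optimizer, loss_fn, plateau_steps=500):
    """Track the variance of each parameter's trajectory while the loss has
    plateaued; high variance is read as a locally flat direction."""
    mean = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    sq   = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    steps = 0
    for x, y in dataloader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        with torch.no_grad():
            for n, p in model.named_parameters():
                mean[n] += p
                sq[n]   += p ** 2
        steps += 1
        if steps >= plateau_steps:
            break
    return {n: sq[n] / steps - (mean[n] / steps) ** 2 for n in mean}

def forgetting_penalty(model, anchor, activity, lam=1.0, eps=1e-8):
    """Quadratic penalty for the next task: parameters that barely moved during
    the plateau (low activity) are treated as important and constrained hardest."""
    penalty = 0.0
    for n, p in model.named_parameters():
        importance = 1.0 / (activity[n] + eps)      # inverse plateau activity
        penalty = penalty + (importance * (p - anchor[n]) ** 2).sum()
    return lam * penalty
```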
https://arxiv.org/abs/2507.08736
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit the analysis, sharing, and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach involves reversing waveform segments to distort speech content. This process is enhanced through a voice activity detection and speech separation pipeline, which allows for more precise targeting of speech. To demonstrate the effectiveness of the proposed approach, we consider a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using Sound source Classification Accuracy-Drop (SCAD) from a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves satisfactory speech intelligibility reduction (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can enhance the algorithm's robustness to attempts to recover the clean speech, at a slight cost in audio quality.
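The segment-reversal core is simple to sketch. The following is a hedged illustration assuming VAD output as (start, end) pairs in seconds; the paper's full pipeline additionally separates the speech stem from the background before reversing it.

```python
import numpy as np

def reverse_speech_segments(audio, speech_segments, sr=16000):
    """Time-reverse each detected speech interval; the rest of the scene is
    left untouched. `speech_segments` holds (start_sec, end_sec) pairs, e.g.
    produced by a VAD over the (optionally separated) speech stem."""
    out = np.array(audio, copy=True)
    for start, end in speech_segments:
        i, j = int(start * sr), int(end * sr)
        out[i:j] = out[i:j][::-1]                   # reversal distorts content
    return out

# e.g. scrambled = reverse_speech_segments(mix, [(0.4, 1.9), (3.2, 4.1)])
```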
https://arxiv.org/abs/2507.08412
Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the average F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. Finally, we discuss the findings and suggest principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking\footnote{this https URL}.
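A hedged PyTorch sketch of the ingredients named in the abstract (residual depthwise separable convolutions, a GRU, and temporal aggregation); layer sizes and the mean-pooling aggregation are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Residual depthwise-separable convolution over the time axis."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        return self.act(self.norm(self.pointwise(self.depthwise(x))) + x)

class TinierHARSketch(nn.Module):
    def __init__(self, n_sensor_channels, n_classes, hidden=64):
        super().__init__()
        self.proj = nn.Conv1d(n_sensor_channels, hidden, 1)
        self.conv = nn.Sequential(DepthwiseSeparableBlock(hidden),
                                  DepthwiseSeparableBlock(hidden))
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                       # x: (batch, channels, time)
        h = self.conv(self.proj(x)).transpose(1, 2)  # -> (batch, time, hidden)
        h, _ = self.gru(h)
        return self.head(h.mean(dim=1))         # temporal aggregation by mean-pooling
```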
https://arxiv.org/abs/2507.07949
Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society's most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI and how successfully and broadly those activities are done, and combining that with data on which occupations perform those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find that the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself performs are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
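As a toy illustration of how an applicability score could combine the measured quantities; the multiplicative aggregation and field names are hypothetical, since the abstract does not specify the paper's exact formula.

```python
def ai_applicability_score(occupation_activities, activity_stats):
    """occupation_activities: {activity: share of work time};
    activity_stats: {activity: {"success": 0..1, "scope": 0..1}} measured
    from the conversation data. Returns a single score per occupation."""
    return sum(share
               * activity_stats[a]["success"]
               * activity_stats[a]["scope"]
               for a, share in occupation_activities.items()
               if a in activity_stats)

# e.g. ai_applicability_score(
#     {"writing": 0.3, "gathering information": 0.2},
#     {"writing": {"success": 0.8, "scope": 0.6},
#      "gathering information": {"success": 0.7, "scope": 0.5}})
```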
https://arxiv.org/abs/2507.07935
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected accuracy. This study establishes that VAP can be used for turn-taking prediction in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
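A minimal sketch of what a triadic VAP output head might look like; the bin discretization and sizes are assumptions carried over from dyadic VAP formulations, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TriadicVAPHead(nn.Module):
    """Predicts future voice-activity bins for each of three speakers from a
    shared acoustic encoding. Dyadic VAP typically discretizes the upcoming
    ~2 s window into a few bins per speaker; the layout here is illustrative."""
    def __init__(self, d_model=256, n_speakers=3, n_bins=4):
        super().__init__()
        self.proj = nn.Linear(d_model, n_speakers * n_bins)
        self.n_speakers, self.n_bins = n_speakers, n_bins

    def forward(self, h):                       # h: (batch, time, d_model)
        logits = self.proj(h)                   # independent bin activations
        return torch.sigmoid(logits).view(h.size(0), h.size(1),
                                          self.n_speakers, self.n_bins)
```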
https://arxiv.org/abs/2507.07518
In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouettes help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce \textbf{DisenQ} (\textbf{Disen}tangling \textbf{Q}-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentification. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.
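A hedged sketch of the disentangling idea: separate learnable query groups cross-attend to visual features, one per factor. Dimensions and the pooling step are illustrative, and the real DisenQ additionally uses structured language guidance to supervise each group.

```python
import torch
import torch.nn as nn

class DisenQSketch(nn.Module):
    """Three learnable query groups (biometrics / motion / non-biometrics)
    cross-attend to frame features, yielding one pooled embedding per factor."""
    def __init__(self, d=512, n_queries=16):
        super().__init__()
        self.queries = nn.ParameterDict({
            k: nn.Parameter(torch.randn(n_queries, d) * 0.02)
            for k in ("biometrics", "motion", "non_biometrics")})
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, visual_feats):            # (batch, tokens, d)
        b = visual_feats.size(0)
        out = {}
        for name, q in self.queries.items():
            q = q.unsqueeze(0).expand(b, -1, -1)
            feats, _ = self.attn(q, visual_feats, visual_feats)
            out[name] = feats.mean(dim=1)       # pooled factor embedding
        return out
```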
https://arxiv.org/abs/2507.07262
Supply chain networks are complex systems that are challenging to analyze; this problem is exacerbated when there are illicit activities involved in the supply chain, such as counterfeit parts, forced labor, or human trafficking. While machine learning (ML) can find patterns in complex systems like supply chains, traditional ML techniques require large training data sets. However, illicit supply chains are characterized by very sparse data, and the data that is available is often (purposely) corrupted or unreliable in order to hide the nature of the activities. We need to be able to automatically detect new patterns that correlate with such illegal activity over complex, even temporal data, without requiring large training data sets. We explore neurosymbolic methods for identifying instances of illicit activity in supply chains and compare the effectiveness of manual and automated feature extraction from news articles accurately describing illicit activities uncovered by authorities. We propose a question tree approach for querying a large language model (LLM) to identify and quantify the relevance of articles. This enables a systematic evaluation of the differences between human and machine classification of news articles related to forced labor in supply chains.
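A minimal sketch of the question-tree idea under stated assumptions: `ask_llm` is a placeholder for whatever chat-completion client is in use, and the example questions and scores are invented for illustration.

```python
def score_article(article_text, node, ask_llm, weight=1.0):
    """Ask the LLM a yes/no question about the article; explore children only
    when the parent passes, and aggregate leaf scores into a relevance weight."""
    answer = ask_llm(f"{node['question']}\n\nArticle:\n{article_text}")
    if "yes" not in answer.lower():
        return 0.0
    if not node.get("children"):
        return weight * node.get("score", 1.0)
    return sum(score_article(article_text, child, ask_llm,
                             weight * node.get("score", 1.0))
               for child in node["children"])

tree = {"question": "Does the article describe a supply chain?", "score": 1.0,
        "children": [{"question": "Does it mention forced labor?", "score": 2.0}]}
# relevance = score_article(text, tree, ask_llm)
```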
https://arxiv.org/abs/2507.07217
Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflected the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
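The EEG-to-caption alignment can be illustrated with the standard symmetric InfoNCE objective; this is the common formulation of contrastive alignment, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between EEG and caption embeddings, with one positive
    pair per row of the batch."""
    eeg = F.normalize(eeg_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = eeg @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```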
https://arxiv.org/abs/2507.07157
Ancient populations markedly transformed Neotropical forests, yet understanding the long-term effects of ancient human management, particularly at high-resolution scales, remains challenging. In this work we propose a new approach to investigate archaeological areas of influence based on vegetation signatures. It consists of a deep learning model trained on satellite imagery to identify palm trees, followed by a clustering algorithm to identify palm clusters, which are then used to estimate ancient management areas. To assess the palm distribution in relation to past human activity, we applied the proposed approach to unique high-resolution satellite imagery data covering 765 km² of the Sierra Nevada de Santa Marta, Colombia. With this work, we also release a manually annotated palm tree dataset along with estimated locations of archaeological sites from ground-surveys and legacy records. Results demonstrate how palms were significantly more abundant near archaeological sites showing large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced local vegetation fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise less accessible locations. Overall, this study demonstrates the potential of integrating artificial intelligence approaches with new ecological and archaeological data to identify archaeological areas of interest through vegetation patterns, revealing fine-scale human-environment interactions.
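A hedged sketch of the clustering step: detected palm coordinates are grouped with DBSCAN and each cluster's footprint is estimated from its convex hull. The distance and density parameters are invented for illustration, not the paper's settings.

```python
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN

def palm_cluster_areas(palm_xy_m, eps_m=150, min_palms=10):
    """Cluster detected palm coordinates (an (N, 2) numpy array in metres)
    and estimate each cluster's footprint from its convex hull."""
    labels = DBSCAN(eps=eps_m, min_samples=min_palms).fit_predict(palm_xy_m)
    areas = {}
    for k in set(labels) - {-1}:                # label -1 is DBSCAN noise
        pts = palm_xy_m[labels == k]
        if len(pts) >= 3:
            areas[k] = ConvexHull(pts).volume   # a 2-D hull's "volume" is its area
    return labels, areas
```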
https://arxiv.org/abs/2507.06949
We present Gradientsys, a next-generation multi-agent scheduling framework that coordinates diverse specialized AI agents using a typed Model-Context Protocol (MCP) and a ReAct-based dynamic planning loop. At its core, Gradientsys employs an LLM-powered scheduler for intelligent one-to-many task dispatch, enabling parallel execution of heterogeneous agents such as PDF parsers, web search modules, GUI controllers, and web builders. The framework supports hybrid synchronous/asynchronous execution, respects agent capacity constraints, and incorporates a robust retry-and-replan mechanism to handle failures gracefully. To promote transparency and trust, Gradientsys includes an observability layer streaming real-time agent activity and intermediate reasoning via Server-Sent Events (SSE). We offer an architectural overview and evaluate Gradientsys against existing frameworks in terms of extensibility, scheduling topology, tool reusability, parallelism, and observability. Experiments on the GAIA general-assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs compared to a MinionS-style baseline, demonstrating the strength of its LLM-driven multi-agent orchestration.
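A hedged sketch of one-to-many dispatch with retry-and-replan; the scheduler and agent interfaces here are illustrative stand-ins, not Gradientsys' actual API.

```python
import asyncio

async def run_task(task, plan_fn, replan_fn, synthesize_fn, agents, max_retries=2):
    """plan_fn/replan_fn return a list of (agent_name, subtask) steps produced
    by the LLM scheduler; `agents` maps names to async callables."""
    steps = await plan_fn(task, list(agents))
    for _ in range(max_retries + 1):
        try:
            results = await asyncio.gather(              # parallel one-to-many dispatch
                *(agents[name](subtask) for name, subtask in steps))
            return await synthesize_fn(task, results)
        except Exception as failure:                     # retry-and-replan on failure
            steps = await replan_fn(task, steps, failure)
    raise RuntimeError("task failed after retries")
```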
https://arxiv.org/abs/2507.06520
Wearable cameras are increasingly used as an observational and interventional tool for human behaviors by providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for behavior logging or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power, impacting battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video). Our results show that, using only 3% of the original RGB video data, our method captures all the activity segments and achieves a hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
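A toy version of the adaptive sampling controller: the RGB rate is raised when successive low-resolution thermal frames change enough to suggest an activity transition. The threshold and frame rates are illustrative, not THOR's tuned values.

```python
def rgb_sampling_rates(thermal_frames, base_fps=0.5, burst_fps=5.0, thresh=0.15):
    """Per-frame RGB sampling rate from a stream of low-resolution thermal
    frames (2-D numpy arrays scaled to [0, 1])."""
    rates, prev = [], None
    for frame in thermal_frames:
        if prev is not None and abs(frame - prev).mean() > thresh:
            rates.append(burst_fps)             # likely activity transition: sample densely
        else:
            rates.append(base_fps)              # sustained activity: sample sparsely
        prev = frame
    return rates
```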
https://arxiv.org/abs/2507.06442
Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, in which each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
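A compact PyTorch sketch of the three-module decomposition; encoder depths, head shapes, and the pooling are guesses from the abstract, not the released PCL-Former.

```python
import torch.nn as nn

class PCLFormerSketch(nn.Module):
    """One dedicated transformer encoder per subtask: proposal scoring,
    action classification, and boundary regression."""
    def __init__(self, d=256, n_classes=20):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.proposal, self.classify, self.localize = enc(), enc(), enc()
        self.actionness = nn.Linear(d, 1)       # proposal score per snippet
        self.cls_head = nn.Linear(d, n_classes)
        self.boundary = nn.Linear(d, 2)         # (start, end) regression

    def forward(self, snippets):                # snippets: (batch, time, d)
        prop = self.actionness(self.proposal(snippets)).squeeze(-1)
        cls = self.cls_head(self.classify(snippets).mean(dim=1))
        bnd = self.boundary(self.localize(snippets).mean(dim=1))
        return prop, cls, bnd
```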
https://arxiv.org/abs/2507.06411
Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation, serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with a decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8% in accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.
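A hedged sketch of the decoupled two-stage idea under stated assumptions: the encoder is first trained on simulated (digital-twin) signals with a self-supervised objective, so the synthetic data need not be label-aligned, and a classifier head is then fitted on the small labeled real dataset. `ssl_loss` and `cls_loss` are placeholders.

```python
import torch

def train_two_stage(encoder, head, synthetic_loader, labeled_loader,
                    ssl_loss, cls_loss, epochs=(10, 10)):
    opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    for _ in range(epochs[0]):                  # stage 1: simulation-driven pretraining
        for x_sim in synthetic_loader:
            opt_enc.zero_grad()
            ssl_loss(encoder, x_sim).backward()
            opt_enc.step()
    opt_head = torch.optim.Adam(head.parameters(), lr=1e-3)
    encoder.eval()                              # stage 2: freeze encoder, fit head
    for _ in range(epochs[1]):
        for x, y in labeled_loader:
            with torch.no_grad():
                z = encoder(x)
            opt_head.zero_grad()
            cls_loss(head(z), y).backward()
            opt_head.step()
```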
https://arxiv.org/abs/2507.06405
Deep brain stimulation (DBS) is an established intervention for Parkinson's disease (PD), but conventional open-loop systems lack adaptability, are energy-inefficient due to continuous stimulation, and provide limited personalization to individual neural dynamics. Adaptive DBS (aDBS) offers a closed-loop alternative, using biomarkers such as beta-band oscillations to dynamically modulate stimulation. While reinforcement learning (RL) holds promise for personalized aDBS control, existing methods suffer from high sample complexity, unstable exploration in binary action spaces, and limited deployability on resource-constrained hardware. We propose SEA-DBS, a sample-efficient actor-critic framework that addresses the core challenges of RL-based adaptive neurostimulation. SEA-DBS integrates a predictive reward model to reduce reliance on real-time feedback and employs Gumbel Softmax-based exploration for stable, differentiable policy updates in binary action spaces. Together, these components improve sample efficiency, exploration robustness, and compatibility with resource-constrained neuromodulatory hardware. We evaluate SEA-DBS on a biologically realistic simulation of Parkinsonian basal ganglia activity, demonstrating faster convergence, stronger suppression of pathological beta-band power, and resilience to post-training FP16 quantization. Our results show that SEA-DBS offers a practical and effective RL-based aDBS framework for real-time, resource-constrained neuromodulation.
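The Gumbel-Softmax exploration component is easy to illustrate; this is the standard straight-through formulation for a binary on/off action, matching the mechanism named in the abstract.

```python
import torch
import torch.nn.functional as F

def sample_stim_action(logits, tau=1.0, hard=True):
    """Differentiable on/off stimulation decision via Gumbel-Softmax, keeping
    exploration stable in a binary action space. `logits` has shape
    (batch, 2): [no-stim, stim]."""
    y = F.gumbel_softmax(logits, tau=tau, hard=hard)   # straight-through sample
    return y[:, 1]                                     # 1.0 when stimulating
```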
https://arxiv.org/abs/2507.06326
In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.
https://arxiv.org/abs/2507.06080
Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.
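For reference, the underlying PHQ-9 scoring rule that the chatbot administers is fixed and public; a direct implementation:

```python
def phq9_severity(item_scores):
    """Standard PHQ-9 scoring: nine items rated 0-3; the total maps onto the
    usual severity bands. (HopeBot administers the items conversationally,
    but the scoring rule itself is unchanged.)"""
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    for cutoff, label in ((4, "minimal"), (9, "mild"), (14, "moderate"),
                          (19, "moderately severe"), (27, "severe")):
        if total <= cutoff:
            return total, label

# e.g. phq9_severity([1, 2, 1, 0, 1, 2, 1, 0, 0]) -> (8, "mild")
```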
https://arxiv.org/abs/2507.05984
Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
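A minimal sketch of the two pieces named in the abstract: a complex-to-image mapping and a frozen vision backbone with a lightweight trainable head. The real/imaginary/magnitude channel layout is one plausible mapping, not necessarily the paper's.

```python
import torch
import torch.nn as nn

def csi_to_image(csi):
    """Map a complex CSI matrix (antennas x subcarriers) to a 3-channel
    'image' a vision backbone accepts."""
    chans = torch.stack([csi.real, csi.imag, csi.abs()])
    chans = (chans - chans.mean()) / (chans.std() + 1e-6)
    return chans.unsqueeze(0)                   # (1, 3, H, W)

class FrozenLVMWithHead(nn.Module):
    def __init__(self, backbone, feat_dim, out_dim):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():    # LVM stays frozen: no fine-tuning
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, out_dim)  # lightweight trainable layer

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)
```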
https://arxiv.org/abs/2507.05121
We investigate the use of Long Short-Term Memory (LSTM) and Decomposition-LSTM (DLSTM) networks, combined with an ensemble algorithm, to predict solar flare occurrences using time-series data from the GOES catalog. The dataset spans from 2003 to 2023 and includes 151,071 flare events. Among approximately possible patterns, 7,552 yearly pattern windows are identified, highlighting the challenge of long-term forecasting due to the Sun's complex, self-organized criticality-driven behavior. A sliding window technique is employed to detect temporal quasi-patterns in both irregular and regularized flare time series. Regularization reduces complexity, enhances large flare activity, and captures active days more effectively. To address class imbalance, resampling methods are applied. LSTM and DLSTM models are trained on sequences of peak fluxes and waiting times from irregular time series, while LSTM and DLSTM, integrated with an ensemble approach, are applied to sliding windows of regularized time series with a 3-hour interval. Performance metrics, particularly TSS (0.74), recall (0.95) and the area under the curve (AUC=0.87) in the receiver operating characteristic (ROC), indicate that DLSTM with an ensemble approach on regularized time series outperforms other models, offering more accurate large-flare forecasts with fewer false errors compared to models trained on irregular time series. The superior performance of DLSTM is attributed to its ability to decompose time series into trend and seasonal components, effectively isolating random noise. This study underscores the potential of advanced machine learning techniques for solar flare prediction and highlights the importance of incorporating various solar cycle phases and resampling strategies to enhance forecasting reliability.
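The decomposition step behind DLSTM-style models can be sketched as a simple additive trend/seasonal/residual split; the 27-day period (roughly one solar rotation) is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

def decompose(series, period=27):
    """Additive decomposition: moving-average trend, periodic seasonal mean,
    and a residual that isolates the remaining noise."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")
    detrended = series - trend
    seasonal = np.array([detrended[i % period::period].mean()
                         for i in range(len(series))])
    residual = detrended - seasonal
    return trend, seasonal, residual
```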
https://arxiv.org/abs/2507.05313
Question-answering (QA) interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC system insights, particularly for non-expert users. However, enabling accurate, real-time, and context-aware interactions with HVAC systems introduces unique challenges, including the integration of frequently updated sensor data, domain-specific knowledge grounding, and coherent multi-stage reasoning. In this paper, we present JARVIS, a two-stage LLM-based QA framework tailored for sensor data-driven HVAC system interaction. JARVIS employs an Expert-LLM to translate high-level user queries into structured execution instructions, and an Agent that performs SQL-based data retrieval, statistical processing, and final response generation. To address HVAC-specific challenges, JARVIS integrates (1) an adaptive context injection strategy for efficient HVAC and deployment-specific information integration, (2) a parameterized SQL builder and executor to improve data access reliability, and (3) a bottom-up planning scheme to ensure consistency across multi-stage response generation. We evaluate JARVIS using real-world data collected from a commercial HVAC system and a ground truth QA dataset curated by HVAC experts to demonstrate its effectiveness in delivering accurate and interpretable responses across diverse queries. Results show that JARVIS consistently outperforms baseline and ablation variants in both automated and user-centered assessments, achieving high response quality and accuracy.
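A hedged sketch of the parameterized SQL builder idea: identifiers are whitelisted and values are bound as parameters, so LLM-produced instructions never reach raw SQL. The table and column names are hypothetical.

```python
import sqlite3

def build_query(table, columns, conditions):
    """Build a parameterized SELECT from structured execution instructions."""
    allowed = {"sensor_readings": {"zone", "timestamp", "supply_temp", "setpoint"}}
    assert table in allowed and set(columns) <= allowed[table]
    bound = [c for c in conditions if c in allowed[table]]
    where = " AND ".join(f"{c} = ?" for c in bound)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, tuple(conditions[c] for c in bound)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings "
             "(zone TEXT, timestamp TEXT, supply_temp REAL, setpoint REAL)")
sql, params = build_query("sensor_readings",
                          ["timestamp", "supply_temp"], {"zone": "AHU-1"})
rows = conn.execute(sql, params).fetchall()     # safe, parameterized execution
```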
https://arxiv.org/abs/2507.04748
Current paradigms in Artificial Intelligence rely on layers of feedforward networks which model brain activity at the neuronal level. We conjecture that expanding to the level of multiple brain regions with chemical signaling may be a productive step toward understanding the emergence of consciousness. We propose LILITH, a novel architecture that combines developmental training of modular language models with brain-inspired token-based communication protocols, mirroring chemical signaling in the brain. Our approach models distinct brain regions as specialized LLM modules including thinking, memory, sensory, and regulatory components that communicate through emergent token-based signaling protocols analogous to neurotransmitter networks. Unlike traditional pre-trained systems, LILITH would employ developmental training where untrained LLM architectures learn through simulated life experiences, developing communication pathways and cognitive abilities through environmental interaction and evolutionary optimization. This framework would enable direct empirical investigation of consciousness emergence using Integrated Information Theory metrics while providing unprecedented insight into inter-module signaling patterns during development. By optimizing for consciousness emergence rather than task performance, LILITH could provide insight into different emergent phenomena at multiple levels of neural correlates, contrasting neuronal-level processing with multi-region coordination dynamics. The goal of this paper is to put the idea forward while recognizing the substantial challenges in implementing such a system.
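Since this is a proposal rather than an implementation, the following is only a toy sketch of token-based signaling between modules on named channels, loosely analogous to neurotransmitter pathways; every name here is invented.

```python
from collections import defaultdict

class TokenBus:
    """Named channels over which modules exchange short token messages."""
    def __init__(self):
        self.channels = defaultdict(list)

    def emit(self, channel, tokens):
        self.channels[channel].append(tokens)

    def drain(self, channel):
        msgs, self.channels[channel] = self.channels[channel], []
        return msgs

class EchoModule:
    """Placeholder module: forwards every message to the 'memory' channel."""
    def respond(self, msg):
        return [("memory", msg)]

def step(modules, bus):
    """One signaling round: every module reads its inbox and may emit tokens."""
    for name, module in modules.items():
        for msg in bus.drain(name):
            for target, tokens in module.respond(msg):
                bus.emit(target, tokens)
```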
https://arxiv.org/abs/2507.04575