Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire pipeline runs on modest hardware with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success rate; sub-action inference completes in < 3.8 s, and end-to-end executions take 30-60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
https://arxiv.org/abs/2602.09940
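The parsing stage above maps a free-form command to an ordered sub-action sequence. As a minimal illustration of that interface only — a hypothetical verb lookup standing in for the learned BiLSTM-attention model, with an action vocabulary guessed from the four task names:

```python
# Hypothetical verb -> atomic-action lookup; the real Instruct2Act model
# is a learned BiLSTM with multi-head attention, not a rule table.
ATOMIC = {
    "put":  ["reach", "grasp", "move", "place"],
    "pour": ["reach", "grasp", "move", "pour", "place"],
    "wipe": ["reach", "grasp", "wipe", "place"],
    "give": ["reach", "grasp", "move", "give"],
}

def parse_instruction(text):
    """Return the ordered sub-action sequence for the first known verb."""
    for token in text.lower().split():
        if token in ATOMIC:
            return ATOMIC[token]
    return []
```

The learned model replaces the lookup with a sequence decoder, but the downstream contract is the same: the RAN consumes one atomic action at a time.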
Computer vision (CV) has been used for environmental classification during gait and often informs control in assistive systems; however, the ability to predict how the foot will contact a changing environment remains underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center of pressure (COP) and time of impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250 ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean absolute error (MAE) at 150, 100, and 50 ms FH was 29.42, 26.82, and 23.72 mm, respectively. The TOI MAE at the same horizons was 21.14, 20.08, and 17.73 ms, respectively. While torso velocity had no effect on the error for either target, faster toe-swing speeds prior to foot-strike improved COP prediction accuracy but had no significant effect on TOI. Further, more anterior foot-strikes reduced COP prediction accuracy but did not affect TOI prediction accuracy. Our lightweight model was also capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data is feasible with a lightweight model, which may have important implications for anticipatory control in assistive systems.
https://arxiv.org/abs/2602.09209
Linear recurrent neural networks (LRNNs) provide a structured approach to sequence modeling that bridges classical linear dynamical systems and modern deep learning, offering both expressive power and theoretical guarantees on stability and trainability. In recent years, multiple LRNN-based architectures have been proposed, each introducing distinct parameterizations, discretization schemes, and implementation constraints. However, existing implementations are fragmented across different software frameworks, often rely on framework-specific optimizations, and in some cases require custom CUDA kernels or lack publicly available code altogether. As a result, using, comparing, or extending LRNNs requires substantial implementation effort. To address this, we introduce $\texttt{lrnnx}$, a unified software library that implements several modern LRNN architectures under a common interface. The library exposes multiple levels of control, allowing users to work directly with core components or higher-level model abstractions. $\texttt{lrnnx}$ aims to improve accessibility, reproducibility, and extensibility of LRNN research and applications. We make our code available under a permissive MIT license.
https://arxiv.org/abs/2602.08810
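The core object a library like this wraps is the linear recurrence itself. A minimal single-channel sketch of that recurrence — a hypothetical interface for illustration, not lrnnx's actual API:

```python
def lrnn_scan(a, b, xs, h0=0.0):
    """Run a single-channel linear recurrence h_t = a*h_{t-1} + b*x_t.

    |a| < 1 keeps the recurrence stable, mirroring the stability
    guarantees that LRNN parameterizations are designed to enforce.
    """
    h = h0
    hs = []
    for x in xs:
        h = a * h + b * x
        hs.append(h)
    return hs

# A decaying state driven by a unit impulse: output at step t is b * a**(t-1).
out = lrnn_scan(a=0.5, b=1.0, xs=[1.0, 0.0, 0.0, 0.0])
```

Because the recurrence is linear, the same computation can also be evaluated with a parallel scan, which is what makes these layers fast to train; the different published architectures vary mainly in how `a` and `b` are parameterized and discretized.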
In Chomsky's provocative critique "The False Promise of ChatGPT," large language models (LLMs) are characterized as mere pattern predictors that do not acquire language via the intrinsic causal and self-correction structures that humans use, and are therefore unable to distinguish impossible languages. The critique stands as representative of a fundamental challenge to the intellectual foundations of AI, for it synthesizes major methodological issues surrounding LLMs and embodies an iconic a priori rationalist perspective. We examine this famous critique both through the pre-existing linguistics and psychology literature and through an experiment probing the capacity of LLMs to learn possible and impossible languages. We constructed a set of syntactically impossible languages by applying transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows that GPT-2 small models underperform on all of the impossible languages relative to their performance on the possible language (p<.001). LSTM models' performance, by contrast, tallies with Chomsky's argument, suggesting the irreplaceable role of the transformer architecture. Based on the theoretical analysis and empirical findings, we propose a new vision of LLMs within Chomsky's theory, and a shift of theoretical paradigm outside it: from his "rationalist-romantic" paradigm to functionalism and empiricism in LLM research.
https://arxiv.org/abs/2602.08437
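The two transformations named above can be made concrete. A sketch under stated assumptions: the abstract does not specify where the parity-conditioned negation is placed, so the placement rule below is illustrative only.

```python
def reverse_sentence(tokens):
    # "Reversal" impossible language: emit every sentence backwards.
    return tokens[::-1]

def parity_negation(tokens, neg="not"):
    # Illustrative parity rule (assumed, not the paper's exact rule):
    # insert the negation marker after the second word when the word
    # count is even, after the first word otherwise.
    pos = 2 if len(tokens) % 2 == 0 else 1
    return tokens[:pos] + [neg] + tokens[pos:]
```

Both rules are deterministic and counting-based rather than hierarchical, which is precisely what makes the resulting languages "impossible" in Chomsky's sense: no attested human language conditions syntax on global word counts or full reversal.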
Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
https://arxiv.org/abs/2602.07698
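The evaluation metric above, relative Levenshtein edit distance, is straightforward to compute; normalizing by the target length (one common convention, assumed here) makes scores comparable across log lines of different lengths, with 0 meaning an exact parse.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def relative_edit_distance(pred, target):
    # Length-normalized score; the max() guard avoids division by zero
    # on empty targets.
    return levenshtein(pred, target) / max(len(target), 1)
```

Lower is better, matching the reported means (0.111 for the Transformer down to 0.265 for the bi-LSTM).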
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
https://arxiv.org/abs/2602.07160
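One plausible scalar-channel reading of the free-energy read described above: tilt the prior p by exp(beta * v) per channel and evaluate the log-sum-exp, so the learnable inverse temperature beta interpolates between the ordinary convex average and per-channel selection. This is a sketch consistent with the abstract, not the paper's exact parameterization.

```python
import math

def fem_read(prior, values, beta):
    """Free-energy read of one channel: (1/beta) * log(sum_i p_i * exp(beta*v_i)).

    beta -> 0 recovers the ordinary convex average sum_i p_i * v_i;
    large beta approaches the max over indices with nonzero prior mass.
    """
    if beta == 0.0:
        return sum(p * v for p, v in zip(prior, values))
    m = max(values)  # subtract the max for numerical stability
    s = sum(p * math.exp(beta * (v - m)) for p, v in zip(prior, values))
    return m + math.log(s) / beta
```

The prior itself is untouched, which is why the asymptotic cost matches the attention variant that produced it; only the read changes.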
Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. INT8 post training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory constrained edge devices.
https://arxiv.org/abs/2602.06523
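The INT8 post-training step can be sketched with a symmetric per-tensor scheme — a common microcontroller deployment recipe, assumed here rather than taken from the paper:

```python
def quantize_int8(weights):
    # Symmetric post-training quantization: map floats to int8 with a
    # single per-tensor scale (illustrative scheme, not the paper's
    # exact recipe).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Reconstruction error per weight is bounded by scale / 2.
    return [qi * scale for qi in q]
```

At one byte per weight, 11.4K parameters occupy roughly 11 KB, so most of the reported 23.0 KB footprint is plausibly weights plus runtime buffers; the quantization itself costs only the 0.21% F1 drop reported above.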
Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module such as gated recurrent units (GRUs), long short-term memories (LSTMs), echo state networks (ESNs), and their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
https://arxiv.org/abs/2602.06271
The increasing penetration of photovoltaic (PV) generation introduces significant uncertainty into power system operation, necessitating forecasting approaches that extend beyond deterministic point predictions. This paper proposes an any-quantile probabilistic forecasting framework for multi-regional PV power generation based on the Any-Quantile Recurrent Neural Network (AQ-RNN). The model integrates an any-quantile forecasting paradigm with a dual-track recurrent architecture that jointly processes series-specific and cross-regional contextual information, supported by dilated recurrent cells, patch-based temporal modeling, and a dynamic ensemble mechanism. The proposed framework enables the estimation of calibrated conditional quantiles at arbitrary probability levels within a single trained model and effectively exploits spatial dependencies to enhance robustness at the system level. The approach is evaluated using 30 years of hourly PV generation data from 259 European regions and compared against established statistical and neural probabilistic baselines. The results demonstrate consistent improvements in forecast accuracy, calibration, and prediction interval quality, underscoring the suitability of the proposed method for uncertainty-aware energy management and operational decision-making in renewable-dominated power systems.
https://arxiv.org/abs/2602.05660
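Training a single model to emit calibrated quantiles at arbitrary levels typically relies on the quantile (pinball) loss evaluated at probability levels sampled during training; a minimal sketch of that loss (the AQ-RNN's exact training recipe may differ):

```python
def pinball_loss(y, y_hat, q):
    """Quantile (pinball) loss at level q in (0, 1).

    Minimizing its expectation drives y_hat toward the conditional
    q-quantile of y, which is what lets one trained model serve any
    requested probability level.
    """
    diff = y - y_hat
    return max(q * diff, (q - 1.0) * diff)
```

At q = 0.9, under-prediction is penalized nine times more heavily than over-prediction, pushing the forecast toward the conditional 90th percentile.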
Electricity price forecasting (EPF) is essential for energy market stakeholders (e.g., grid operators, energy traders, policymakers) but remains challenging due to the inherent volatility and nonlinearity of price signals. Traditional statistical and deep learning (DL) models often struggle to capture complex temporal dependencies and integrate heterogeneous data effectively. While time series foundation models (TSFMs) have shown strong performance in general time series forecasting tasks such as traffic and weather forecasting, their effectiveness in day-ahead EPF, particularly in volatile markets, remains underexplored. This paper presents a spike regularization strategy and evaluates a wide range of TSFMs, including Tiny Time Mixers (TTMs), MOIRAI, MOMENT, and TimesFM, against traditional statistical and DL models such as Autoregressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM), and Convolutional Neural Network-LSTM (CNN-LSTM), using half-hourly wholesale market data with volatile trends from Singapore. Exogenous factors (e.g., weather and calendar variables) are also incorporated into models where applicable. Results demonstrate that TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in MAPE across various evaluation settings. The findings offer practical guidance for improving forecast accuracy and decision-making in volatile electricity markets.
https://arxiv.org/abs/2602.05430
Time-series foundation models have emerged as a new paradigm for forecasting, yet their ability to effectively leverage exogenous features -- critical for electricity demand forecasting -- remains unclear. This paper empirically evaluates foundation models capable of modeling cross-channel correlations against a baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities. We systematically assess MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos-2 under three feature configurations: all features, selected features, and target-only. Our findings reveal highly variable effectiveness: while Chronos-2 achieves the best performance among foundation models (in zero-shot settings), the simple baseline frequently outperforms all foundation models in Singapore's stable climate, particularly for short-term horizons. Model architecture proves critical, with synergistic architectural implementations (TTM's channel-mixing, Chronos-2's grouped attention) consistently leveraging exogenous features, while other approaches show inconsistent benefits. Geographic context emerges as equally important, with foundation models demonstrating advantages primarily in variable climates. These results challenge assumptions about universal foundation model superiority and highlight the need for domain-specific models, specifically in the energy domain.
https://arxiv.org/abs/2602.05390
MOOC recommendation systems have received increasing attention to help learners navigate and select preferred learning content. Traditional methods such as collaborative filtering and content-based filtering suffer from data sparsity and over-specialization. To alleviate these limitations, graph-based approaches have been proposed; however, they still rely heavily on manually predefined metapaths, which often capture only superficial structural relationships and impose substantial burdens on domain experts as well as significant engineering costs. To overcome these limitations, we propose AMR (Aspect-aware MOOC Recommendation), a novel framework that models path-specific multiple aspects by embedding the semantic content of nodes within each metapath. AMR automatically discovers metapaths through bi-directional walks, derives aspect-aware path representations using a bi-LSTM-based encoder, and incorporates these representations as edge features in the learner-learner and KC-KC subgraphs to achieve fine-grained, semantically informed KC recommendations. Extensive experiments on the large-scale MOOCCube and PEEK datasets show that AMR consistently outperforms state-of-the-art graph neural network baselines across key metrics such as HR@K and nDCG@K. Further analysis confirms that AMR effectively captures rich path-specific aspect information, allowing more accurate recommendations than methods that rely solely on predefined metapaths. The code will be made available upon acceptance.
https://arxiv.org/abs/2602.05297
Coastal hypoxia, especially in the northern Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine-scale variability needed for daily, responsive ecosystem management. We present a study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short-Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio-Temporal Transformer (ST-Transformer), and Temporal Convolutional Network (TCN). We trained our models on twelve years (2009-2020) of daily hindcast data from a coupled hydrodynamic-biogeochemical model, and used hindcast data from 2020 through 2024 as test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates, and evaluated each architecture using the same data preprocessing, input/output formulation, and validation protocols. Every model achieved high classification accuracy and strong discriminative ability, with ST-Transformer achieving the highest performance across all metrics and test periods (AUC-ROC: 0.982-0.992). We also employed McNemar's method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real-time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available at this https URL
https://arxiv.org/abs/2602.05178
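The McNemar comparison used above depends only on the discordant pair counts between two classifiers on the same test set; a minimal sketch with the common continuity correction:

```python
def mcnemar(b, c):
    """McNemar's chi-square statistic (with continuity correction).

    b = count of cases model A got right and model B got wrong;
    c = the reverse. Under the null of equal accuracy the statistic
    follows chi-square with 1 df, so values above 3.84 are significant
    at alpha = 0.05.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Because the test conditions on the discordant pairs only, it is well suited to comparing classifiers evaluated on identical daily hindcast samples.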
Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these policies interact with each other and with different hardware specifications remains poorly understood. To address this gap, we develop \textbf{SpecMD}, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access does not follow the temporal-locality assumptions behind policies such as LRU and LFU. Motivated by this observation, we propose \textbf{Least-Stale}, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU. With such gains, we achieve over $88\%$ hit rates and up to $34.7\%$ time-to-first-token (TTFT) reduction on OLMoE at only $5\%$, or $0.6$ GB, of VRAM cache capacity.
https://arxiv.org/abs/2602.03921
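The observation that expert access is predictable suggests Belady-style eviction: evict the cached expert whose next predicted use is furthest away. The sketch below illustrates that idea only; the paper's Least-Stale policy may define its staleness signal differently, and `future_schedule` is a hypothetical predicted access list.

```python
def evict_predictive(cached, future_schedule):
    """Pick an eviction victim from a list of cached expert ids.

    Evicts the expert whose next predicted use is furthest in the
    future (experts never predicted to be used again go first) -- a
    Belady-style sketch of prediction-driven eviction.
    """
    def next_use(expert):
        try:
            return future_schedule.index(expert)
        except ValueError:
            return float("inf")  # never reused: ideal victim
    return max(cached, key=next_use)
```

Under an LRU policy the victim would instead be the least recently touched expert, which is exactly the temporal-locality assumption the experiments above show does not hold for MoE routing.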
Lane detection is a crucial perception task for all levels of automated vehicles (AVs) and Advanced Driver Assistance Systems, particularly in mixed-traffic environments where AVs must interact with human-driven vehicles (HDVs) and challenging traffic scenarios. Current methods lack versatility in delivering accurate, robust, and real-time-compatible lane detection; in particular, vision-based methods often neglect critical regions of the image and their spatial-temporal (ST) salience, leading to poor performance in difficult circumstances such as serious occlusion and dazzle lighting. This study introduces a novel sequential neural network model with a spatial-temporal attention mechanism that focuses on key features of lane lines and exploits salient ST correlations among continuous image frames. The proposed model, built on a standard encoder-decoder structure and common neural network backbones, is trained and evaluated on three large-scale open-source datasets. Extensive experiments demonstrate the strength and robustness of the proposed model, which outperforms state-of-the-art methods in various testing scenarios. Furthermore, with the ST attention mechanism, the developed sequential neural network models exhibit fewer parameters and reduced Multiply-Accumulate Operations (MACs) compared to baseline sequential models, highlighting their computational efficiency. Relevant data, code, and models are released at this https URL.
https://arxiv.org/abs/2602.03669
Manual endoscopic submucosal dissection (ESD) is technically demanding, and existing single-segment robotic tools offer limited dexterity, motivating more advanced solutions. To address this, this work developed DESectBot, a novel dual-segment continuum robot with a decoupled structure and integrated surgical forceps, enabling 6-degree-of-freedom (DoF) tip dexterity for improved lesion targeting in ESD. Deep learning controllers based on gated recurrent units (GRUs) were proposed for simultaneous tip position and orientation control, effectively handling the nonlinear coupling between continuum segments. The GRU controller was benchmarked against Jacobian-based inverse kinematics, model predictive control (MPC), a feedforward neural network (FNN), and a long short-term memory (LSTM) network. In nested-rectangle and Lissajous trajectory tracking tasks, the GRU achieved the lowest position/orientation RMSEs: 1.11 mm / 4.62° and 0.81 mm / 2.59°, respectively. For orientation control at a fixed position (four target poses), the GRU attained a mean RMSE of 0.14 mm and 0.72°, outperforming all alternatives. In a peg transfer task, the GRU achieved a 100% success rate (120 successes / 120 attempts) with an average transfer time of 11.8 s, with a standard deviation significantly lower than that of novice-controlled systems. Additionally, an ex vivo ESD demonstration, in which the robot grasped, elevated, and resected tissue as the scalpel completed the cut, confirmed that DESectBot provides sufficient stiffness to divide thick gastric mucosa and an operative workspace adequate for large lesions (this http URL). These results confirm that GRU-based control significantly enhances precision, reliability, and usability in ESD surgical training scenarios.
https://arxiv.org/abs/2602.03406
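The paper above does not describe its controller at the code level; the following is only a minimal NumPy sketch of the GRU recurrence that such a controller builds on, mapping a history of tip-pose errors to actuator commands. The 6-D error input, 16-unit hidden state, 4-D actuator output, and all weight initializations are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRUController:
    """Minimal GRU mapping a history of tip-pose errors to actuator commands.
    Hypothetical sketch: dimensions and weights are illustrative only."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        def mat(r, c):
            return 0.1 * rng.standard_normal((r, c))
        # Three gate/candidate parameter sets: update (z), reset (r), candidate (h~)
        self.Wz, self.Uz, self.bz = mat(n_hidden, n_in), mat(n_hidden, n_hidden), np.zeros(n_hidden)
        self.Wr, self.Ur, self.br = mat(n_hidden, n_in), mat(n_hidden, n_hidden), np.zeros(n_hidden)
        self.Wh, self.Uh, self.bh = mat(n_hidden, n_in), mat(n_hidden, n_hidden), np.zeros(n_hidden)
        self.Wo = mat(n_out, n_hidden)  # linear readout into actuator space

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)      # how much to update the state
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)      # how much history to expose
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1.0 - z) * h + z * h_cand

    def run(self, errors):
        """errors: (T, n_in) sequence of pose errors -> (T, n_out) commands."""
        h = np.zeros(self.bz.shape[0])
        out = []
        for x in errors:
            h = self.step(x, h)
            out.append(self.Wo @ h)
        return np.array(out)
```

Because the hidden state carries the recent error history, the readout can compensate for the inter-segment coupling that a memoryless (e.g., Jacobian-based) map cannot capture.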
Deploying pretrained policies in real-world applications presents substantial challenges that fundamentally limit the practical applicability of learning-based control systems. When autonomous systems encounter changes in the environment, the system dynamics, sensor calibration, or task objectives, fixed policies rapidly degrade in performance. We show that employing Real-Time Recurrent Reinforcement Learning (RTRRL), a biologically plausible algorithm for online adaptation, can effectively fine-tune a pretrained policy to improve autonomous agents' performance on driving tasks. We further show that RTRRL synergizes with a recent biologically inspired recurrent network model, the Liquid-Resistance Liquid-Capacitance RNN. We demonstrate the effectiveness of this closed-loop approach in a simulated CarRacing environment and in a real-world line-following task with a RoboRacer car equipped with an event camera.
https://arxiv.org/abs/2602.02236
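RTRRL builds on real-time recurrent learning (RTRL), which propagates parameter sensitivities forward in time so gradients are available at every step, with no backward pass over a stored trajectory. The sketch below shows only that underlying forward-mode idea for a one-unit RNN; the paper's actual algorithm adds the reinforcement learning machinery and biologically plausible approximations, and the function name here is hypothetical.

```python
import numpy as np

def run_rtrl(w, u, xs):
    """One-unit RNN h_t = tanh(w*h_{t-1} + u*x_t), carrying forward-mode
    sensitivities s_w = dh/dw and s_u = dh/du at every step (RTRL)."""
    h, s_w, s_u = 0.0, 0.0, 0.0
    for x in xs:
        h_new = np.tanh(w * h + u * x)
        d = 1.0 - h_new ** 2              # tanh'(pre-activation)
        s_w = d * (h + w * s_w)           # chain rule through h_{t-1}(w)
        s_u = d * (x + w * s_u)           # chain rule through h_{t-1}(u)
        h = h_new
    return h, s_w, s_u
```

Because `s_w` and `s_u` are exact at each timestep, an online learner can take a gradient step immediately after every observation, which is what makes this family of methods suitable for closed-loop fine-tuning on a running system.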
Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% $\pm$ 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% $\pm$ 2.98% to 96.9% $\pm$ 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.
https://arxiv.org/abs/2602.01855
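Two of the ingredients named above have simple, well-known forms that can be sketched directly: the Time2Vec embedding (one linear component plus learnable sinusoids) and an additive fusion that standardizes each branch before summing, so neither stream's scale dominates. The exact fusion used in the paper may differ; this is a minimal NumPy illustration with hypothetical shapes.

```python
import numpy as np

def time2vec(t, omega, phi):
    """Time2Vec: component 0 is linear in t; components 1..k are sinusoids
    with learnable frequency omega[i] and phase phi[i]."""
    t = np.asarray(t, dtype=float)[:, None]    # (T, 1)
    raw = t * omega[None, :] + phi[None, :]    # (T, k+1), affine in t
    out = np.sin(raw)
    out[:, 0] = raw[:, 0]                      # keep the first component linear
    return out

def normalized_add(a, b, eps=1e-8):
    """Additive fusion of two feature streams after standardizing each one
    along the feature axis, aligning their latent distributions."""
    za = (a - a.mean(axis=-1, keepdims=True)) / (a.std(axis=-1, keepdims=True) + eps)
    zb = (b - b.mean(axis=-1, keepdims=True)) / (b.std(axis=-1, keepdims=True) + eps)
    return za + zb
```

The periodic components let the model represent phase shifts in sEMG bursts without a fixed sampling grid, which is the "temporal warping" argument made in the abstract.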
Long-term satellite image time series (SITS) analysis in heterogeneous landscapes faces significant challenges, particularly in Mediterranean regions where complex spatial patterns, seasonal variations, and multi-decade environmental changes interact across different scales. This paper presents the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF), an extended framework that advances beyond purely temporal analysis to integrate spatial context modeling with temporal sequence prediction. STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture, capturing both local neighborhood relationships and regional climate influences. The framework employs comprehensive self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies, enabling robust model training from 40 years of unlabeled Landsat imagery. Unlike autoregressive approaches, STT-LTF directly predicts arbitrary future time points without error accumulation, incorporating spatial patch embeddings, cyclical temporal encoding, and geographic coordinates to learn complex dependencies across heterogeneous Mediterranean ecosystems. Experimental evaluation on Landsat data (1984-2024) demonstrates that STT-LTF achieves a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers. The framework's ability to handle irregular temporal sampling and variable prediction horizons makes it particularly suitable for analysis of heterogeneous landscapes experiencing rapid ecological transitions.
https://arxiv.org/abs/2602.01799
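The abstract's evaluation metrics (MAE and R^2) and its masked self-supervised pretraining both have standard definitions worth pinning down. The snippet below is a generic NumPy sketch of those definitions, not the paper's code; the timestep-masking helper is a simplified stand-in for the temporal-masking strategy described.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mask_timesteps(seq, mask_ratio, rng):
    """Zero out a random subset of timesteps of a (T, F) sequence; the model
    is then trained to reconstruct the masked entries. Returns (masked, mask)."""
    mask = rng.random(seq.shape[0]) < mask_ratio
    out = seq.copy()
    out[mask] = 0.0
    return out, mask
```

Note that R^2 can be negative when predictions are worse than the mean baseline, so a value of 0.8412 on 40 years of held-out imagery is a strong result by this definition.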
Agroecosystems, which are heavily influenced by human actions and account for a quarter of global greenhouse gas (GHG) emissions, play a crucial role in mitigating global climate change and securing environmental sustainability. However, we cannot manage what we cannot measure. Accurately quantifying the pools and fluxes in the carbon, nutrient, and water nexus of agroecosystems is therefore essential for understanding the underlying drivers of GHG emissions and for developing effective mitigation strategies. Conventional approaches such as soil sampling, process-based models, and black-box machine learning models face challenges including data sparsity, high spatiotemporal heterogeneity, and complex subsurface biogeochemical and physical processes. Developing new trustworthy approaches, such as AI-empowered models, will require AI-ready benchmark datasets and clearly outlined protocols, which unfortunately do not yet exist. In this work, we introduce a first-of-its-kind spatial-temporal agroecosystem GHG benchmark dataset that integrates physics-based model simulations from Ecosys and DayCent with real-world observations from eddy covariance flux towers and controlled-environment facilities. We evaluate the performance of various sequential deep learning models on carbon and nitrogen flux prediction, including LSTM-based models, a temporal CNN-based model, and Transformer-based models. Furthermore, we explore transfer learning to leverage simulated data to improve the generalization of deep learning models on real-world observations. Our benchmark dataset and evaluation framework contribute to the development of more accurate and scalable AI-driven agroecosystem models, advancing our understanding of ecosystem-climate interactions.
https://arxiv.org/abs/2602.01614
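All of the sequential models benchmarked above (LSTM, temporal CNN, Transformer) consume the same kind of input: fixed-length windows sliced from a long flux time series, paired with a target some horizon ahead. A minimal windowing helper in NumPy, with hypothetical names and a first-feature target convention chosen for illustration, might look like:

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a (T, F) multivariate series into supervised pairs:
    inputs of length `lookback`, target = feature 0 at `horizon` steps
    after the window ends. Returns X of shape (N, lookback, F), y of (N,)."""
    X, y = [], []
    for start in range(len(series) - lookback - horizon + 1):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback + horizon - 1, 0])
    return np.stack(X), np.array(y)
```

Under a transfer-learning setup like the one explored in the paper, the same windowing would be applied to both the simulated (Ecosys/DayCent) series used for pretraining and the flux-tower observations used for fine-tuning, so the two domains share an input format.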