The prediction of foreign exchange rates, such as the US Dollar (USD) to Bangladeshi Taka (BDT), plays a pivotal role in global financial markets, influencing trade, investments, and economic stability. This study leverages historical USD/BDT exchange rate data from 2018 to 2023, sourced from Yahoo Finance, to develop advanced machine learning models for accurate forecasting. A Long Short-Term Memory (LSTM) neural network is employed, achieving an exceptional accuracy of 99.449%, a Root Mean Square Error (RMSE) of 0.9858, and a test loss of 0.8523, significantly outperforming traditional methods like ARIMA (RMSE 1.342). Additionally, a Gradient Boosting Classifier (GBC) is applied for directional prediction, with backtesting on an initial capital of $10,000 revealing a 40.82% profitable trade rate, though resulting in a net loss of $20,653.25 over 49 trades. The study analyzes historical trends, showing a decline in the BDT/USD rate from 0.012 to 0.009, and incorporates normalized daily returns to capture volatility. These findings highlight the potential of deep learning in forex forecasting, offering traders and policymakers robust tools to mitigate risks. Future work could integrate sentiment analysis and real-time economic indicators to further enhance model adaptability in volatile markets.
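A minimal sketch of the kind of LSTM forecaster this abstract describes, not the authors' code: a univariate next-day predictor over a min-max-normalised daily rate series. The window size, layer sizes, training settings, and the synthetic placeholder series are illustrative assumptions.

```python
# Sketch of an LSTM next-day exchange-rate forecaster (illustrative, not the paper's code).
import numpy as np
import torch
import torch.nn as nn

def make_windows(series, window=30):
    """Turn a 1-D price series into (window, 1) inputs and next-day targets."""
    xs, ys = [], []
    for i in range(len(series) - window):
        xs.append(series[i:i + window])
        ys.append(series[i + window])
    x = torch.tensor(np.array(xs), dtype=torch.float32).unsqueeze(-1)
    y = torch.tensor(np.array(ys), dtype=torch.float32).unsqueeze(-1)
    return x, y

class RateLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, window, hidden)
        return self.head(out[:, -1])   # predict the next day's rate

# Placeholder series; in practice this would be the 2018-2023 USD/BDT closes
# from Yahoo Finance, min-max normalised before windowing.
prices = np.linspace(84.0, 110.0, 1500) + np.random.randn(1500) * 0.3
prices = (prices - prices.min()) / (prices.max() - prices.min())
x, y = make_windows(prices)

model = RateLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(5):                     # a few full-batch epochs, for illustration only
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
rmse = float(torch.sqrt(loss_fn(model(x), y)))
print(f"in-sample RMSE (normalised units): {rmse:.4f}")
```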
https://arxiv.org/abs/2506.09851
Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained action classes across 33 hours of video, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control and improve action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at this https URL.
https://arxiv.org/abs/2506.09650
Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. It employs a dual-branch architecture, where spectral features are extracted using STFT, and visual embeddings are obtained via a visual encoder. These features are then fused and processed by a CNN-BLSTM with attention, followed by multi-task learning to simultaneously predict PESQ and STOI. Evaluations on the LRS3-TED dataset, augmented with noise from the DEMAND corpus, show that our model outperforms the audio-only baseline. Under seen noise conditions, it improves LCC by 9.61% (0.8397->0.9205) for PESQ and 11.47% (0.7403->0.8253) for STOI. These results highlight the effectiveness of incorporating visual cues in enhancing the accuracy of non-intrusive speech assessment.
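A rough PyTorch sketch of the dual-branch, multi-task architecture the abstract outlines: a small CNN over spectrogram frames, a projection of precomputed visual embeddings, fusion into a bidirectional LSTM, attention pooling, and separate PESQ and STOI regression heads. Feature sizes, the visual encoder, and the fusion scheme are assumptions, not the paper's exact design.

```python
# Illustrative dual-branch audio-visual quality predictor (assumed dimensions throughout).
import torch
import torch.nn as nn

class AVQualityNet(nn.Module):
    def __init__(self, n_freq=257, vis_dim=512, hidden=128):
        super().__init__()
        self.audio_cnn = nn.Sequential(              # spectral branch on |STFT| frames
            nn.Conv1d(n_freq, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.vis_proj = nn.Linear(vis_dim, 128)      # embeddings from a pretrained visual encoder
        self.blstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)         # frame-level attention pooling
        self.pesq_head = nn.Linear(2 * hidden, 1)    # multi-task heads
        self.stoi_head = nn.Linear(2 * hidden, 1)

    def forward(self, spec, vis):
        # spec: (B, T, n_freq) magnitude spectrogram; vis: (B, T, vis_dim) visual embeddings
        a = self.audio_cnn(spec.transpose(1, 2)).transpose(1, 2)   # (B, T, 128)
        v = self.vis_proj(vis)                                     # (B, T, 128)
        h, _ = self.blstm(torch.cat([a, v], dim=-1))               # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)                     # attention weights over frames
        pooled = (w * h).sum(dim=1)                                # utterance-level summary
        return self.pesq_head(pooled).squeeze(-1), self.stoi_head(pooled).squeeze(-1)

model = AVQualityNet()
spec, vis = torch.randn(4, 200, 257), torch.randn(4, 200, 512)
pesq_pred, stoi_pred = model(spec, vis)
print(pesq_pred.shape, stoi_pred.shape)   # torch.Size([4]) torch.Size([4])
```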
https://arxiv.org/abs/2506.09549
The extent to which neural networks are able to acquire and represent symbolic rules remains a key topic of research and debate. Much current work focuses on the impressive capabilities of large language models, as well as their often ill-understood failures on a wide range of reasoning tasks. In this paper, in contrast, we investigate the generalization behavior of three key neural architectures (Transformers, Graph Convolution Networks and LSTMs) in a controlled task rooted in propositional logic. The task requires models to generate satisfying assignments for logical formulas, making it a structured and interpretable setting for studying compositionality. We introduce a balanced extension of an existing dataset to eliminate superficial patterns and enable testing on unseen operator combinations. Using this dataset, we evaluate the ability of the three architectures to generalize beyond the training distribution. While all models perform well in-distribution, we find that generalization to unseen patterns, particularly those involving negation, remains a significant challenge. Transformers fail to apply negation compositionally, unless structural biases are introduced. Our findings highlight persistent limitations in the ability of standard architectures to learn systematic representations of logical operators, suggesting the need for stronger inductive biases to support robust rule-based reasoning.
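To make the task format concrete, here is a small, self-contained sketch of how a generated assignment can be checked exactly against a propositional formula, and how the full set of satisfying assignments can be enumerated. The formula encoding (a Python boolean expression) and the variable names are illustrative, not the dataset's actual format.

```python
# Checking a model's generated assignment against a propositional formula (illustrative format).
import itertools

def satisfies(formula, assignment):
    """formula: Python boolean expression over variables a, b, c; assignment: dict of truth values."""
    return bool(eval(formula, {}, assignment))   # eval is fine here for a controlled sketch

def all_models(formula, variables=("a", "b", "c")):
    """Enumerate every satisfying assignment (the gold set a prediction is scored against)."""
    models = []
    for values in itertools.product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(formula, assignment):
            models.append(assignment)
    return models

formula = "(a or not b) and not (a and c)"          # contains the negation patterns at issue
prediction = {"a": True, "b": True, "c": False}     # a hypothetical model output
print(satisfies(formula, prediction))               # True -> counts as a correct generation
print(len(all_models(formula)))                     # 4 satisfying assignments in total
```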
https://arxiv.org/abs/2506.08978
The increasing complexity of AI models requires flexible hardware capable of supporting diverse precision formats, particularly for energy-constrained edge platforms. This work presents PARV-CE, a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations using a unified data-path for 4/8/16-bit fixed-point, floating-point, and posit formats. The architecture incorporates a layer-adaptive precision strategy to align computational accuracy with workload sensitivity, optimizing both performance and energy usage. PARV-CE integrates quantization-aware execution with a reconfigurable SIMD pipeline, enabling high-throughput processing with minimal overhead through hardware-software co-design. The results demonstrate up to 2x improvement in PDP and 3x reduction in resource usage compared to SoTA designs, while retaining accuracy within 1.8% of the FP32 baseline. The architecture supports both on-device training and inference across a range of workloads, including DNNs, RNNs, RL, and Transformer models. The empirical analysis establishes the PARV-CE-equipped POLARON as a scalable and energy-efficient solution for precision-adaptive AI acceleration at the edge.
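The fixed-point side of a multi-precision MAC can be illustrated with a short software behavioural model: quantize operands to a chosen bit-width, accumulate in a wide integer, and rescale. This is only a numerical illustration of the idea of one datapath serving several widths; the fraction-bit splits are assumptions, and it says nothing about the PARV-CE RTL, its floating-point or posit paths.

```python
# Behavioural model of a fixed-point multiply-accumulate at several bit-widths (illustrative).
import numpy as np

def quantize(x, bits, frac_bits):
    """Symmetric fixed-point quantization to `bits` total bits with `frac_bits` fraction bits."""
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int64)

def mac_fixed(acts, weights, bits=8, frac_bits=4):
    """Quantize activations and weights, accumulate in a wide integer, then rescale."""
    qa = quantize(acts, bits, frac_bits)
    qw = quantize(weights, bits, frac_bits)
    acc = np.sum(qa * qw)                       # wide accumulator, as hardware would keep
    return acc / (2 ** (2 * frac_bits))         # back to a real-valued result

rng = np.random.default_rng(0)
acts, weights = rng.normal(size=64), rng.normal(size=64)
exact = float(np.dot(acts, weights))
for bits, frac in [(4, 2), (8, 4), (16, 8)]:    # the three fixed-point widths mentioned above
    approx = mac_fixed(acts, weights, bits, frac)
    print(f"{bits:2d}-bit MAC: {approx:8.4f}  (exact {exact:8.4f})")
```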
https://arxiv.org/abs/2506.08785
Memory and long-range temporal processing are core requirements for sequence modeling tasks across natural language processing, time-series forecasting, speech recognition, and control. While nonlinear recurrence has long been viewed as essential for enabling such mechanisms, recent work suggests that linear dynamics may often suffice. In this study, we go beyond performance comparisons to systematically dissect the functional role of nonlinearity in recurrent networks--identifying both when it is computationally necessary, and what mechanisms it enables. We use Almost Linear Recurrent Neural Networks (AL-RNNs), which allow fine-grained control over nonlinearity, as both a flexible modeling tool and a probe into the internal mechanisms of memory. Across a range of classic sequence modeling tasks and a real-world stimulus selection task, we find that minimal nonlinearity is not only sufficient but often optimal, yielding models that are simpler, more robust, and more interpretable than their fully nonlinear or linear counterparts. Our results provide a principled framework for selectively introducing nonlinearity, bridging dynamical systems theory with the functional demands of long-range memory and structured computation in recurrent neural networks, with implications for both artificial and biological neural systems.
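A toy version of the "almost linear" idea: a recurrent cell in which only a small, fixed subset of hidden units passes through a nonlinearity while the rest evolve linearly. The sizes and the choice of ReLU are assumptions; the AL-RNN paper defines its own parameterisation.

```python
# Sketch of an almost-linear recurrent cell: only the first n_nonlinear units are nonlinear.
import torch
import torch.nn as nn

class AlmostLinearRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, n_nonlinear):
        super().__init__()
        self.n_nonlinear = n_nonlinear
        self.w_in = nn.Linear(input_size, hidden_size)
        self.w_rec = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h):
        pre = self.w_in(x) + self.w_rec(h)
        nonlinear = torch.relu(pre[..., : self.n_nonlinear])   # the few nonlinear units
        linear = pre[..., self.n_nonlinear:]                   # the remaining linear units
        return torch.cat([nonlinear, linear], dim=-1)

cell = AlmostLinearRNNCell(input_size=3, hidden_size=32, n_nonlinear=4)
h = torch.zeros(8, 32)
for t in range(50):                                            # unroll over a toy sequence
    h = cell(torch.randn(8, 3), h)
print(h.shape)   # torch.Size([8, 32])
```

Setting n_nonlinear to 0 or to hidden_size recovers the fully linear and fully nonlinear extremes the study compares against.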
https://arxiv.org/abs/2506.07919
Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNN, RNN, or ViT have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods including SNNs and GNNs are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, a powerful feature required to achieve better performance for optical flow estimation. In this work, we introduce the Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT- and CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computations compared to TMA and 2x lower computations compared to EV-FlowNet with competitive performance on the DSEC benchmark. Our code will be available at this https URL
https://arxiv.org/abs/2506.07878
Type 1 Diabetes (T1D) affects millions worldwide, requiring continuous monitoring to prevent severe hypo- and hyperglycemic events. While continuous glucose monitoring has improved blood glucose management, deploying predictive models on wearable devices remains challenging due to computational and memory constraints. To address this, we propose a novel Lightweight Sequential Transformer model designed for blood glucose prediction in T1D. By integrating the strengths of Transformers' attention mechanisms and the sequential processing of recurrent neural networks, our architecture captures long-term dependencies while maintaining computational efficiency. The model is optimized for deployment on resource-constrained edge devices and incorporates a balanced loss function to handle the inherent data imbalance in hypo- and hyperglycemic events. Experiments on two benchmark datasets, OhioT1DM and DiaTrend, demonstrate that the proposed model outperforms state-of-the-art methods in predicting glucose levels and detecting adverse events. This work fills the gap between high-performance modeling and practical deployment, providing a reliable and efficient T1D management solution.
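One ingredient the abstract names, a balanced loss for the rare hypo- and hyperglycemic ranges, can be sketched as a range-weighted regression loss. The 70/180 mg/dL thresholds and the weight value below are common clinical conventions used here as assumptions, not values taken from the paper.

```python
# Sketch of a range-weighted ("balanced") regression loss for glucose prediction.
import torch

def balanced_glucose_loss(pred, target, hypo=70.0, hyper=180.0, rare_weight=5.0):
    weights = torch.ones_like(target)
    weights[(target < hypo) | (target > hyper)] = rare_weight   # up-weight rare, dangerous ranges
    return torch.mean(weights * (pred - target) ** 2)

pred = torch.tensor([95.0, 60.0, 210.0, 140.0])
target = torch.tensor([100.0, 55.0, 220.0, 150.0])
print(balanced_glucose_loss(pred, target))   # larger penalty on the hypo/hyper samples
```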
https://arxiv.org/abs/2506.07864
Understanding the behavior of deep reinforcement learning (DRL) agents -- particularly as task and agent sophistication increase -- requires more than simple comparison of reward curves, yet standard methods for behavioral analysis remain underdeveloped in DRL. We apply tools from neuroscience and ethology to study DRL agents in a novel, complex, partially observable environment, ForageWorld, designed to capture key aspects of real-world animal foraging -- including sparse, depleting resource patches, predator threats, and spatially extended arenas. We use this environment as a platform for applying joint behavioral and neural analysis to agents, revealing detailed, quantitatively grounded insights into agent strategies, memory, and planning. Contrary to common assumptions, we find that model-free RNN-based DRL agents can exhibit structured, planning-like behavior purely through emergent dynamics -- without requiring explicit memory modules or world models. Our results show that studying DRL agents like animals -- analyzing them with neuroethology-inspired tools that reveal structure in both behavior and neural dynamics -- uncovers rich structure in their learning dynamics that would otherwise remain invisible. We distill these tools into a general analysis framework linking core behavioral and representational features to diagnostic methods, which can be reused for a wide range of tasks and agents. As agents grow more complex and autonomous, bridging neuroscience, cognitive science, and AI will be essential -- not just for understanding their behavior, but for ensuring safe alignment and maximizing desirable behaviors that are hard to measure via reward. We show how this can be done by drawing on lessons from how biological intelligence is studied.
https://arxiv.org/abs/2506.06981
Inertial-based motion capture systems have been attracting growing attention due to their wearability and unconstrained use. However, accurate human joint estimation demands several complex, expertise-demanding steps, which leads to expensive software such as the state-of-the-art MVN Awinda from Xsens Technologies. This work aims to study the use of neural networks to abstract the complex biomechanical models and analytical mathematics required for pose estimation. Thus, it presents a comparison of different neural network architectures and methodologies to understand how accurately these methods can estimate human pose, using both low-cost (MPU9250) and high-end (Mtw Awinda) Magnetic, Angular Rate, and Gravity (MARG) sensors. The most efficient method was the hybrid LSTM-Madgwick detached approach, which achieved a quaternion angle distance error of 7.96 using Mtw Awinda data. Also, an ablation study was conducted to assess the impact of data augmentation, output representation, window size, loss function, and magnetometer data on the pose estimation error. This work indicates that neural networks can be trained to estimate human pose, with results comparable to state-of-the-art fusion filters.
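The reported error metric can be written in a few lines; the sketch below shows one standard way to compute a quaternion angle distance between an estimated and a reference orientation (the rotation angle of the relative quaternion). The degree units and the sign-invariant formula are common conventions and an assumption about the paper's exact definition.

```python
# Standard quaternion angle-distance metric between two orientations (illustrative).
import numpy as np

def quaternion_angle_distance(q_est, q_ref):
    """q = (w, x, y, z); returns the angle in degrees between the two orientations."""
    q_est = q_est / np.linalg.norm(q_est)
    q_ref = q_ref / np.linalg.norm(q_ref)
    dot = min(1.0, abs(float(np.dot(q_est, q_ref))))   # |cos(theta/2)|, sign-invariant
    return np.degrees(2.0 * np.arccos(dot))

identity = np.array([1.0, 0.0, 0.0, 0.0])
ten_deg = np.array([np.cos(np.radians(5)), np.sin(np.radians(5)), 0.0, 0.0])  # 10 deg about x
print(quaternion_angle_distance(identity, ten_deg))   # ~10.0
```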
https://arxiv.org/abs/2506.06850
Long Short-Term Memory (LSTM) neural network models have become the cornerstone for sequential data modeling in numerous applications, ranging from natural language processing to time series forecasting. Despite their success, the problem of model selection, including hyperparameter tuning, architecture specification, and regularization choice, remains largely heuristic and computationally expensive. In this paper, we propose a unified statistical framework for systematic model selection in LSTM networks. Our framework extends classical model selection ideas, such as information criteria and shrinkage estimation, to sequential neural networks. We define penalized likelihoods adapted to temporal structures, propose a generalized threshold approach for hidden state dynamics, and provide efficient estimation strategies using variational Bayes and approximate marginal likelihood methods. Several biomedical data-centric examples demonstrate the flexibility and improved performance of the proposed framework.
https://arxiv.org/abs/2506.06840
Infertility has a considerable impact on individuals' quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the most optimal for transfer, yet this process is time-consuming and lacks efficiency. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.
https://arxiv.org/abs/2506.06680
Automated driving (AD) has substantially improved vehicle safety and driving comfort, but its impact on passenger well-being, particularly infant sleep, is not sufficiently studied. Sudden acceleration, abrupt braking, and sharp maneuvers can disrupt infant sleep, compromising both passenger comfort and parental convenience. To solve this problem, this paper explores the integration of reinforcement learning (RL) within AD to personalize driving behavior and optimally balance occupant comfort and travel efficiency. In particular, we propose an intelligent cruise control framework that adapts to varying driving conditions to enhance infant sleep quality by effectively synergizing wearable sensing and vehicle data. Long short-term memory (LSTM) and transformer-based neural networks are integrated with RL to model the relationship between driving behavior and infant sleep quality under diverse traffic and road conditions. Based on the sleep quality indicators from the wearable sensors, driving action data from vehicle controllers, and map data from map applications, the model dynamically computes the optimal driving aggressiveness level, which is subsequently translated into specific AD control strategies, e.g., the magnitude and frequency of acceleration, lane change, and overtaking. Simulation results demonstrate that the proposed solution significantly improves infant sleep quality compared to baseline methods, while preserving desirable travel efficiency.
https://arxiv.org/abs/2506.06459
Predicting future states of dynamic agents is a fundamental task in autonomous driving. An expressive representation for this purpose is Occupancy Flow Fields, which provide a scalable and unified format for modeling motion, spatial extent, and multi-modal future distributions. While recent methods have achieved strong results using this representation, they often depend on high-quality vectorized inputs, which are unavailable or difficult to generate in practice, and the use of transformer-based architectures, which are computationally intensive and costly to deploy. To address these issues, we propose \textbf{Coupled Convolutional LSTM (CCLSTM)}, a lightweight, end-to-end trainable architecture based solely on convolutional operations. Without relying on vectorized inputs or self-attention mechanisms, CCLSTM effectively captures temporal dynamics and spatial occupancy-flow correlations using a compact recurrent convolutional structure. Despite its simplicity, CCLSTM achieves state-of-the-art performance on occupancy flow metrics and, as of this submission, ranks \(1^{\text{st}}\) in all metrics on the 2024 Waymo Occupancy and Flow Prediction Challenge leaderboard.
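A single convolutional LSTM cell, the building block CCLSTM is based on, can be written compactly: all four gates come from one convolution over the concatenated input and hidden state. Channel counts and kernel size below are illustrative; the paper's coupled, end-to-end architecture adds further structure on top of cells like this.

```python
# Sketch of a single convolutional LSTM cell (the recurrent convolutional primitive).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                 # cell state keeps spatial occupancy-flow memory
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=8, hid_ch=16)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
for t in range(5):                        # roll the cell over a short BEV feature sequence
    h, c = cell(torch.randn(2, 8, 64, 64), (h, c))
print(h.shape)   # torch.Size([2, 16, 64, 64])
```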
https://arxiv.org/abs/2506.06128
Recent advancements in robot navigation, especially with end-to-end learning approaches like reinforcement learning (RL), have shown remarkable efficiency and effectiveness. Yet, successful navigation still relies on two key capabilities: mapping and planning, whether explicit or implicit. Classical approaches use explicit mapping pipelines to register ego-centric observations into a coherent map frame for the planner. In contrast, end-to-end learning achieves this implicitly, often through recurrent neural networks (RNNs) that fuse current and past observations into a latent space for planning. While architectures such as LSTM and GRU capture temporal dependencies, our findings reveal a key limitation: their inability to perform effective spatial memorization. This skill is essential for transforming and integrating sequential observations from varying perspectives to build spatial representations that support downstream planning. To address this, we propose Spatially-Enhanced Recurrent Units (SRUs), a simple yet effective modification to existing RNNs, designed to enhance spatial memorization capabilities. We introduce an attention-based architecture with SRUs, enabling long-range navigation using a single forward-facing stereo camera. Regularization techniques are employed to ensure robust end-to-end recurrent training via RL. Experimental results show our approach improves long-range navigation by 23.5% compared to existing RNNs. Furthermore, with SRU memory, our method outperforms the RL baseline with explicit mapping and memory modules, achieving a 29.6% improvement in diverse environments requiring long-horizon mapping and memorization. Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-shot transfer to diverse and complex real-world environments.
https://arxiv.org/abs/2506.05997
We present a method for 3D ball trajectory estimation from a 2D tracking sequence. To overcome the ambiguity in 3D from 2D estimation, we design an LSTM-based pipeline that utilizes a novel canonical 3D representation that is independent of the camera's location to handle arbitrary views and a series of intermediate representations that encourage crucial invariance and reprojection consistency. We evaluated our method on four synthetic and three real datasets and conducted extensive ablation studies on our design choices. Despite training solely on simulated data, our method achieves state-of-the-art performance and can generalize to real-world scenarios with multiple trajectories, opening up a range of applications in sport analysis and virtual replay. Please visit our page: this https URL.
https://arxiv.org/abs/2506.05763
The COVID-19 pandemic's severe impact highlighted the need for accurate, timely hospitalization forecasting to support effective healthcare planning. However, most forecasting models struggled, especially during variant surges, when they were needed most. This study introduces a novel Long Short-Term Memory (LSTM) framework for forecasting daily state-level incident hospitalizations in the United States. We present a spatiotemporal feature, Social Proximity to Hospitalizations (SPH), derived from Facebook's Social Connectedness Index to improve forecasts. SPH serves as a proxy for interstate population interaction, capturing transmission dynamics across space and time. Our parallel LSTM architecture captures both short- and long-term temporal dependencies, and our multi-horizon ensembling strategy balances consistency and forecasting error. Evaluation against COVID-19 Forecast Hub ensemble models during the Delta and Omicron surges reveals superiority of our model. On average, our model surpasses the ensemble by 27, 42, 54, and 69 hospitalizations per state on the $7^{th}$, $14^{th}$, $21^{st}$, and $28^{th}$ forecast days, respectively, during the Omicron surge. Data-ablation experiments confirm SPH's predictive power, highlighting its effectiveness in enhancing forecasting models. This research not only advances hospitalization forecasting but also underscores the significance of spatiotemporal features, such as SPH, in refining predictive performance in modeling the complex dynamics of infectious disease spread.
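The SPH feature can be pictured as a connectedness-weighted average of other states' hospitalizations, which is then fed to the forecaster alongside each state's own counts. The toy connectedness matrix and row normalisation below are assumptions to convey the idea, not the paper's exact recipe for deriving SPH from the Social Connectedness Index.

```python
# Sketch of a Social-Proximity-to-Hospitalizations style feature (illustrative construction).
import numpy as np

def social_proximity_to_hosp(hospitalizations, connectedness):
    """
    hospitalizations: (n_states,) incident hospitalizations on a given day
    connectedness:    (n_states, n_states) Social-Connectedness-Index-like weights
    returns:          (n_states,) SPH value per state
    """
    weights = connectedness / connectedness.sum(axis=1, keepdims=True)  # row-normalise
    return weights @ hospitalizations

rng = np.random.default_rng(1)
n_states = 5
sci = rng.uniform(0.1, 1.0, size=(n_states, n_states))
np.fill_diagonal(sci, 5.0)                        # each state is most connected to itself
hosp_today = rng.poisson(lam=200, size=n_states).astype(float)
print(social_proximity_to_hosp(hosp_today, sci))  # one SPH value per state, per day
```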
https://arxiv.org/abs/2506.05752
In the wake of the COVID-19 pandemic, the educational paradigm has shifted from traditional in-person learning to online platforms. This change in learning convention has affected teacher-student interaction, especially non-verbal communication. The absence of non-verbal cues has led to a reliance on verbal feedback, which diminishes the efficacy of the educational experience. This paper explores the integration of sentiment analysis into learning management systems (LMS) to bridge the student-teacher gap by offering an alternative approach to interpreting student feedback beyond its verbal content. The research involves data preparation, feature selection, and the development of a deep neural network model encompassing word embedding, LSTM, and attention mechanisms. This model is compared against a logistic regression baseline to evaluate its efficacy in understanding student feedback. The study aims to bridge the communication gap between instructors and students in online learning environments, offering insights into the emotional context of student feedback and ultimately improving the quality of online education.
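The described model (word embedding, LSTM, attention pooling, binary sentiment output) can be sketched in a few lines of PyTorch. Vocabulary size, embedding and hidden dimensions, and the binary-label setup are assumptions for illustration.

```python
# Sketch of an embedding -> LSTM -> attention -> sigmoid feedback-sentiment classifier.
import torch
import torch.nn as nn

class FeedbackSentiment(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, 1)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))            # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)        # which tokens carry the sentiment
        pooled = (w * h).sum(dim=1)
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

model = FeedbackSentiment()
tokens = torch.randint(1, 5000, (8, 40))              # a batch of tokenised feedback comments
print(model(tokens))                                  # probabilities of positive sentiment
```

A bag-of-words logistic regression over the same tokens serves as the baseline comparison.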
https://arxiv.org/abs/2506.05490
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
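To make the in-context regression view concrete, here is a minimal NumPy sketch of a Mesa-style recurrence: at every time step the layer outputs the prediction of the ridge regression fitted to all key/value pairs seen so far, with the linear solve done by a plain conjugate-gradient loop. The dimensions, the ridge strength, and the sequential (non-chunked) form are assumptions for illustration, not the paper's chunkwise-parallel implementation.

```python
# Sketch of "optimal test-time regression" per step: output = (sum v k^T) (sum k k^T + lam I)^-1 q.
import numpy as np

def conjugate_gradient(A, b, iters=20, tol=1e-10):
    """Plain CG for a symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(iters):
        rs = r @ r
        if rs < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        p = r + ((r @ r) / rs) * p
    return x

def mesa_like_layer(keys, values, queries, lam=1e-2):
    d = keys.shape[1]
    S = lam * np.eye(d)                 # running sum of k k^T plus the ridge term
    C = np.zeros((values.shape[1], d))  # running sum of v k^T
    outputs = []
    for k, v, q in zip(keys, values, queries):
        S += np.outer(k, k)
        C += np.outer(v, k)
        h = conjugate_gradient(S, q)    # solve S h = q at this time step
        outputs.append(C @ h)           # prediction of the current least-squares fit
    return np.array(outputs)

rng = np.random.default_rng(0)
T, d = 32, 8
keys, queries = rng.normal(size=(T, d)), rng.normal(size=(T, d))
values = keys @ rng.normal(size=(d, 4))              # values linearly related to keys
print(mesa_like_layer(keys, values, queries).shape)  # (32, 4)
```

The extra inference-time flops the abstract mentions correspond to the per-step CG iterations in this sketch.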
https://arxiv.org/abs/2506.05233
Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy, an information-theoretic measure derived from average lossy-context surprisal, which captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
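As a concrete illustration of the quantity involved, the sketch below estimates an $m$-local-entropy-like measure from n-gram counts: the conditional entropy (in bits) of the next symbol given the previous $m-1$ symbols. It follows the informal description above rather than the paper's exact lossy-context-surprisal derivation, and the toy corpora are assumptions.

```python
# Empirical n-gram estimate of the conditional entropy of the next symbol given m-1 predecessors.
from collections import Counter
import math
import random

def m_local_entropy(corpus, m):
    context_counts, ngram_counts = Counter(), Counter()
    for i in range(len(corpus) - m + 1):
        ngram = tuple(corpus[i:i + m])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    total = sum(ngram_counts.values())
    h = 0.0
    for ngram, c in ngram_counts.items():
        p_joint = c / total
        p_cond = c / context_counts[ngram[:-1]]      # how well the context disambiguates
        h -= p_joint * math.log2(p_cond)             # average surprisal of the next symbol
    return h

predictable = "abab" * 200                                          # context fully determines the next symbol
noisy = "".join(random.Random(0).choice("ab") for _ in range(800))  # context tells you nothing
print(m_local_entropy(predictable, m=2))   # close to 0 bits -> easy local structure
print(m_local_entropy(noisy, m=2))         # close to 1 bit  -> hard local structure
```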
https://arxiv.org/abs/2506.05136