Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
https://arxiv.org/abs/2505.17010
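A minimal sketch of the soft-prefix idea discussed above, assuming a toy PyTorch setup: a short sequence of learnable real-valued vectors is prepended to the embedded input of a frozen sequence model and optimized directly, so only the prefix changes while all weights stay fixed. The tiny LSTM language model, dimensions, and training loop are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of soft prefix-tuning on a frozen toy LSTM LM (assumed setup, PyTorch).
import torch
import torch.nn as nn

VOCAB, D_MODEL, PREFIX_LEN = 100, 32, 8

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.rnn = nn.LSTM(D_MODEL, D_MODEL, batch_first=True)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward_from_embeddings(self, x):          # x: (B, T, D_MODEL)
        h, _ = self.rnn(x)
        return self.head(h)                        # (B, T, VOCAB)

model = TinyLM()
for p in model.parameters():                       # weights stay frozen;
    p.requires_grad_(False)                        # only the prefix is tuned

# The "soft prefix": real-valued vectors living outside the token alphabet.
prefix = nn.Parameter(0.02 * torch.randn(PREFIX_LEN, D_MODEL))
opt = torch.optim.Adam([prefix], lr=1e-2)

def prefix_tuning_step(tokens, targets):
    """tokens, targets: (B, T) integer tensors from the target task."""
    emb = model.embed(tokens)                      # (B, T, D_MODEL)
    pre = prefix.unsqueeze(0).expand(tokens.size(0), -1, -1)
    logits = model.forward_from_embeddings(torch.cat([pre, emb], dim=1))
    logits = logits[:, PREFIX_LEN:]                # ignore predictions at prefix positions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

toks = torch.randint(0, VOCAB, (4, 16))            # random stand-in data
print(prefix_tuning_step(toks[:, :-1], toks[:, 1:]))
```

Because the prefix lives in embedding space rather than the token alphabet, it can reach activation patterns that no sequence of hard tokens produces, which is the mechanistic point the abstract highlights.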
Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes an RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside the original capacity, voltage, and current signals, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Network (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.
https://arxiv.org/abs/2505.16664
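The preprocessing idea described above (per-cycle summary statistics plus delta features that capture differences between the current and previous cycle) can be sketched with pandas. The column names, the rolling-median denoising, and the toy data below are assumptions for illustration and do not reproduce the authors' exact pipeline.

```python
# Assumed per-cycle feature pipeline with delta features (pandas sketch).
import numpy as np
import pandas as pd

def cycle_features(df, window=5):
    """df: one row per measurement, columns ['cycle', 'voltage', 'current', 'capacity']."""
    stats = (df.groupby("cycle")[["voltage", "current", "capacity"]]
               .agg(["mean", "std", "min", "max"]))
    stats.columns = ["_".join(c) for c in stats.columns]
    # light denoising across cycles via a rolling median
    stats = stats.rolling(window, min_periods=1).median()
    # delta features: difference between the current and the previous cycle
    deltas = stats.diff().add_prefix("delta_").fillna(0.0)
    return pd.concat([stats, deltas], axis=1)

# toy usage on synthetic data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cycle": np.repeat(np.arange(20), 50),
    "voltage": rng.normal(3.7, 0.05, 1000),
    "current": rng.normal(1.0, 0.02, 1000),
    "capacity": np.repeat(np.linspace(2.0, 1.6, 20), 50) + rng.normal(0, 0.01, 1000),
})
print(cycle_features(toy).head())
```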
Computational Fluid Dynamics (CFD) is the main approach to analyzing flow fields. However, its convergence and accuracy depend largely on the mathematical flow model and the numerical methods used, and the simulations are time-consuming. Deep learning-based analysis of the flow field provides an alternative. For the task of flow field prediction, an improved Convolutional Long Short-Term Memory (ConvLSTM) Neural Network is proposed as the baseline network in consideration of the temporal and spatial characteristics of the flow field. Combining dynamic mesh technology and a User-Defined Function (UDF), numerical simulations of flow around a circular cylinder were conducted. Flow field snapshots were used to sample data from the cylinder's wake region at different time instants, constructing a flow field dataset with sufficient volume and rich flow state variations. Residual networks and attention mechanisms are combined with the standard ConvLSTM model. Compared with the standard ConvLSTM model, the results demonstrate that the improved ConvLSTM model can extract more temporal and spatial features while having fewer parameters and shorter training time.
https://arxiv.org/abs/2505.15533
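For readers unfamiliar with the baseline, a minimal ConvLSTM cell looks roughly as follows; the residual connections and attention of the improved model are not reproduced here, and the channel counts and kernel size are assumptions.

```python
# Minimal ConvLSTM cell (PyTorch sketch of the standard baseline, not the improved model).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution producing the input, forget, output and candidate gates
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:
            state = (x.new_zeros(b, self.hid_ch, h, w),
                     x.new_zeros(b, self.hid_ch, h, w))
        h_prev, c_prev = state
        i, f, o, g = self.gates(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_next = torch.sigmoid(o) * torch.tanh(c)
        return h_next, (h_next, c)

# roll the cell over a (batch, time, channels, H, W) flow-field sequence
cell = ConvLSTMCell(in_ch=2, hid_ch=16)
seq = torch.randn(4, 10, 2, 32, 32)
state, outputs = None, []
for t in range(seq.size(1)):
    out, state = cell(seq[:, t], state)
    outputs.append(out)
print(torch.stack(outputs, dim=1).shape)  # (4, 10, 16, 32, 32)
```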
Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.
https://arxiv.org/abs/2505.14505
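A hedged sketch of the general pattern described above: a lightweight modality encoder or projector maps non-text features into the language model's embedding space, and the resulting pseudo-tokens are prepended to the text embeddings before the backbone. The MLP projector, dimensions, and generic tensors below are assumptions and do not reproduce the RWKV7-specific implementation.

```python
# Assumed modality-adapter pattern: project encoder features into the LLM
# embedding space and prepend them to the text embeddings (PyTorch sketch).
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, feat_dim=768, d_model=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, feats):              # feats: (B, N, feat_dim) from any encoder
        return self.proj(feats)            # (B, N, d_model) pseudo-token embeddings

d_model = 1024
adapter = ModalityAdapter(feat_dim=768, d_model=d_model)
text_embeds = torch.randn(2, 32, d_model)      # embedded text tokens
image_feats = torch.randn(2, 49, 768)          # e.g. patch features from a vision encoder
llm_inputs = torch.cat([adapter(image_feats), text_embeds], dim=1)
print(llm_inputs.shape)                        # (2, 81, 1024), fed to the LLM backbone
```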
With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 precision (P) of 38.00%, which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at this https URL.
https://arxiv.org/abs/2505.13936
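A rough sketch of the encoder-decoder shape described above: a bidirectional LSTM runs over EEG feature sequences and its outputs serve as memory for an autoregressive transformer decoder. The dimensions and the use of a plain nn.TransformerDecoder (rather than a specific pretrained checkpoint) are illustrative assumptions.

```python
# Assumed encoder-decoder sketch: BiLSTM over EEG features, transformer decoder on top.
import torch
import torch.nn as nn

class EEGToTextSketch(nn.Module):
    def __init__(self, eeg_dim=840, d_model=512, vocab=30522):
        super().__init__()
        self.encoder = nn.LSTM(eeg_dim, d_model // 2, batch_first=True,
                               bidirectional=True)
        self.tok_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, eeg, tokens):
        memory, _ = self.encoder(eeg)                  # (B, T_eeg, d_model)
        tgt = self.tok_embed(tokens)                   # (B, T_txt, d_model)
        T = tokens.size(1)                             # causal mask for autoregressive decoding
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                       # (B, T_txt, vocab)

model = EEGToTextSketch()
logits = model(torch.randn(2, 56, 840), torch.randint(0, 30522, (2, 20)))
print(logits.shape)                                    # torch.Size([2, 20, 30522])
```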
Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese -- a language rich in diacritics and complex typography -- remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
https://arxiv.org/abs/2505.13235
Complex, temporally evolving phenomena, from climate to brain activity, are governed by dynamical systems (DS). DS reconstruction (DSR) seeks to infer generative surrogate models of these systems from observed data, reproducing their long-term behavior. Existing DSR approaches require purpose-training for any new system observed, lacking the zero-shot and in-context inference capabilities known from LLMs. Here we introduce DynaMix, a novel multivariate ALRNN-based mixture-of-experts architecture pre-trained for DSR, the first DSR model able to generalize zero-shot to out-of-domain DS. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail -- at a fraction of the number of parameters and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix's training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may hold great potential for advancing the TS prediction field as well.
https://arxiv.org/abs/2505.13192
Existing causal speech separation models often underperform compared to non-causal models due to difficulties in retaining historical information. To address this, we propose the Time-Frequency Attention Cache Memory (TFACM) model, which effectively captures spatio-temporal relationships through an attention mechanism and a cache memory (CM) for historical information storage. In TFACM, an LSTM layer captures frequency-relative positions, while causal modeling is applied to the time dimension using local and global representations. The CM module stores past information, and the causal attention refinement (CAR) module further enhances time-based feature representations for finer granularity. Experimental results show that TFACM achieves performance comparable to the SOTA TF-GridNet-Causal model, with significantly lower complexity and fewer trainable parameters. For more details, visit the project page: this https URL.
https://arxiv.org/abs/2505.13094
Recurrent neural networks (RNNs) trained on neuroscience-inspired tasks offer powerful models of brain computation. However, typical training paradigms rely on open-loop, supervised settings, whereas real-world learning unfolds in closed-loop environments. Here, we develop a mathematical theory describing the learning dynamics of linear RNNs trained in closed-loop contexts. We first demonstrate that two otherwise identical RNNs, trained in either closed- or open-loop modes, follow markedly different learning trajectories. To probe this divergence, we analytically characterize the closed-loop case, revealing distinct stages aligned with the evolution of the training loss. Specifically, we show that the learning dynamics of closed-loop RNNs, in contrast to open-loop ones, are governed by an interplay between two competing objectives: short-term policy improvement and long-term stability of the agent-environment interaction. Finally, we apply our framework to a realistic motor control task, highlighting its broader applicability. Taken together, our results underscore the importance of modeling closed-loop dynamics in a biologically plausible setting.
https://arxiv.org/abs/2505.13567
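A toy numerical illustration of the closed-loop setting, under assumptions that are ours rather than the paper's: the same linear RNN that could be trained on fixed input/target pairs (open loop) is instead trained while its output drives a simple unstable scalar environment whose state is fed back as the next input, so the loss couples short-term policy improvement to the stability of the agent-environment loop.

```python
# Toy closed-loop training of a linear RNN regulating an assumed unstable scalar environment.
import torch

torch.manual_seed(0)
# Linear RNN agent: h_{t+1} = W h_t + b s_t, control u_t = c . h_{t+1}.
W = torch.nn.Parameter(0.1 * torch.randn(8, 8))
b = torch.nn.Parameter(0.1 * torch.randn(8))
c = torch.nn.Parameter(0.1 * torch.randn(8))
opt = torch.optim.Adam([W, b, c], lr=1e-2)

def closed_loop_loss(T=30, a=1.05):
    s = torch.tensor(1.0)            # unstable environment: s_{t+1} = a * s_t + u_t
    h = torch.zeros(8)
    loss = 0.0
    for _ in range(T):
        h = W @ h + b * s            # the environment state is the agent's input
        u = c @ h                    # the agent's control output...
        s = a * s + u                # ...feeds back through the environment
        loss = loss + s ** 2         # regulate the state towards zero
    return loss / T

for step in range(300):
    opt.zero_grad()
    loss = closed_loop_loss()
    loss.backward()
    opt.step()
print(float(loss))                   # should shrink as the agent learns to stabilise the loop
```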
Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.
https://arxiv.org/abs/2505.12949
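A minimal sketch of one of the from-scratch approaches mentioned above: a BiLSTM sequence labeller that assigns a grammatical tag to each (already segmented) morpheme produced by an upstream segmenter. Vocabulary sizes, dimensions, and the random batch are assumptions.

```python
# Assumed BiLSTM morphological tagger over pre-segmented morphemes (PyTorch sketch).
import torch
import torch.nn as nn

class MorphemeTagger(nn.Module):
    def __init__(self, n_morphemes=5000, n_tags=60, emb=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(n_morphemes, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_tags)

    def forward(self, morpheme_ids):            # (B, T) ids from the upstream segmenter
        h, _ = self.lstm(self.embed(morpheme_ids))
        return self.out(h)                      # (B, T, n_tags) per-morpheme tag scores

tagger = MorphemeTagger()
batch = torch.randint(1, 5000, (8, 12))         # 8 words split into 12 morpheme slots
tags = torch.randint(0, 60, (8, 12))
loss = nn.functional.cross_entropy(tagger(batch).transpose(1, 2), tags)
print(float(loss))
```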
In today's digitally-driven world, the demand for personalized and context-aware recommendations has never been greater. Traditional recommender systems have made significant strides in this direction, but they often lack the ability to tap into the richness of conversational data. This paper represents a novel approach to recommendation systems by integrating conversational insights into the recommendation process. The Conversational Recommender System integrates cutting-edge technologies such as deep learning, leveraging machine learning algorithms like Apriori for Association Rule Mining, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LTSM). Furthermore, sophisticated voice recognition technologies, including Hidden Markov Models (HMMs) and Dynamic Time Warping (DTW) algorithms, play a crucial role in accurate speech-to-text conversion, ensuring robust performance in diverse environments. The methodology incorporates a fusion of content-based and collaborative recommendation approaches, enhancing them with NLP techniques. This innovative integration ensures a more personalized and context-aware recommendation experience, particularly in marketing applications.
https://arxiv.org/abs/2505.11933
In recent years, the expressive power of various neural architectures -- including graph neural networks (GNNs), transformers, and recurrent neural networks -- has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them, particularly important and challenging to analyse are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of the logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how the graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These results yield the first logical characterisations of temporal GNNs and establish new relative expressiveness results for temporal GNNs.
https://arxiv.org/abs/2505.11930
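As a hedged illustration of the kind of property expressible in such a product logic (the notation is assumed standard and these example formulas are ours, not the paper's), the two formulas below differ only in the order of the temporal and spatial operators, exactly the kind of interaction that the restricted fragments constrain:

```latex
% Two illustrative formulas in the product of (past) PTL and the modal logic K,
% differing only in the order of the temporal and spatial operators.
\[
  \Diamond^{-}\,\Box\, p
  \quad\text{(``at some earlier time point, $p$ held at all neighbours''),}
\]
\[
  \Box\,\Diamond^{-}\, p
  \quad\text{(``every neighbour satisfied $p$ at some, possibly different, earlier time'').}
\]
% Here \Diamond^{-} denotes the past ``eventually'' of PTL and \Box ranges over graph neighbours.
```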
Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.
https://arxiv.org/abs/2505.11889
Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects when explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplatense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.
https://arxiv.org/abs/2505.11485
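A hedged sketch of the kind of predictability measure used in this line of work: per-token surprisal from GPT-2 via the Hugging Face transformers library, which can then be regressed against gaze durations. The model name, the toy sentence, and the synthetic durations are assumptions for illustration; the study's actual corpora and word-level alignment are not reproduced.

```python
# Assumed surprisal pipeline: per-token GPT-2 surprisal regressed on (toy) gaze durations.
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisal(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                          # (1, T, vocab)
    logp = torch.log_softmax(logits[0, :-1], dim=-1)     # predictions for tokens 2..T
    s = -logp[torch.arange(ids.size(1) - 1), ids[0, 1:]] / float(np.log(2))  # bits
    return tok.convert_ids_to_tokens(ids[0, 1:].tolist()), s.numpy()

tokens, surprisal = token_surprisal("The cat sat on the mat.")

# toy gaze durations (ms); in the studies these come from eye-tracking corpora
rng = np.random.default_rng(0)
gaze = 150 + 20 * surprisal + rng.normal(0, 10, len(surprisal))
slope, intercept = np.polyfit(surprisal, gaze, 1)        # linear predictability effect
print(list(zip(tokens, surprisal.round(2))), round(slope, 2))
```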
Neural networks are often black boxes, reflecting the significant challenge of understanding their internal workings. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage cognitively-inspired methods of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Discrete sequence chunking (DSC) creates a dictionary of entities; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting entities across varying model sizes, ranging from inducing compositionality in RNNs to uncovering recurring neural population states in large models with diverse architectures, and illustrate their advantage over other methods. Throughout, we observe a robust correspondence between the extracted entities and concrete or abstract concepts. Artificially inducing the extracted entities in neural populations effectively alters the network's generation of associated concepts. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
https://arxiv.org/abs/2505.11576
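A small sketch of the population-averaging (PA) idea described above, under toy assumptions: average the hidden-state vectors recorded whenever a labelled concept is present, then detect the concept in new activity by cosine similarity to that template. The synthetic recordings, threshold, and shapes are ours, not the paper's.

```python
# Assumed population-averaging (PA) sketch on synthetic hidden-state recordings.
import numpy as np

rng = np.random.default_rng(0)
D = 64                                     # hidden-state dimensionality
concept_direction = rng.normal(size=D)

# toy recorded activity: 200 time steps, with known labels for when the concept occurs
labels = rng.random(200) < 0.3
activity = rng.normal(size=(200, D)) + np.outer(labels, concept_direction)

# population averaging: the template is the mean state over labelled occurrences
template = activity[labels].mean(axis=0)

def detect(x, template, threshold=0.3):
    cos = x @ template / (np.linalg.norm(x, axis=-1) * np.linalg.norm(template) + 1e-8)
    return cos > threshold

pred = detect(activity, template)
print("hit rate:", (pred & labels).sum() / labels.sum(),
      "false-alarm rate:", (pred & ~labels).sum() / (~labels).sum())
```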
This paper proposes CTP, a novel deep learning framework that integrates convolutional neural networks (CNN), Transformer architectures, and physics-informed neural networks (PINN) for ocean front prediction. Ocean fronts, as dynamic interfaces between distinct water masses, play critical roles in marine biogeochemical and physical processes. Existing methods such as LSTM, ConvLSTM, and AttentionConv often struggle to maintain spatial continuity and physical consistency over multi-step forecasts. CTP addresses these challenges by combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Experimental results across the South China Sea (SCS) and Kuroshio (KUR) regions from 1993 to 2020 demonstrate that CTP achieves state-of-the-art (SOTA) performance in both single-step and multi-step predictions, significantly outperforming baseline models in accuracy, $F_1$ score, and temporal stability.
https://arxiv.org/abs/2505.10894
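A hedged sketch of the physics-informed ingredient: the training loss combines a supervised term with a soft penalty on the residual of an assumed governing equation, here a simple 2-D advection constraint evaluated with finite differences. The constraint, velocities, and weighting are illustrative assumptions, not the constraint actually enforced for ocean fronts.

```python
# Assumed physics-informed composite loss with a finite-difference advection residual.
import torch

def advection_residual(field, u=0.1, v=0.05, dt=1.0, dx=1.0, dy=1.0):
    """field: (B, T, H, W). Residual of dF/dt + u * dF/dx + v * dF/dy = 0."""
    dF_dt = (field[:, 1:] - field[:, :-1]) / dt                         # (B, T-1, H, W)
    dF_dx = (field[:, :-1, :, 2:] - field[:, :-1, :, :-2]) / (2 * dx)   # central differences
    dF_dy = (field[:, :-1, 2:, :] - field[:, :-1, :-2, :]) / (2 * dy)
    return (dF_dt[:, :, 1:-1, 1:-1]
            + u * dF_dx[:, :, 1:-1, :]
            + v * dF_dy[:, :, :, 1:-1])

def physics_informed_loss(pred, target, lam=0.1):
    data_loss = torch.nn.functional.mse_loss(pred, target)   # supervised term
    phys_loss = advection_residual(pred).pow(2).mean()       # soft physics penalty
    return data_loss + lam * phys_loss

pred = torch.randn(2, 6, 32, 32, requires_grad=True)
target = torch.randn(2, 6, 32, 32)
print(float(physics_informed_loss(pred, target)))
```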
As autonomous systems become integral to various industries, effective strategies for fault handling are essential to ensure reliability and efficiency. Transfer of Control (ToC), a traditional approach for interrupting automated processes during faults, is often triggered unnecessarily in non-critical situations. To address this, we propose a data-driven method that uses human interaction data to train AI models capable of preemptively identifying and addressing issues or assisting users in resolution. Using an interactive tool simulating an industrial vacuum cleaner, we collected data and developed an LSTM-based model to predict user behavior. Our findings reveal that even data from non-experts can effectively train models to reduce unnecessary ToC events, enhancing the system's robustness. This approach highlights the potential of AI to learn directly from human problem-solving behaviors, complementing sensor data to improve industrial automation and human-AI collaboration.
https://arxiv.org/abs/2505.10695
Modern robots face challenges shared by humans, where machines must learn multiple sensorimotor skills and express them adaptively. Equipping robots with a human-like memory of how it feels to do multiple stereotypical movements can make robots more aware of normal operational states and help develop self-preserving safer robots. Associative Skill Memories (ASMs) aim to address this by linking movement primitives to sensory feedback, but existing implementations rely on hard-coded libraries of individual skills. A key unresolved problem is how a single neural network can learn a repertoire of skills while enabling fault detection and context-aware execution. Here we introduce Neural Associative Skill Memories (ASMs), a framework that utilises self-supervised predictive coding for temporal prediction to unify skill learning and expression, using biologically plausible learning rules. Unlike traditional ASMs which require explicit skill selection, Neural ASMs implicitly recognize and express skills through contextual inference, enabling fault detection across learned behaviours without an explicit skill selection mechanism. Compared to recurrent neural networks trained via backpropagation through time, our model achieves comparable qualitative performance in skill memory expression while using local learning rules and predicts a biologically relevant speed-accuracy trade-off during skill memory expression. This work advances the field of neurorobotics by demonstrating how predictive coding principles can model adaptive robot control and human motor preparation. By unifying fault detection, reactive control, skill memorisation and expression into a single energy-based architecture, Neural ASMs contribute to safer robotics and provide a computational lens to study biological sensorimotor learning.
https://arxiv.org/abs/2505.09760
Cloud-based multilingual translation services like Google Translate and Microsoft Translator achieve state-of-the-art translation capabilities. These services inherently use large multilingual language models such as GRU, LSTM, BERT, GPT, T5, or similar encoder-decoder architectures with attention mechanisms as the backbone. In addition, new-age natural language systems such as ChatGPT and DeepSeek have demonstrated huge potential across multiple natural language processing tasks, and they also possess outstanding multilingual translation capabilities. However, these models use the classical computing realm as a backend. QEDACVC (Quantum Encoder Decoder Attention-based Convolutional Variational Circuits) is an alternate solution that explores the quantum computing realm instead of the classical computing realm to study and demonstrate multilingual machine translation. QEDACVC introduces a quantum encoder-decoder architecture that simulates and runs on quantum computing hardware via quantum convolution, quantum pooling, quantum variational circuits, and quantum attention as software alterations. QEDACVC achieves an accuracy of 82% when trained on the OPUS dataset for English, French, German, and Hindi corpora for multilingual translation.
https://arxiv.org/abs/2505.09407
Modern video super-resolution (VSR) systems based on convolutional neural networks (CNNs) require huge computational costs. The problem of feature redundancy is present in most models in many domains, but is rarely discussed in VSR. We experimentally observe that many features in VSR models are also similar to each other, so we propose to use "Ghost features" to reduce this redundancy. We also analyze the vanishing-gradient ("gradient disappearance") phenomenon that arises in the conventional recurrent convolutional network (RNN) model, and combine the Ghost module with the RNN to complete the modeling of the time series. The current frame is used as input to the model together with the next frame, the output of the previous frame, and the hidden state. Extensive experiments on several benchmark models and datasets show that the PSNR and SSIM of our proposed model are improved to some extent. Some texture details in the video are also better preserved.
https://arxiv.org/abs/2505.10577
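For reference, a Ghost-style module (in the spirit of GhostNet) can be sketched as follows: a primary convolution produces a few intrinsic feature maps and a cheap depthwise convolution derives the remaining "ghost" features from them, cutting redundant computation. Channel counts and kernel sizes are assumptions, and the VSR-specific recurrent integration is not shown.

```python
# Minimal Ghost-style module sketch (PyTorch); not the paper's exact VSR block.
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, k=1, dw_k=3):
        super().__init__()
        init_ch = out_ch // ratio                      # intrinsic channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(                    # depthwise "ghost" generation
            nn.Conv2d(init_ch, out_ch - init_ch, dw_k, padding=dw_k // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(out_ch - init_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)    # (B, out_ch, H, W)

m = GhostModule(in_ch=64, out_ch=64)
print(m(torch.randn(1, 64, 32, 32)).shape)             # torch.Size([1, 64, 32, 32])
```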