Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework for Autonomous Scientific Research (ASR) across diverse scientific fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, generating innovative ideas that enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing domain expert knowledge to be integrated seamlessly. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
https://arxiv.org/abs/2505.16938
GUI automation faces critical challenges in dynamic environments. Multimodal large language models (MLLMs) suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at this https URL.
https://arxiv.org/abs/2505.16827
Reservoir Computing (RC) with physical systems requires an understanding of the underlying structure and internal dynamics of the specific physical reservoir. In this study, physical nano-electronic networks with neuromorphic dynamics are investigated for their use as physical reservoirs in an RC framework. These neuromorphic networks operate as dynamic reservoirs, with node activities in general coupled to the edge dynamics through nonlinear nano-electronic circuit elements, and the reservoir outputs influenced by the underlying network connectivity structure. This study finds that networks with varying degrees of sparsity generate more useful nonlinear temporal outputs for dynamic RC compared to dense networks. Dynamic RC is also tested on an autonomous multivariate chaotic time series prediction task with networks of varying densities, which revealed the importance of network sparsity in maintaining network activity and overall dynamics, which in turn enabled learning of the chaotic Lorenz63 system's attractor behavior.
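The Lorenz63 system used in the prediction task above can be generated in a few lines of numpy. The sketch below is illustrative only: the standard parameters (σ = 10, ρ = 28, β = 8/3), the RK4 step size, and the initial condition are conventional assumptions, not details from the paper. It produces the multivariate chaotic series such a reservoir would be trained to forecast.

```python
import numpy as np

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One 4th-order Runge-Kutta step of the Lorenz63 system."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def lorenz63_trajectory(n_steps=5000, x0=(1.0, 1.0, 1.0)):
    """Generate a multivariate chaotic time series on the Lorenz attractor."""
    traj = np.empty((n_steps, 3))
    traj[0] = x0
    for t in range(1, n_steps):
        traj[t] = lorenz63_step(traj[t - 1])
    return traj
```

In an autonomous prediction task, a reservoir would be driven by a prefix of this trajectory and then fed its own predictions to reproduce the attractor's long-term statistics.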
https://arxiv.org/abs/2505.16813
The Earth's surface is subject to complex and dynamic processes, ranging from large-scale phenomena such as tectonic plate movements to localized changes associated with ecosystems, agriculture, or human activity. Satellite images enable global monitoring of these processes with extensive spatial and temporal coverage, offering advantages over in-situ methods. In particular, the resulting satellite image time series (SITS) datasets contain valuable information. To handle their large volume and complexity, some recent works focus on the use of graph-based techniques that abandon the regular Euclidean structure of satellite data to work at an object level. Moreover, graphs enable modelling spatial and temporal interactions between identified objects, which are crucial for pattern detection, classification and regression tasks. This paper examines the integration of graph-based methods in spatio-temporal remote-sensing analysis. In particular, it presents a versatile graph-based pipeline to tackle SITS analysis, focusing on the construction of spatio-temporal graphs from SITS and their application to downstream tasks. The paper includes a comprehensive review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water resource forecasting. It also discusses numerous perspectives to resolve current limitations and encourage future developments.
https://arxiv.org/abs/2505.16685
Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The ``black-box'' nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.
https://arxiv.org/abs/2505.16288
We present a method for the automated discovery of system-level dynamics in Flow-Lenia (a continuous cellular automaton (CA) with mass conservation and parameter localization) using a curiosity-driven AI scientist. This method aims to uncover processes leading to self-organization of evolutionary and ecosystemic dynamics in CAs. We build on previous work which uses diversity search algorithms in Lenia to find self-organized individual patterns, and extend it to large environments that support distinct interacting patterns. We adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to drive exploration of diverse Flow-Lenia environments using simulation-wide metrics, such as evolutionary activity, compression-based complexity, and multi-scale entropy. We test our method in two experiments, showcasing its ability to illuminate significantly more diverse dynamics compared to random search. We show qualitative results illustrating how ecosystemic simulations enable self-organization of complex collective behaviors not captured by previous individual pattern search and analysis. We complement automated discovery with an interactive exploration tool, creating an effective human-AI collaborative workflow for scientific investigation. Though demonstrated specifically with Flow-Lenia, this methodology provides a framework potentially applicable to other parameterizable complex systems where understanding emergent collective properties is of interest.
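The IMGEP loop can be sketched on a toy parameterizable system. Everything below is an illustrative stand-in (the quadratic-sine toy system, the 1-D outcome metric, the goal range, and the mutation noise are assumptions, not Flow-Lenia or the paper's simulation-wide metrics); it only shows the characteristic sample-goal, mutate-nearest, run-and-store structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_system(params):
    """Toy stand-in for a Flow-Lenia rollout: maps a parameter vector
    to a 1-D 'simulation-wide metric' (e.g. a complexity score)."""
    return np.array([np.sin(3 * params[0]) + params[1] ** 2])

def imgep(n_random=20, n_goal_directed=200, noise=0.1):
    """Minimal IMGEP: sample a goal in outcome space, mutate the params
    of the closest previously observed outcome, and store the result."""
    history = []  # (params, outcome) pairs
    for _ in range(n_random):            # bootstrap with random params
        p = rng.uniform(-1, 1, size=2)
        history.append((p, run_system(p)))
    for _ in range(n_goal_directed):     # goal-directed exploration
        goal = rng.uniform(-1, 2, size=1)
        closest, _ = min(history, key=lambda h: abs(h[1][0] - goal[0]))
        p = closest + rng.normal(0, noise, size=2)   # small mutation
        history.append((p, run_system(p)))
    return history

history = imgep()
outcomes = np.array([h[1][0] for h in history])
```

In the paper's setting, `run_system` would be a full Flow-Lenia simulation and the outcome space would be spanned by metrics such as evolutionary activity or multi-scale entropy rather than a scalar.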
https://arxiv.org/abs/2505.15998
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs, even as newer models improve at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.
https://arxiv.org/abs/2505.15928
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) compared to passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
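The Delta- and Theta-band analysis above reduces to comparing spectral power inside the stated frequency ranges. The sketch below uses the band edges from the abstract (Delta 0.2-3 Hz, Theta 4-7 Hz) but a synthetic signal and an assumed sampling rate; it is not the paper's MEG pipeline.

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Mean spectral power of `signal` within [f_lo, f_hi] Hz,
    computed from the one-sided FFT power spectrum."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return power[mask].mean()

# Illustrative 'MEG channel': a 2 Hz (Delta-band) oscillation plus noise.
fs = 250.0                      # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1.0 / fs)  # 10 s of data
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.standard_normal(len(t))

delta = band_power(x, fs, 0.2, 3.0)   # Delta band from the abstract
theta = band_power(x, fs, 4.0, 7.0)   # Theta band from the abstract
```

With the dominant 2 Hz component, `delta` exceeds `theta`; per-band features of this kind are what the classifiers would consume.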
https://arxiv.org/abs/2505.15355
Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.
https://arxiv.org/abs/2505.14556
This technical report presents a natural language processing (NLP)-based approach for systematically classifying scientific literature on childhood speech disorders. We retrieved and filtered 4,804 relevant articles published after 2015 from the PubMed database using domain-specific keywords. After cleaning and pre-processing the abstracts, we applied two topic modeling techniques - Latent Dirichlet Allocation (LDA) and BERTopic - to identify latent thematic structures in the corpus. Our models uncovered 14 clinically meaningful clusters, such as infantile hyperactivity and abnormal epileptic behavior. To improve relevance and precision, we incorporated a custom stop word list tailored to speech pathology. Evaluation results showed that the LDA model achieved a coherence score of 0.42 and a perplexity of -7.5, indicating strong topic coherence and predictive performance. The BERTopic model exhibited a low proportion of outlier topics (less than 20%), demonstrating its capacity to classify heterogeneous literature effectively. These results provide a foundation for automating literature reviews in speech-language pathology.
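The LDA stage of this pipeline can be sketched with scikit-learn. The toy corpus, topic count, and custom stop word list below are illustrative stand-ins for the paper's 4,804 PubMed abstracts and its speech-pathology-tailored stop list; only the shape of the pipeline (vectorize, fit LDA, inspect perplexity) matches the report.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for cleaned PubMed abstracts (illustrative only).
abstracts = [
    "speech delay articulation therapy child outcomes",
    "articulation disorder speech therapy intervention child",
    "epileptic seizure eeg abnormal behavior infant",
    "infant seizure epileptic eeg monitoring abnormal",
] * 10

# A custom stop word list, mimicking the paper's domain-tailored list.
vectorizer = CountVectorizer(stop_words=["child", "infant"])
dtm = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)   # per-document topic distributions

# Perplexity on the corpus (lower is better), the same diagnostic the
# report cites for its LDA model.
perp = lda.perplexity(dtm)
```

Each row of `doc_topics` is a probability distribution over topics, which is what downstream cluster labelling (e.g. the 14 clinical clusters) would be built on.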
https://arxiv.org/abs/2505.14242
Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct neural activity recordings captured during speech production. We leverage pre-trained embeddings from deep learning models trained on linguistic and acoustic data to represent high-level speech features and map them onto neural signals. We analyze the extent to which these embeddings preserve the spatio-temporal dynamics of brain activity. We evaluate reconstructed neural signals against ground truth recordings using correlation metrics and signal reconstruction quality assessments. The results indicate that neural activity can be effectively reconstructed using embeddings from large language and speech models across all study participants, yielding Pearson correlation coefficients ranging from 0.79 to 0.99.
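Mapping pre-trained embeddings onto neural signals, as described above, is commonly fit with ridge regression and scored with Pearson correlation. The sketch below uses synthetic data; the dimensions, noise level, train/test split, and the closed-form solver are assumptions for illustration, not details from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 time points, 16-d model embeddings,
# 8 recording channels linearly driven by the embeddings plus noise.
X = rng.standard_normal((200, 16))                      # embeddings
W_true = rng.standard_normal((16, 8))
Y = X @ W_true + 0.1 * rng.standard_normal((200, 8))    # neural signals

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X'X + alpha*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def pearson_r(a, b):
    """Per-channel Pearson correlation of two (time, channel) arrays."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / np.sqrt((a ** 2).sum(0) * (b ** 2).sum(0))

W = ridge_fit(X[:150], Y[:150])        # fit on the first 150 points
r = pearson_r(Y[150:], X[150:] @ W)    # score on held-out points
```

On this clean synthetic data the held-out correlations are high; the study's reported range of 0.79 to 0.99 reflects real recordings, not this toy setup.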
https://arxiv.org/abs/2505.14074
Denoising diffusion probabilistic models are able to generate synthetic sensor signals. The training process of such a model is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. This enables the generation of realistic data. However, the randomness within the process and the loss function itself make it difficult to estimate the quality of the data. Therefore, we examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthesis process using those metrics. The adapted metric can even be fine-tuned on the input data to comply with the requirements of an underlying classification task. We were able to significantly reduce the number of training epochs without a performance reduction in the classification task. An optimized training process not only saves resources, but also reduces the time for training generative models.
https://arxiv.org/abs/2505.14739
With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with a F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at this https URL.
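The CER and WER figures reported above are both normalized edit distances; a minimal sketch of how such metrics are computed (standard definitions, not code from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: edit distance over reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

Lower is better for both, which is why R1 Translator's CER of 0.5795 and WER of 0.7280 improve on the T5 and Brain Translator baselines.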
https://arxiv.org/abs/2505.13936
Associative memory integrates relevant information to support comprehension in the human cognitive system. In this work, we seek to improve the alignment between language models and the human brain during speech processing by integrating associative memory. After verifying the alignment between language models and the brain by mapping language model activations to brain activity, the original text stimuli, expanded with simulated associative memory, are used as input to computational language models. We find that the alignment between language models and the brain improves in brain regions closely related to associative memory processing. We also demonstrate that large language models align better with brain responses after specific supervised fine-tuning, by building the \textit{Association} dataset containing 1000 story samples, with instructions encouraging associative memory as input and associated content as output.
https://arxiv.org/abs/2505.13844
Complex, temporally evolving phenomena, from climate to brain activity, are governed by dynamical systems (DS). DS reconstruction (DSR) seeks to infer generative surrogate models of these from observed data, reproducing their long-term behavior. Existing DSR approaches require purpose-training for any new system observed, lacking the zero-shot and in-context inference capabilities known from LLMs. Here we introduce DynaMix, a novel multivariate ALRNN-based mixture-of-experts architecture pre-trained for DSR, the first DSR model able to generalize zero-shot to out-of-domain DS. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail -- at a fraction of the number of parameters and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix's training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may also hold great potential for advancing the TS prediction field.
https://arxiv.org/abs/2505.13192
The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth-$\mu$P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call "$\mu$PC". Through an extensive analysis of the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, $\mu$PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, $\mu$PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results have implications for other local algorithms and could be extended to convolutional and transformer architectures. Code for $\mu$PC is made available as part of a JAX library for PCNs at this https URL (Innocenti et al., 2024).
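The depth-scaling idea behind Depth-$\mu$P-style parameterisations can be illustrated on a plain residual stack: scaling each residual branch by $1/\sqrt{L}$ keeps activations of comparable size even at 128 layers. The numpy sketch below is only an illustration of that scaling under assumed widths and weight initialisation; it omits all predictive-coding machinery and is not the paper's $\mu$PC.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_stack(x, weights, depth_scale):
    """Forward pass of a deep residual stack whose residual branches
    are scaled by `depth_scale` (Depth-muP-style uses 1/sqrt(L))."""
    for W in weights:
        x = x + depth_scale * np.tanh(W @ x)
    return x

width, depth = 64, 128           # a 128-layer stack, as in the paper
weights = [rng.standard_normal((width, width)) / np.sqrt(width)
           for _ in range(depth)]
x0 = rng.standard_normal(width)

out_scaled = residual_stack(x0, weights, 1.0 / np.sqrt(depth))  # stable
out_naive = residual_stack(x0, weights, 1.0)                    # grows
```

With the $1/\sqrt{L}$ factor the output norm stays bounded independently of depth, which is the kind of stability that makes very deep training tractable for local algorithms as well.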
https://arxiv.org/abs/2505.13124
Human activities are particularly complex and variable, which makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that this structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic and temporal reasoning with a hierarchical architecture. We demonstrate the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance on all benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in the zero-shot setting. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
https://arxiv.org/abs/2505.12911
Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in neural decoding due to the mismatch between the richness of visual stimuli and the noisy, low resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequency components of the input equally. This uniform treatment forces the model to extract meaningful features and suppress irrelevant noise simultaneously, limiting its effectiveness. We introduce FreqSelect, a lightweight, adaptive module that selectively filters spatial-frequency bands before encoding. By dynamically emphasizing frequencies that are most predictive of brain activity and suppressing those that are uninformative, FreqSelect acts as a content-aware gate between image features and neural data. It integrates seamlessly into standard very deep VAE-diffusion pipelines and requires no additional supervision. Evaluated on the Natural Scenes dataset, FreqSelect consistently improves reconstruction quality across both low- and high-level metrics. Beyond performance gains, the learned frequency-selection patterns offer interpretable insights into how different visual frequencies are represented in the brain. Our method generalizes across subjects and scenes, and holds promise for extension to other neuroimaging modalities, offering a principled approach to enhancing both decoding accuracy and neuroscientific interpretability.
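FreqSelect's learned, content-aware gate is not reproduced here, but the underlying operation, reweighting radial spatial-frequency bands of an image before encoding, can be sketched with a plain FFT. The band count and the fixed scalar gates below are illustrative assumptions; in the module proper the gates would be predicted from the input.

```python
import numpy as np

def band_masks(shape, n_bands=4):
    """Partition the 2-D Fourier plane into concentric radial bands."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0, r.max() + 1e-9, n_bands + 1)
    return [(r >= lo) & (r < hi) for lo, hi in zip(edges[:-1], edges[1:])]

def freq_select(image, gates):
    """Reweight each spatial-frequency band by a scalar gate in [0, 1]
    (fixed here; content-aware and learned in FreqSelect itself)."""
    F = np.fft.fft2(image)
    masks = band_masks(image.shape, n_bands=len(gates))
    F_out = sum(g * (F * m) for g, m in zip(gates, masks))
    return np.fft.ifft2(F_out).real

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))

identity = freq_select(img, gates=[1.0, 1.0, 1.0, 1.0])  # pass-through
lowpass = freq_select(img, gates=[1.0, 0.0, 0.0, 0.0])   # low band only
```

All-ones gates recover the input exactly, while suppressing high bands removes fine detail, which is the degree of freedom the module learns to exploit per image.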
https://arxiv.org/abs/2505.12552
Understanding and decoding brain activity into visual representations is a fundamental challenge at the intersection of neuroscience and artificial intelligence. While EEG-based visual decoding has shown promise due to its non-invasive, low-cost nature and millisecond-level temporal resolution, existing methods are limited by their reliance on flat neural representations that overlook the brain's inherent visual hierarchy. In this paper, we introduce ViEEG, a biologically inspired hierarchical EEG decoding framework that aligns with the Hubel-Wiesel theory of visual processing. ViEEG decomposes each visual stimulus into three biologically aligned components-contour, foreground object, and contextual scene-serving as anchors for a three-stream EEG encoder. These EEG features are progressively integrated via cross-attention routing, simulating cortical information flow from V1 to IT to the association cortex. We further adopt hierarchical contrastive learning to align EEG representations with CLIP embeddings, enabling zero-shot object recognition. Extensive experiments on the THINGS-EEG dataset demonstrate that ViEEG achieves state-of-the-art performance, with 40.9% Top-1 accuracy in subject-dependent and 22.9% Top-1 accuracy in cross-subject settings, surpassing existing methods by over 45%. Our framework not only advances the performance frontier but also sets a new paradigm for biologically grounded brain decoding in AI.
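The cross-attention routing that integrates the three streams can be sketched in numpy. The stream names, token counts, and single-head scaled dot-product form below are illustrative assumptions; ViEEG's actual encoder and routing are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: one stream (e.g. object-level EEG
    features) attends over another (e.g. contour-level features)."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])   # scaled dot-product
    return softmax(scores, axis=1) @ V       # weighted sum of values

d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
contour = rng.standard_normal((10, d))   # 'contour' stream tokens
obj = rng.standard_normal((6, d))        # 'object' stream tokens

fused = cross_attention(obj, contour, Wq, Wk, Wv)
```

Chaining such blocks (contour into object, object into scene) mimics the V1-to-IT-to-association-cortex information flow the paper describes.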
https://arxiv.org/abs/2505.12408
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding the user's chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can leave users feeling unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance user-oriented proactivity. Specifically, we first construct a critic to evaluate this proactivity, inspired by the LLM-as-a-judge strategy. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure diversity in user backgrounds, we introduce ISCO-800, a diverse user background dataset for constructing user agents. Moreover, since communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues.
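The iterative curriculum idea (train against easy-to-communicate user agents first, then progressively harder ones) can be sketched as a simple scheduling loop. This is a minimal illustration, not the paper's algorithm: the `UserAgent` fields, the scalar difficulty score (in UPC this would come from the critic), and the linear pool-growth schedule are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class UserAgent:
    background: str    # e.g. an occupation profile drawn from ISCO-800
    difficulty: float  # assumed critic-estimated communication difficulty

def curriculum_rounds(users, n_rounds):
    """Yield one training pool per round, growing linearly from the
    easiest-to-communicate users toward the full (hardest) set."""
    ranked = sorted(users, key=lambda u: u.difficulty)
    for r in range(1, n_rounds + 1):
        cutoff = max(1, round(len(ranked) * r / n_rounds))
        yield ranked[:cutoff]  # round r sees the easiest fraction r/n_rounds
```

Each round's pool would drive critic-guided chatbot/user-agent dialogues, with the resulting corpus used for the next training iteration.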
https://arxiv.org/abs/2505.12334