Abstract
Advanced biological intelligence learns efficiently from an information-rich stream of stimuli, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains. We refer to such learning as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit, externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider the question of how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, providing inefficient (bottlenecked) but broad adaptivity. From there, integration of non-reward information into the learning process can proceed via gradual accumulation of biases induced by such information on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of NNs with reward-driven learning modelled as Reinforcement Learning (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process using a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether, allowing DAL agents to learn from non-reward information exclusively, using local neuromodulation-based connection weight updates only.
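As an illustration only, a neuromodulation-based local weight update of the general kind the abstract describes (plasticity driven by locally available activity and a learned modulatory signal, with no external reward term) can be sketched as follows. The specific update rule, function names, and the Hebbian form below are assumptions for exposition, not the paper's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def neuromodulated_update(w, pre, post, mod, lr=0.01):
    """Hypothetical local update: a Hebbian correlation term gated by a
    per-neuron modulatory signal `mod`. Uses only locally available
    quantities (pre/post activity and modulation), no reward signal and
    no backpropagated error."""
    # Outer product of post- and presynaptic activity: the Hebbian term.
    hebbian = np.outer(post, pre)
    # The modulatory signal scales plasticity per postsynaptic neuron.
    return w + lr * (mod[:, None] * hebbian)

w = rng.normal(scale=0.1, size=(3, 4))   # post x pre weight matrix
pre = rng.normal(size=4)                 # presynaptic activations
post = np.tanh(w @ pre)                  # postsynaptic activations
mod = np.tanh(rng.normal(size=3))        # learned modulatory signal

w_new = neuromodulated_update(w, pre, post, mod)
print(w_new.shape)  # (3, 4)
```

In the evolutionary scenario described above, the parameters producing `mod` would themselves be evolved, so that the modulation encodes domain-specific biases extracted from non-reward stimulus information.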
URL
https://arxiv.org/abs/2404.12631