Abstract
Large Language Models (LLMs) deployed on edge devices learn by fine-tuning and updating a portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall resources required remain a heavy burden on edge devices. Retrieval-Augmented Generation (RAG), a resource-efficient alternative, can instead improve the quality of LLM-generated content without updating model parameters. However, a RAG-based LLM may need to search the profile data repeatedly, once in every user-LLM interaction. As user data accumulates, this search can incur significant latency. Conventional efforts to decrease latency restrict the size of the saved user data, which limits the scalability of RAG as user data continuously grows. An open question remains: how can RAG be freed from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. CiM accelerates matrix multiplications by performing in-situ computation inside the memory, avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), uses a novel contrastive learning-based training method together with noise-aware training to enable RAG to search profile data efficiently with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
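To make the retrieval step concrete, the following is a minimal sketch of why RAG search reduces to a matrix multiplication and why noise matters when that multiplication runs on analog memory. It is not the RoCR method: the embeddings, the Gaussian noise model, and the helper names (`matvec`, `add_device_noise`, `retrieve`) are all hypothetical and chosen only for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector (the operation a CiM array would perform in place)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def add_device_noise(matrix, sigma, rng):
    """Perturb every stored value with Gaussian noise (an assumed stand-in for device variation)."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in matrix]

def retrieve(profile_embeddings, query, top_k=1):
    """Return the indices of the top_k profile entries ranked by inner-product score."""
    scores = matvec(profile_embeddings, query)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Hypothetical 4-dimensional embeddings for three stored profile entries.
profile_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [0.1, 0.9, 0.2, 0.0]

rng = random.Random(0)  # fixed seed so the illustration is repeatable
noisy_embeddings = add_device_noise(profile_embeddings, sigma=0.05, rng=rng)

ideal_hit = retrieve(profile_embeddings, query)
noisy_hit = retrieve(noisy_embeddings, query)
print("ideal:", ideal_hit, "noisy:", noisy_hit)
```

With well-separated embeddings and small noise, the two rankings agree. As stored embeddings become more similar or the noise grows, the rankings can diverge, which is the kind of failure that noise-aware training is meant to guard against.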
URL
https://arxiv.org/abs/2405.04700