Abstract
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear: they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. We therefore propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. AGoT models the human thought process not only as a chain but also models each step of that chain as a reasoning aggregation graph, which captures the multiple aspects of thinking that single-step reasoning overlooks. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results on several tasks, including text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance owing to better reasoning.
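The two operations named in the abstract, prompt aggregation within a step and prompt flow between steps, can be illustrated with a minimal sketch. It is not the authors' implementation. The softmax weighting, the fixed blending factor `alpha`, the plain-list vectors, and the names `aggregate`, `flow`, and `agot_chain` are all assumptions made for illustration. In a real model the prompts and weights would be learned tensors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(aspect_prompts: List[Vector], scores: List[float]) -> Vector:
    """Prompt aggregation (assumed form): softmax-weighted sum of the
    aspect prompts that belong to one reasoning step."""
    weights = softmax(scores)
    dim = len(aspect_prompts[0])
    return [
        sum(w * p[i] for w, p in zip(weights, aspect_prompts))
        for i in range(dim)
    ]


def flow(previous: Vector, aggregated: Vector, alpha: float) -> Vector:
    """Prompt flow (assumed form): blend the previous step's prompt with
    the newly aggregated one using a fixed factor alpha."""
    return [alpha * a + (1.0 - alpha) * p for p, a in zip(previous, aggregated)]


def agot_chain(
    initial: Vector,
    steps: List[Tuple[List[Vector], List[float]]],
    alpha: float = 0.5,
) -> Vector:
    """Run the chain: each step aggregates its aspect prompts, then the
    result flows into the running prompt."""
    prompt = initial
    for aspect_prompts, scores in steps:
        prompt = flow(prompt, aggregate(aspect_prompts, scores), alpha)
    return prompt


# Hypothetical two-step chain over 2-dimensional prompts.
example_steps = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # two aspects, equal scores
    ([[2.0, 2.0]], [1.0]),                   # a single aspect
]
result = agot_chain([0.0, 0.0], example_steps, alpha=0.5)
```

Each step here is a small graph whose aspect nodes feed one aggregation node, and the chain links those aggregation nodes together. A purely linear chain of thought corresponds to the special case of one aspect per step.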
URL
https://arxiv.org/abs/2404.04538