Abstract
Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavier models, e.g., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain stages to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Captioning, Visual Question Answering, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
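To make the two ideas concrete, here is a minimal sketch in plain Python. It is illustrative only, not the paper's exact procedure: the function names, the linear ratio schedule, and the importance scores are hypothetical, and real UPop learns such scores jointly with the model weights during search.

```python
def progressive_ratio(step, total_steps, target_ratio):
    """Progressive search: anneal the pruning ratio from 0 toward the
    target over the search steps (a hypothetical linear schedule)."""
    return target_ratio * min(step / total_steps, 1.0)

def unified_prune_mask(scores_by_modality, ratio):
    """Unified search: threshold all modalities' importance scores
    against one global cutoff, so each modality's pruning ratio is
    assigned automatically rather than fixed per modality."""
    all_scores = sorted(s for scores in scores_by_modality.values() for s in scores)
    k = int(round(len(all_scores) * ratio))  # number of units to prune overall
    thresh = all_scores[k - 1] if k > 0 else float("-inf")
    # keep a unit only if its score exceeds the global threshold
    return {m: [s > thresh for s in scores]
            for m, scores in scores_by_modality.items()}

# Example with hypothetical scores: the less important units are pruned
# globally, so vision and language end up with different pruning ratios.
scores = {"vision": [0.9, 0.1, 0.5], "language": [0.2, 0.8]}
masks = unified_prune_mask(scores, ratio=0.4)
# masks == {"vision": [True, False, True], "language": [False, True]}
```

Because the cutoff is global, a modality whose structures matter less is pruned more aggressively, which is the intuition behind automatic ratio assignment; the progressive schedule then lets the model adapt gradually instead of being pruned in one shot.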
URL
https://arxiv.org/abs/2301.13741