Abstract
This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has not exploited the prebuilt modularity of the ViT architecture or integrated the edge-detection capabilities of wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated into ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (image recognition), COCO (object detection and instance segmentation), and ADE20K (semantic segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.
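To make the idea of a wavelet-based KAN module concrete, the following is a minimal NumPy sketch of a KAN-style layer in which every input-output edge carries its own learnable univariate function. It is not the authors' implementation: the class name `WaveletKANLayer`, its parameters, and the choice of the Mexican hat wavelet are assumptions made for illustration. In particular, the Mexican hat wavelet is not orthogonal, so this sketch does not reproduce the orthogonal wavelet transforms the abstract mentions; it only shows how a wavelet activation can stand in for the fixed nonlinearity of an MLP.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, evaluated elementwise."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


class WaveletKANLayer:
    """Hypothetical KAN-style layer: one wavelet per (output, input) edge.

    Each edge applies weight * psi((x - shift) / scale), and every output
    unit sums its edge functions over all inputs. Forward pass only; the
    parameters would be trained by gradient descent in a real framework.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.scale = np.ones((out_dim, in_dim))    # per-edge dilation
        self.shift = np.zeros((out_dim, in_dim))   # per-edge translation
        self.weight = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (out_dim, in_dim))

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, 1, in_dim), broadcast over out_dim
        t = (x[:, None, :] - self.shift) / self.scale
        # Sum the per-edge univariate functions over the input dimension
        return np.sum(self.weight * mexican_hat(t), axis=-1)


# Toy usage: 4 token embeddings of width 8 mapped to width 16
tokens = np.random.default_rng(1).normal(size=(4, 8))
layer = WaveletKANLayer(in_dim=8, out_dim=16)
out = layer(tokens)
```

In a ViT encoder block, a layer of this kind would take the place of one of the linear-plus-activation stages of the MLP sub-block; where exactly Hyb-KAN ViT places each module is described in the paper itself, not in this sketch.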
URL
https://arxiv.org/abs/2505.04740