Paper Reading AI Learner

Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

2025-05-07 19:13:17
Sainath Dey, Mitul Goswami, Jashika Sethi, Prasant Kumar Pattnaik

Abstract

This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing the Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has overlooked both the prebuilt modularity of the ViT architecture and the integration of the edge-detection capabilities of wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated into ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (image recognition), COCO (object detection and instance segmentation), and ADE20K (semantic segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and of spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.
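The core idea behind Eff-KAN — replacing an MLP layer's fixed activation with learnable per-edge spline functions — can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the class names (`KANEdge`, `KANLayer`) are invented for this example, and the spline is simplified to piecewise-linear interpolation over a fixed grid, whereas Eff-KAN proper uses B-spline bases.

```python
import math

def silu(x):
    """SiLU base activation, x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

class KANEdge:
    """One learnable activation phi(x) = w_base*silu(x) + w_spline*spline(x).
    The spline is approximated here as piecewise-linear interpolation over
    fixed knots (a simplified stand-in for the B-splines used by Eff-KAN)."""
    def __init__(self, grid_min=-2.0, grid_max=2.0, num_knots=5):
        self.grid_min, self.grid_max = grid_min, grid_max
        step = (grid_max - grid_min) / (num_knots - 1)
        self.knots = [grid_min + i * step for i in range(num_knots)]
        self.coefs = [0.0] * num_knots      # learnable values at the knots
        self.w_base, self.w_spline = 1.0, 1.0  # learnable mixing weights

    def spline(self, x):
        # Clamp to the grid, then linearly interpolate between the two
        # neighboring knots.
        x = min(max(x, self.grid_min), self.grid_max)
        h = (self.grid_max - self.grid_min) / (len(self.knots) - 1)
        i = min(int((x - self.grid_min) / h), len(self.knots) - 2)
        t = (x - self.knots[i]) / h
        return (1 - t) * self.coefs[i] + t * self.coefs[i + 1]

    def __call__(self, x):
        return self.w_base * silu(x) + self.w_spline * self.spline(x)

class KANLayer:
    """Drop-in replacement for an MLP layer: instead of
    y_j = act(sum_i W_ij * x_i), every edge carries its own learnable
    activation, so y_j = sum_i phi_ij(x_i)."""
    def __init__(self, d_in, d_out):
        self.edges = [[KANEdge() for _ in range(d_in)] for _ in range(d_out)]

    def __call__(self, x):
        return [sum(phi(xi) for phi, xi in zip(row, x))
                for row in self.edges]
```

The structural difference from an MLP is that the nonlinearity moves from the nodes onto the edges: training adjusts each edge's spline coefficients rather than a single weight per connection, which is what lets the layer shape its activation functions to the data.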

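The abstract attributes Wav-KAN's edge-detection capability to orthogonal wavelet transforms. The sketch below illustrates the principle with the simplest orthogonal wavelet (the Haar wavelet); the function names are illustrative and not taken from the paper. Each transform level splits a signal into a low-pass approximation and high-pass detail coefficients, and the detail coefficients are nonzero only where the signal changes — i.e., at edges.

```python
def haar_step(signal):
    """One level of the orthonormal Haar wavelet transform: splits an
    even-length signal into a low-pass (approximation) half and a
    high-pass (detail) half."""
    s = 2 ** -0.5  # orthonormal scaling factor
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_multires(signal, levels):
    """Multi-resolution decomposition: repeatedly transform the
    approximation, keeping each level's detail coefficients."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details
```

Feeding a step signal such as `[0, 0, 0, 1, 1, 1, 1, 1]` through `haar_step` yields detail coefficients that are zero everywhere except at the pair straddling the step — which is why wavelet sub-bands behave as built-in edge detectors, the spectral prior the abstract credits for Wav-KAN's segmentation gains.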

URL

https://arxiv.org/abs/2505.04740

PDF

https://arxiv.org/pdf/2505.04740.pdf

