Paper Reading AI Learner

Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images

2025-06-17 19:32:04
Md Abrar Jahin, Shahriar Soudeep, Arian Rahman Aditta, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen

Abstract

Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
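The paper builds multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks. A minimal sketch of that rasterization step, assuming each sub-detector contributes a list of (eta, phi, energy) hits histogrammed onto an eta-phi grid centered on the jet axis; the grid size, window, and toy random hits below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def jet_image(eta, phi, energy, bins=32, window=0.8):
    """Histogram per-hit (eta, phi) positions, weighted by energy, into a
    bins x bins grid; hits are assumed pre-centered on the jet axis."""
    rng = [[-window, window], [-window, window]]
    img, _, _ = np.histogram2d(eta, phi, bins=bins, range=rng, weights=energy)
    return img

# Toy random hits standing in for ECAL, HCAL, and track channels.
gen = np.random.default_rng(0)
channels = []
for _ in range(3):
    eta = gen.normal(0.0, 0.3, 50)
    phi = gen.normal(0.0, 0.3, 50)
    e = gen.exponential(1.0, 50)
    channels.append(jet_image(eta, phi, e))

image = np.stack(channels)  # shape (3, 32, 32): a CNN/ViT-ready jet view
print(image.shape)
```

Stacking the three sub-detector grids as channels mirrors how RGB channels feed an image classifier, which is what makes the end-to-end image-based approach possible.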
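The classifiers benchmarked are ViTs and ViT-CNN hybrids. The paper's exact architectures (ViT+MaxViT, ViT+ConvNeXt) are not specified here, but a generic patch-embedding ViT binary classifier over a 3-channel 32x32 jet image can be sketched in PyTorch; every dimension and layer count below is an illustrative assumption:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT: conv patch embedding, [CLS] token, transformer encoder."""
    def __init__(self, in_ch=3, img=32, patch=8, dim=64, depth=2, heads=4, classes=2):
        super().__init__()
        n_patches = (img // patch) ** 2
        # Non-overlapping patches via a strided convolution.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        t = torch.cat([cls, t], dim=1) + self.pos          # prepend [CLS]
        t = self.encoder(t)                                # global self-attention
        return self.head(t[:, 0])                          # classify from [CLS]

logits = TinyViT()(torch.randn(4, 3, 32, 32))  # quark-vs-gluon logits
print(logits.shape)
```

The global self-attention in the encoder is what the abstract credits for capturing long-range spatial correlations in jet substructure; the hybrid variants additionally feed convolutional features (e.g. from a ConvNeXt or MaxViT backbone) alongside the ViT stream.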


URL

https://arxiv.org/abs/2506.14934

PDF

https://arxiv.org/pdf/2506.14934.pdf

