Abstract
Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, and provides a structured dataset for further deep learning research in this domain.
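To make the input representation concrete, below is a minimal sketch of how multi-channel jet-view images of the kind described in the abstract can be built by binning detector hits into an eta-phi grid. The grid size (125x125), the eta-phi window, and the channel ordering (ECAL, HCAL, tracks) are assumptions for illustration, not the paper's exact preprocessing.

```python
# Minimal sketch: binning per-hit energies into a 3-channel jet-view image.
# Grid size, window, and channel layout are illustrative assumptions.
import numpy as np

def jet_image(eta, phi, energy, channel, size=125, half_window=1.25):
    """Accumulate hit energies into a (3, size, size) eta-phi image.

    eta, phi, energy : 1D arrays of hit coordinates (relative to the
        jet axis) and deposited energy.
    channel : 1D integer array; 0=ECAL, 1=HCAL, 2=tracks (assumed layout).
    """
    img = np.zeros((3, size, size), dtype=np.float32)
    # Map [-half_window, +half_window) onto pixel indices [0, size).
    ix = ((eta + half_window) / (2 * half_window) * size).astype(int)
    iy = ((phi + half_window) / (2 * half_window) * size).astype(int)
    keep = (ix >= 0) & (ix < size) & (iy >= 0) & (iy < size)
    # Unbuffered accumulation so repeated pixels sum rather than overwrite.
    np.add.at(img, (channel[keep], iy[keep], ix[keep]), energy[keep])
    return img

# Toy usage: 100 random deposits scattered around the jet axis.
rng = np.random.default_rng(0)
img = jet_image(
    eta=rng.normal(0, 0.4, 100),
    phi=rng.normal(0, 0.4, 100),
    energy=rng.exponential(1.0, 100),
    channel=rng.integers(0, 3, 100),
)
print(img.shape)  # (3, 125, 125)
```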
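The abstract does not specify how the ViT-CNN hybrids fuse the two branches; one plausible reading is late feature fusion. Below is a sketch of such a ViT+ConvNeXt hybrid using timm backbones; the backbone variants, input size, and concatenation-based fusion are all assumptions, not the paper's confirmed architecture.

```python
# Sketch of a late-fusion ViT+ConvNeXt hybrid classifier (assumed design).
import torch
import torch.nn as nn
import timm

class ViTConvNeXtHybrid(nn.Module):
    def __init__(self, in_chans=3, num_classes=2):
        super().__init__()
        # num_classes=0 makes timm backbones return pooled features.
        self.vit = timm.create_model(
            "vit_small_patch16_224", num_classes=0, in_chans=in_chans)
        self.cnn = timm.create_model(
            "convnext_tiny", num_classes=0, in_chans=in_chans)
        self.head = nn.Linear(
            self.vit.num_features + self.cnn.num_features, num_classes)

    def forward(self, x):
        # Concatenate global features from both branches, then classify.
        return self.head(torch.cat([self.vit(x), self.cnn(x)], dim=1))

model = ViTConvNeXtHybrid()
logits = model(torch.randn(2, 3, 224, 224))  # quark vs. gluon logits
print(logits.shape)  # torch.Size([2, 2])
```

The same pattern would apply to a ViT+MaxViT hybrid by swapping the convolutional branch for a MaxViT backbone.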
URL
https://arxiv.org/abs/2506.14934