Abstract
Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (<1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Building on PointMAE, we propose volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach -- which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) -- achieves 99.4% track and 97.7% shower classification F-scores, matching those of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M -- the largest open LArTPC dataset (1M+ events, 5.2B labeled points) -- to advance SSL in high energy physics (HEP). Project site: this https URL
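To make the tokenization idea concrete, below is a minimal sketch (not the authors' implementation) of how sparse 3D ionization points might be grouped into fixed-volume patches that serve as tokens. The function name `volumetric_tokenize` and the parameters `patch_size` and `max_points` are illustrative assumptions; the per-point energy channel is included only to hint at how an auxiliary energy infilling target could attach to each token.

```python
import numpy as np

def volumetric_tokenize(points, energies, patch_size=10.0, max_points=32):
    """Group sparse 3D points into fixed-volume patches (tokens).

    points:     (N, 3) array of ionization coordinates (e.g. in cm)
    energies:   (N,)   array of per-point deposited energy
    patch_size: edge length of each cubic patch; grouping by physical
                volume rather than k-NN keeps tokens resolution-agnostic
    max_points: each token is padded/subsampled to a fixed point count
    """
    # Assign every point to a cubic voxel of edge length `patch_size`
    voxel_ids = np.floor(points / patch_size).astype(np.int64)
    # One token per occupied voxel; `inverse` maps points to their token
    uniq, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)

    tokens, centers = [], []
    for t in range(len(uniq)):
        idx = np.nonzero(inverse == t)[0]
        if len(idx) > max_points:  # subsample locally dense patches
            idx = np.random.choice(idx, max_points, replace=False)
        pts = points[idx]
        center = pts.mean(axis=0)
        # Center coordinates within the patch and pad to a fixed shape
        local = np.zeros((max_points, 4), dtype=np.float32)
        local[: len(idx), :3] = pts - center
        local[: len(idx), 3] = energies[idx]  # energy channel (hypothetical infilling target)
        tokens.append(local)
        centers.append(center)
    return np.stack(tokens), np.stack(centers)

# Toy usage: 200 random sparse points in a 100 cm cube
pts = np.random.rand(200, 3) * 100.0
e = np.random.rand(200)
tok, ctr = volumetric_tokenize(pts, e)
print(tok.shape, ctr.shape)  # (num_tokens, 32, 4), (num_tokens, 3)
```

Because each token is defined by a physical volume rather than a fixed neighbor count, the same grouping applies regardless of detector resolution, which is the property the abstract refers to as "resolution-agnostic" patches.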
URL
https://arxiv.org/abs/2502.02558