Particle identification with machine learning from incomplete data in the ALICE experiment (2024)

Maja Karwowska (corresponding author), Łukasz Graczykowski, Kamil Deja, Miłosz Kasak, and Małgorzata Janik

Abstract

The ALICE experiment at the LHC measures properties of the strongly interacting matter formed in ultrarelativistic heavy-ion collisions. Such studies require accurate particle identification (PID). ALICE provides PID information via several detectors for particles with momentum from about 100 MeV/$c$ up to 20 GeV/$c$. Traditionally, particles are selected with rectangular cuts. Much better performance can be achieved with machine learning (ML) methods. Our solution uses multiple neural networks (NN) serving as binary classifiers. Moreover, we extended our particle classifier with Feature Set Embedding and attention in order to train on data with incomplete samples. We also present the integration of the ML project with the ALICE analysis software, and we discuss domain adaptation, the ML technique needed to transfer the knowledge between simulated and real experimental data.

1 Introduction

ALICE (A Large Ion Collider Experiment) [1] is one of the four major detectors located at the Large Hadron Collider (LHC) at CERN [2]. The main goal of ALICE is to measure the properties of the quark–gluon plasma (QGP), a deconfined state of quarks and gluons, theorized to exist in the early Universe [3]. Detailed studies of QGP require very precise particle identification (PID), i.e., the ability to discriminate between different particle species produced during the collision. High PID accuracy distinguishes ALICE from other LHC experiments. It also allows for selecting a subset of particles required for a specific analysis. Thanks to several detectors operating concurrently, various types of particles can be separated over a wide range of momentum, from around 100 MeV/$c$ up to around 10 GeV/$c$. Figure 1 presents a scheme of the ALICE detectors as used during the Run 1 and Run 2 LHC data-taking periods.

[Figure 1: Scheme of the ALICE detectors as used during the Run 1 and Run 2 LHC data-taking periods.]

In particular, particle identification over the full azimuthal angle uses information from three detectors: the Time Projection Chamber [5] (TPC), the Time-of-Flight detector [6] (TOF), and the Transition Radiation Detector [7] (TRD). The TPC is one of the most important ALICE detectors as it records the 3D trajectory of charged particles. It also measures particle-specific ionization energy loss, which is essential for PID. The TOF measures the particle time of flight from the collision vertex to the detector. Particle velocity and mass can then be computed from the time of flight and the track information. The TRD records transition radiation, the emission of photons by electrons traversing the boundaries of a radiator. It makes it possible to distinguish electrons from other charged particles. Since all the aforementioned detectors detect particles carrying a non-zero electric charge, our work focuses on the identification of charged particles.
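To illustrate how velocity and mass follow from the time of flight, here is a minimal C++ sketch based on the standard relativistic relations $\beta = L/(ct)$ and $m = p\sqrt{1/\beta^{2}-1}$. The function names, the unit choices (cm, ps, GeV/$c$) and the convention of returning a negative value for unphysical inputs are assumptions made for this example; they are not taken from the ALICE software.

```cpp
#include <cmath>

// Speed of light in cm/ps (299792458 m/s expressed in these units).
constexpr double kSpeedOfLightCmPerPs = 0.0299792458;

// Velocity as a fraction of c, from the track length (cm) and the time of flight (ps).
inline double betaFromTof(double trackLengthCm, double timeOfFlightPs)
{
  return trackLengthCm / (kSpeedOfLightCmPerPs * timeOfFlightPs);
}

// Mass in GeV/c^2 from momentum (GeV/c) and velocity: m = p * sqrt(1/beta^2 - 1).
// Returns -1 when beta >= 1, which can happen because of finite timing resolution
// (illustrative convention, assumed for this sketch).
inline double massFromTof(double momentumGeV, double trackLengthCm, double timeOfFlightPs)
{
  const double beta = betaFromTof(trackLengthCm, timeOfFlightPs);
  const double invBetaSquaredMinusOne = 1.0 / (beta * beta) - 1.0;
  if (invBetaSquaredMinusOne < 0.0) {
    return -1.0;
  }
  return momentumGeV * std::sqrt(invBetaSquaredMinusOne);
}
```

In practice the resolution of the time measurement limits how well species can be separated this way at high momentum, where $\beta$ approaches 1 for all of them.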

The particles are identified based on observables derived from the detector signals, using selection criteria that compare the measurements with theoretical expectations, such as the Bethe-Bloch formula in the case of TPC signals. The traditional approach, the so-called $n_{\sigma}$ method, expresses the deviation of each detector-specific observable from its expected value in units of standard deviations. If a particle has associated PID observables exceeding a certain number of standard deviations, it is rejected. For instance, if one intends to use both TPC and TOF signals, the PID selection would be defined as $\sqrt{n_{\sigma,\mathrm{TPC}}^{2}+n_{\sigma,\mathrm{TOF}}^{2}}<\Lambda$, where $\Lambda$ depends on the desired balance between purity and efficiency and is typically in the range of 2 to 3. Such an approach is justified when the separation between various particle species is sufficiently large. However, in reality, the characteristics of different particle species can overlap, and combining information from multiple detectors can become very complex.
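The combined TPC and TOF selection described above can be sketched in a few lines of C++. The function names and the way the expected value and resolution are passed in are assumptions made for this example; in a real analysis these quantities come from detector calibrations.

```cpp
#include <cmath>

// Number of standard deviations between a measured signal and the value
// expected for a given particle hypothesis.
inline double nSigma(double measured, double expected, double resolution)
{
  return (measured - expected) / resolution;
}

// Combined TPC+TOF selection: accept the track if
// sqrt(nSigmaTpc^2 + nSigmaTof^2) < maxNSigma (the Lambda of the text).
inline bool passesCombinedNSigmaCut(double nSigmaTpc, double nSigmaTof, double maxNSigma)
{
  return std::hypot(nSigmaTpc, nSigmaTof) < maxNSigma;
}
```

A smaller `maxNSigma` favors purity, a larger one favors efficiency, which is the trade-off mentioned in the text.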

2 PID with machine learning

A natural response to the difficulties with optimal particle selection is to use machine learning (ML) algorithms, particularly the Bayesian method and neural networks (NN). In this view, particle identification is a standard classification problem. Compared to a human analyzer, ML can utilize more input particle features and learn more complex relationships between the variables. The Bayesian approach [8] is available in the new ALICE software framework O$^2$, but its flexibility is limited. For this reason, we focused on neural networks, whose usage for particle identification was explored only for very specific analysis cases in other experiments [9, 10, 11].

2.1 Neural network approach

The simplest model, and also the starting point of our analysis, is a single feed-forward network trained and applied on Monte Carlo simulated data. We implemented a binary classifier, with one instance for each (anti)particle species and each combination of detector signals. The network outputs a single value normalized to the range $(0,1)$ by applying the logistic function $f(x)=\frac{1}{1+e^{-x}}$. The output value corresponds to the probability that the example belongs to a specific particle type, based on a selected set of detector measurements (TPC only, TPC+TOF, TPC+TOF+TRD). It is not possible to process different combinations of detector signals with a single network because the choice of detector set determines the size of the network input vector. The networks for each (anti)particle species and detector setup are trained independently. Initial results reported in Ref. [12] are promising: a simple network using combined information from the TPC and TOF detectors can improve efficiency compared to the traditional method without losing purity.
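As an illustration of such a binary classifier, the following C++ sketch evaluates a small feed-forward network with one hidden layer and a logistic output. The layer layout, the ReLU activation, and all names (`BinaryClassifier`, `dense`, `logistic`) are assumptions chosen for the example and do not describe the architecture actually used in PID ML.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic function mapping any real number into (0, 1).
inline double logistic(double x)
{
  return 1.0 / (1.0 + std::exp(-x));
}

// One fully connected layer: out[j] = bias[j] + sum_i weights[j][i] * in[i].
inline std::vector<double> dense(const std::vector<std::vector<double>>& weights,
                                 const std::vector<double>& bias,
                                 const std::vector<double>& in)
{
  assert(weights.size() == bias.size());
  std::vector<double> out(bias);
  for (std::size_t j = 0; j < weights.size(); ++j) {
    // The input size is fixed per network, hence one network per detector set.
    assert(weights[j].size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[j] += weights[j][i] * in[i];
    }
  }
  return out;
}

// Hypothetical binary classifier: ReLU hidden layer followed by a logistic output.
struct BinaryClassifier {
  std::vector<std::vector<double>> hiddenWeights;
  std::vector<double> hiddenBias;
  std::vector<double> outputWeights;
  double outputBias = 0.0;

  // Probability that the track belongs to the target particle species.
  double probability(const std::vector<double>& features) const
  {
    const std::vector<double> hidden = dense(hiddenWeights, hiddenBias, features);
    assert(hidden.size() == outputWeights.size());
    double logit = outputBias;
    for (std::size_t j = 0; j < hidden.size(); ++j) {
      logit += outputWeights[j] * std::max(0.0, hidden[j]);
    }
    return logistic(logit);
  }
};
```

A track would then be accepted when `probability(features)` exceeds a threshold chosen by the analyzer.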

Nevertheless, it must be considered that it is more difficult to estimate the systematic uncertainties of machine learning models. An example solution is to use dropout, a standard method for reducing overfitting in neural networks, to approximate Bayesian uncertainty as shown in Ref. [13].
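The idea of using dropout for uncertainty estimation can be sketched as follows: dropout stays active at inference time, the network is evaluated several times, and the spread of the outputs serves as an uncertainty estimate. This is a simplified C++ illustration under assumed names (`applyDropout`, `monteCarloDropout`, `Prediction`); it is not the procedure used in PID ML, and the stochastic forward pass is supplied by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

struct Prediction {
  double mean;   // average output over the stochastic passes
  double stdDev; // spread of the outputs, used as an uncertainty estimate
};

// Inverted dropout: zero each activation with probability dropRate (< 1)
// and rescale the surviving ones so that the expected value is unchanged.
std::vector<double> applyDropout(const std::vector<double>& activations,
                                 double dropRate, std::mt19937& rng)
{
  assert(dropRate >= 0.0 && dropRate < 1.0);
  std::bernoulli_distribution keep(1.0 - dropRate);
  std::vector<double> out(activations.size());
  for (std::size_t i = 0; i < activations.size(); ++i) {
    out[i] = keep(rng) ? activations[i] / (1.0 - dropRate) : 0.0;
  }
  return out;
}

// Run nPasses stochastic forward passes and summarize them.
Prediction monteCarloDropout(const std::function<double(std::mt19937&)>& stochasticForward,
                             int nPasses, unsigned seed)
{
  assert(nPasses > 0);
  std::mt19937 rng(seed);
  double sum = 0.0;
  double sumOfSquares = 0.0;
  for (int pass = 0; pass < nPasses; ++pass) {
    const double output = stochasticForward(rng);
    sum += output;
    sumOfSquares += output * output;
  }
  const double mean = sum / nPasses;
  // Clamp at zero to protect against tiny negative values from rounding.
  const double variance = std::max(0.0, sumOfSquares / nPasses - mean * mean);
  return Prediction{mean, std::sqrt(variance)};
}
```

The caller's `stochasticForward` would apply `applyDropout` to the hidden activations of the classifier before computing its output.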

2.2 Integration of PID ML with the O$^2$ framework

The ALICE analysis framework, O$^2$ [14], is written in C++ and heavily utilizes the ROOT [15] library, while most popular machine learning frameworks are implemented in Python. Therefore, we needed to build a universal interface between Python-based machine learning projects and the C++-based analysis framework. Our solution makes use of the ONNX (Open Neural Network Exchange) standard [16], which defines a common file format for storing machine learning models developed in various frameworks such as TensorFlow [17] and PyTorch [18]. The ONNXRuntime [19] can then be used for ONNX model inference in C++.

[Figure 2: The current PID ML workflow in O$^2$.]

The current PID ML workflow in O$^2$ is presented in Figure 2. The processing starts from the analysis input data: reconstructed collisions and tracks in AO2D files. For training, simulated Monte Carlo data is used as it contains particle species labels. The PID ML producer task filters the input with a few rough preselections to exclude meaningless tracks and produces skimmed data that contains only the track properties used by a neural network. The skimmed data is used for model training. Trained models are stored in the ONNX format in the Condition and Calibration Data Base (CCDB), which is accessible by analyses running on the worldwide LHC computing GRID. In an analysis task, a PID ML model for a single particle and detector setting is represented by an instance of the PidOnnxModel class. In more complex use cases, one can use PidOnnxModelInterface instead of dealing with a collection of PidOnnxModel instances. PidOnnxModelInterface provides convenient management of multiple ML models, each with a different target particle species, detector setup, and acceptance threshold.

Utilization of ONNXRuntime comes at the cost of creating additional memory copies. The input data is stored on disk in the ROOT file format and read as Apache Arrow tables. Unfortunately, no direct conversion exists between Arrow tables and ONNXRuntime tensors. Therefore, the data of interest needs to be copied into C++ STL vectors and then converted to a tensor. This requires hardcoding of inputs for each ML model, which results in substantial code repetition in all analyses using ONNXRuntime and prevents the full development of a universal ONNXRuntime interface in O$^2$. The aforementioned problems are visible in the inference example in Listing 1.

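// Listing 1: inference with ONNXRuntime, with hardcoded model inputs.
// Copy the track properties of interest into a contiguous STL vector.
std::vector<float> inputValues{track.px(), track.py(), track.pz(), ...};

// The TOF signal is appended only for detector setups that include TOF.
if (mDetector >= kTPCTOF) {
  inputValues.push_back(scaledTOFSignal);
}

// Wrap the copied values in an ONNXRuntime tensor and run the model.
std::vector<Ort::Value> inputTensors;
inputTensors.emplace_back(Ort::Experimental::Value::CreateTensor<float>(inputValues.data(), inputValues.size(), input_shape));
auto outputTensors = mSession->Run(mInputNames, inputTensors, mOutputNames);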

There exists at least one alternative to ONNXRuntime: SOFIE, part of ROOT TMVA. It is being developed as part of the ROOT library and enables inference of neural networks saved in the ONNX file format. SOFIE is naturally suited to the ROOT file format, which makes it easier to adapt to the ROOT-Arrow O$^2$ data pipeline. Nevertheless, SOFIE is still in the development stage. The framework lacks the implementation of many advanced neural network operators as well as the implementation of other ML models, such as the Boosted Decision Trees used by various groups in ALICE.

Therefore, the main effort is still organized around a more analysis-friendly adaptation of ONNXRuntime. Once a direct, preferably zero-copy Arrow-ONNXRuntime conversion is developed, it will be possible to use the so-called IO Binding to map inputs between data and an ML model. It will also enable encoding model inputs as, for example, strings in a JSON configuration file. This will result in universal, analysis-independent inference code as depicted in Listing 2.

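// Listing 2: envisioned analysis-independent inference with IO Binding.
// It relies on the direct Arrow-ONNXRuntime conversion that is not yet available.
// The model input names are read from a JSON configuration file.
std::vector<std::string> inputNamesFromJson = readJson(jsonFileName);

// Bind each model input directly to the Arrow column of the same name.
for (const std::string& name : inputNamesFromJson) {
  mBinding->BindInput(name, table.asArrowTable()->GetColumnByName(name));
}

// Let ONNXRuntime allocate the model output in CPU memory.
Ort::MemoryInfo outputMemInfo{"Cpu", OrtDeviceAllocator, 0, OrtMemTypeDefault};
mBinding->BindOutput("output", outputMemInfo);

mSession->Run(Ort::RunOptions(), *mBinding);
auto outputTensors = mBinding->GetOutputValues();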

Presently, the same integration design as in PID ML, with ONNXRuntime and a C++ class wrapping the management of a single ONNX model, has been applied in other machine learning projects in O$^2$. A common interface that further unifies the various implementations of the ONNX integration was developed based on PidOnnxModel. It is already used for supporting calculations for TPC PID and for the selection of $\Lambda_{c}^{+}$ in the $\mathrm{pK}^{-}\pi^{+}$ decay channel. A specialization of the unified interface is being propagated to all heavy-flavor analyses. All of these developments encountered the same difficulties with input hardcoding and code repetition, and they would benefit from the code simplification brought by a direct Arrow-ONNXRuntime conversion.

3 Feature Set Embedding and the attention mechanism

Even though analyzers can choose their own detector set, this does not ensure that all input samples contain all the needed detector signals. Particle properties are measured independently by each ALICE detector. Therefore, a particle can be recorded by a subset of detectors while not being measured by others. The reasons can be random, like a detector malfunctioning or being switched off, or systematic, like particle properties falling outside the detector acceptance region, e.g., a transverse momentum, $p_{\mathrm{T}}$, too low to reach outer detectors like TOF and TRD.

One can simply remove all incomplete samples from the input. However, this does not allow the identification of samples with missing information, which can form the majority of data. Another method is imputation, which introduces artificial bias to the initial dataset, which can disturb the predictions of the ML algorithm. It is also possible to alter the neural network architecture. For example, one can use a neural network ensemble, a set of classifiers, one per each subset of the training dataset without missing data. The major drawback of this approach is computational complexity, especially with a growing set of attributes with missing values.

To overcome the aforementioned shortcomings, in Ref. [21], we introduce a novel method based on the attention mechanism, similar to the method introduced for a medical use-case in AMI-Net [22]. The system overview is shown in Figure 3.

[Figure 3: Overview of the proposed attention-based system.]

The first module is based on the Feature Set Embedding strategy proposed in Ref. [23]. Input data samples are encoded into a set of feature-value pairs. Each pair represents a non-missing value in the input sample and a one-hot encoded index of the feature corresponding to this value. Then, a neural network with a single hidden layer computes an embedding for each feature-value pair. The embeddings place similar features close together in the embedded space.
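The encoding of a sample into feature-value pairs can be sketched in C++ as follows. Marking missing measurements with NaN and the names `FeatureValuePair` and `encodeSample` are assumptions made for this example; the embedding network that processes each pair is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One element of the set: an observed value together with the one-hot
// encoded index of the feature it belongs to.
struct FeatureValuePair {
  double value;
  std::vector<double> oneHotIndex;
};

// Encode a sample into feature-value pairs, skipping missing measurements.
// Missing values are marked with NaN (a convention assumed for this sketch).
std::vector<FeatureValuePair> encodeSample(const std::vector<double>& sample)
{
  std::vector<FeatureValuePair> pairs;
  for (std::size_t i = 0; i < sample.size(); ++i) {
    if (std::isnan(sample[i])) {
      continue; // the feature was not measured for this track
    }
    FeatureValuePair pair{sample[i], std::vector<double>(sample.size(), 0.0)};
    pair.oneHotIndex[i] = 1.0;
    pairs.push_back(pair);
  }
  return pairs;
}
```

The number of pairs varies from sample to sample, which is why the following modules must accept a variable-size set of vectors.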

The Transformer [24] encoding module connects different features represented by a set of embedding vectors and finds input patterns. For example, a detector signal has meaning only if the momentum is within a particular range. The softmax function is applied to the attention output, a variable-size set of vectors. An additional self-attention layer is used to merge these vectors into one. Finally, the pooled vector is processed by the simple classifier described in Section 2.1.
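How a variable-size set of vectors can be merged into a single fixed-size vector is illustrated by the simplified attention pooling below. It scores each vector with a single query vector, normalizes the scores with softmax, and returns the weighted sum. This is a generic sketch with assumed names (`softmax`, `attentionPool`, `query`), not the self-attention layer of the model in Ref. [21].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Numerically stable softmax over a non-empty list of scores.
std::vector<double> softmax(const std::vector<double>& scores)
{
  assert(!scores.empty());
  const double maxScore = *std::max_element(scores.begin(), scores.end());
  std::vector<double> weights(scores.size());
  double norm = 0.0;
  for (std::size_t k = 0; k < scores.size(); ++k) {
    weights[k] = std::exp(scores[k] - maxScore);
    norm += weights[k];
  }
  for (double& weight : weights) {
    weight /= norm;
  }
  return weights;
}

// Attention pooling: the score of each vector is its dot product with `query`
// (a learned parameter in a real model); the result is the softmax-weighted sum.
std::vector<double> attentionPool(const std::vector<std::vector<double>>& vectors,
                                  const std::vector<double>& query)
{
  assert(!vectors.empty());
  std::vector<double> scores;
  for (const std::vector<double>& vec : vectors) {
    assert(vec.size() == query.size());
    scores.push_back(std::inner_product(vec.begin(), vec.end(), query.begin(), 0.0));
  }
  const std::vector<double> weights = softmax(scores);
  std::vector<double> pooled(query.size(), 0.0);
  for (std::size_t k = 0; k < vectors.size(); ++k) {
    for (std::size_t d = 0; d < pooled.size(); ++d) {
      pooled[d] += weights[k] * vectors[k][d];
    }
  }
  return pooled;
}
```

The output has the same dimension regardless of how many vectors are in the set, so a classifier with a fixed input size can follow.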

The attention architecture was tested on data from a Monte Carlo simulation of proton–proton collisions at $\sqrt{s}=13$ TeV with a realistic simulation of the time evolution of the detector conditions in the LHC Run 2 data-taking period. The simulation was performed with Pythia8 [25], the Geant4 [26] particle transport model, and general-purpose settings. The six most abundant particle species were considered for comparison: pions, kaons, protons, and their antiparticles.

The results reported in Ref. [21] clearly show that machine learning algorithms easily outperform the standard method as measured by the $F_{1}$ metric. The proposed attention architecture achieves very high scores of $F_{1}$, precision (purity), and recall (efficiency), comparable with other analyzed ML models. At the same time, our model avoids the flaws of other solutions: artificial bias in imputed and case-deleted data and the potentially larger complexity of the neural network ensemble.
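For reference, precision, recall, and $F_{1}$ follow the standard definitions and can be computed from the counts of true positives, false positives, and false negatives. The C++ sketch below uses assumed names (`ConfusionCounts`, `precision`, `recall`, `f1Score`) and returns zero when a denominator vanishes, which is a convention chosen for this example.

```cpp
// Counts for one particle species treated as the positive class.
struct ConfusionCounts {
  long truePositives;
  long falsePositives;
  long falseNegatives;
};

// Precision (purity): fraction of selected tracks that are truly the target species.
inline double precision(const ConfusionCounts& c)
{
  const long selected = c.truePositives + c.falsePositives;
  return selected > 0 ? static_cast<double>(c.truePositives) / selected : 0.0;
}

// Recall (efficiency): fraction of true target tracks that were selected.
inline double recall(const ConfusionCounts& c)
{
  const long actual = c.truePositives + c.falseNegatives;
  return actual > 0 ? static_cast<double>(c.truePositives) / actual : 0.0;
}

// F1: harmonic mean of precision and recall.
inline double f1Score(const ConfusionCounts& c)
{
  const double p = precision(c);
  const double r = recall(c);
  return (p + r) > 0.0 ? 2.0 * p * r / (p + r) : 0.0;
}
```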

4 Domain Adversarial Neural Networks

Particle identification is used to discriminate particle species in both real experimental data and Monte Carlo simulations. In particular, machine learning techniques presented in this article learn on labeled simulated data but can also be applied to unlabeled experimental data. However, the simulations often result in distributions of particle features shifted as compared to values registered at the experiment. To mitigate this effect, standard PID methods utilize automated processes for data domain alignment. For example, ALICE implemented a tuning method of simulated signals, which shifts back simulated distributions to reproduce, on average, the collected data distributions of selected variables.

Naturally, this is a limited solution, which does not allow for full domain alignment of all particle features. Therefore, we will make use of a known machine learning technique called domain adaptation, which aims to learn the discrepancies between two data domains, the labeled source and the unlabeled target, and translate those to a single hyperspace. The desired classifier is trained and applied to features from the combined latent space. Since the classifier works independently of the initial data domains, it achieves similar performance on both the labeled and unlabeled data (simulated and experimental data in our case).

In the world of neural networks, the Domain Adversarial Neural Network (DANN) [27] is the realization of the domain adaptation technique. As depicted in Figure 4, DANN is composed of three neural networks. The feature mapping module maps the original input into domain-invariant features, which are provided to the particle classifier that outputs the particle type. At the same time, the domain classifier enforces domain invariance of the extracted features through adversarial training.

[Figure 4: Architecture of the Domain Adversarial Neural Network.]

Adversarial training requires DANN training to be split into two steps. First, the domain classifier takes current features from the feature mapping network and assigns them domain labels, whether the data comes from a real or a simulation source. Then, the weights of the domain classifier are frozen, and the particle classifier and the feature mapper learn jointly to predict accurate particle species. The weights of the particle classifier and the domain classifier are updated with respective gradients, while the feature mapper weights are updated with a gradient from the particle classifier and a reversed gradient from the domain classifier. As a result, the trained model maximizes particle identification scores while minimizing domain classifier scores to ensure that the domains are hardly distinguishable for the particle classifier. Overall, the training procedure is more complex than in a simple neural network, but the application performance is similar and depends on the complexity of the particle classifier and the feature mapper.
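The weight updates described above can be written compactly in the general form of domain-adversarial training from Ref. [27]. The notation is an assumption introduced here for illustration: $\theta_f$, $\theta_y$, and $\theta_d$ are the weights of the feature mapper, the particle classifier, and the domain classifier; $L_y$ and $L_d$ are the particle and domain classification losses; $\mu$ is the learning rate; and $\lambda \geq 0$ is the weight of the reversed gradient.

```latex
\begin{align}
\theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f} \right), \\
\theta_y &\leftarrow \theta_y - \mu \frac{\partial L_y}{\partial \theta_y}, \\
\theta_d &\leftarrow \theta_d - \mu \frac{\partial L_d}{\partial \theta_d}.
\end{align}
```

The minus sign in front of $\lambda$ is the gradient reversal: the feature mapper is pushed toward features that make the domain classifier perform worse, while the two classifiers follow their own gradients as usual.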

Domain adaptation is widely used in natural language processing [28, 29] and computer vision [30, 31]. In high-energy physics, the author of Ref. [32] presents how this method can improve the quality of automatic jet tagging on real experimental data. The initial tests with DANN [12] show that this technique improves the classification of particles in experimental data.

5 Conclusions and outlook

Using machine learning for particle identification can result in higher purity and efficiency than standard methods. The current priority is to test PID ML in a real-world analysis of simulation data from the present Run 3 data-taking period. Afterward, we will extend the attention model with domain adaptation and test it on the new real data. We will check what levels of disagreement between experimental and Monte Carlo data DANN can handle, and we will estimate model uncertainties. Finally, we will be able to achieve regular production of models for Run 3.

In our project, we also solved the technical problem of integrating Python machine learning projects with the C++ ALICE software. Once a convenient data format conversion into ONNXRuntime tensors is developed, O$^2$ ML projects will get an even more unified, user-friendly interface.

Acknowledgments

We would like to thank the ALICE Collaboration for guidance and support during our research as well as for the access to all software and data.

This work was supported by the Polish National Science Centre under agreements no. 2021/43/D/ST2/02214 and UMO-2022/45/B/ST2/02029, by the Polish Ministry for Education and Science under agreements no. 2022/WK/01 and 5236/CERN/2022/0, as well as by the IDUB-POB-FWEiTE-2 project granted by Warsaw University of Technology under the program Excellence Initiative: Research University (ID-UB).

References

  • [1] ALICE collaboration, The ALICE experiment at the CERN LHC, JINST 3 (2008) S08002.
  • [2] L. Evans and P. Bryant, LHC Machine, JINST 3 (2008) S08001.
  • [3] ALICE collaboration, The ALICE experiment – A journey through QCD, arXiv:2211.04384 [nucl-ex] (2022).
  • [4] A. Tauro, “ALICE Schematics.” 2017.
  • [5] ALICE collaboration, ALICE Time Projection Chamber: Technical Design Report, Technical design report. ALICE, CERN, Geneva (2000).
  • [6] ALICE collaboration, ALICE Time-Of-Flight system (TOF): Technical Design Report, Technical design report. ALICE, CERN, Geneva (2000).
  • [7] ALICE collaboration, ALICE Transition-Radiation Detector: Technical Design Report, Technical design report. ALICE, CERN, Geneva (2001).
  • [8] ALICE collaboration, Particle identification in ALICE: a Bayesian approach, European Physical Journal Plus 131 (2016) 168 [1602.01392].
  • [9] LHCb collaboration, LHCb detector performance, International Journal of Modern Physics A 30 (2015) 1530022.
  • [10] J. Collado et al., Learning to identify electrons, Physical Review D 103 (2021) 116028.
  • [11] CMS collaboration, Identification of hadronic tau lepton decays using a deep neural network, Journal of Instrumentation 17 (2022) P07023.
  • [12] Ł. K. Graczykowski, M. Jakubowska, K. R. Deja and M. Karwowska, on behalf of the ALICE collaboration, Using machine learning for particle identification in ALICE, Journal of Instrumentation 17 (2022) C07016.
  • [13] Y. Gal and Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in International Conference on Machine Learning, pp. 1050–1059, PMLR, 2016.
  • [14] AliceO2Group, “O2 analysis framework documentation.” 2024, https://aliceo2group.github.io/analysis-framework/ (Accessed: 15 January 2024).
  • [15] R. Brun and F. Rademakers, ROOT – an object oriented data analysis framework, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389 (1997) 81.
  • [16] ONNX Community, “ONNX.” 2024, https://onnx.ai/ (Accessed: 15 January 2024).
  • [17] M. Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
  • [18] A. Paszke et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, in Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035 (2019).
  • [19] ONNXRuntime Community, “ONNXRuntime.” 2024, https://onnxruntime.ai/ (Accessed: 15 January 2024).
  • [20] A. Alkin, G. Eulisse, J. F. Grosse-Oetringhaus, P. Hristov and M. Karwowska, ALICE Run 3 Analysis Framework, EPJ Web Conf. 251 (2021) 03063.
  • [21] M. Kasak, K. Deja, M. Karwowska, M. Jakubowska, Ł. Graczykowski and M. Janik, Machine-learning-based particle identification with missing data, 2023.
  • [22] Z. Wang et al., Attention-based multi-instance neural network for medical diagnosis from incomplete and low quality data, in 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2019.
  • [23] D. Grangier and I. Melvin, Feature set embedding for incomplete data, Advances in Neural Information Processing Systems 23 (2010).
  • [24] A. Vaswani et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
  • [25] T. Sjöstrand et al., An introduction to PYTHIA 8.2, Computer Physics Communications 191 (2015) 159 [1410.3012].
  • [26] R. Brun et al., GEANT Detector Description and Simulation Tool, Tech. Rep. CERN-W5013, CERN-W-5013, W5013, W-5013, CERN (1994).
  • [27] Y. Ganin et al., Domain-adversarial training of neural networks, The Journal of Machine Learning Research 17 (2016) 2096.
  • [28] J. Blitzer, M. Dredze and F. Pereira, Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447, 2007.
  • [29] X. Glorot, A. Bordes and Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in ICML, 2011.
  • [30] R. Gopalan, R. Li and R. Chellappa, Domain adaptation for object recognition: An unsupervised approach, in 2011 International Conference on Computer Vision, pp. 999–1006, IEEE, 2011.
  • [31] B. Fernando et al., Unsupervised visual domain adaptation using subspace alignment, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2960–2967, 2013.
  • [32] D. Walter, Domain Adaptation Studies in Deep Neural Networks for Heavy-Flavor Jet Identification Algorithms with the CMS Experiment, Master thesis (2018).