Source: Daniel Whiteson, daniel '@' uci.edu, Assistant Professor, Physics & Astronomy, Univ. of California Irvine

Data Set Information:

Machine learning is used in high-energy physics experiments to search for the signatures of exotic particles. These signatures are learned from Monte Carlo simulations of the collisions that produce the particles and their decay products. In each of the three data sets here, the goal is to separate particle-producing collisions from a background source. Because the mass of the new particle is unknown, three separate data sets are provided. In each data set, 50% of the examples come from the signal process and 50% from the background process. Each data set is split into a training set of 7 million examples and a test set of 3.5 million.

1) In the '1000' dataset, the signal particle has mass=1000. (Note: this dataset does not include a mass feature, since all signal examples have the same mass.)
2) In the 'not1000' dataset, the signal particle's mass is drawn uniformly from the set {500, 750, 1250, 1500}. The mass is included as an input feature; for background examples, the mass is selected randomly from this same set.
3) In the 'all' dataset, the signal particle's mass is drawn uniformly from the set {500, 750, 1000, 1250, 1500}. The mass is included as an input feature; for background examples, the mass is selected randomly from this same set.

Attribute Information:

The first column is the class label (1 for signal, 0 for background), followed by the 27 normalized features (22 low-level features, then 5 high-level features), and a 28th mass feature for datasets 2 and 3. See the original paper for more detailed information. There is a header line in each file.
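The column layout above (label, then low-level features, then high-level features, then the optional mass column) can be sketched as follows. This is a minimal example using synthetic stand-in rows with the same shape as the 'all' dataset; the real files would instead be read with something like np.loadtxt("all_train.csv", delimiter=",", skiprows=1), where the filename is an assumption.

```python
import numpy as np

# Column layout for the 'all' dataset (label + 28 features):
#   column 0:      class label (1 = signal, 0 = background)
#   columns 1-27:  normalized features (22 low-level, then 5 high-level)
#   column 28:     mass feature (datasets 2 and 3 only)
N_LOW, N_HIGH = 22, 5

# Hypothetical stand-in rows, just to illustrate the slicing.
rng = np.random.default_rng(0)
n = 10
labels = rng.integers(0, 2, size=(n, 1)).astype(float)
features = rng.normal(size=(n, N_LOW + N_HIGH))
mass = rng.choice([500.0, 750.0, 1000.0, 1250.0, 1500.0], size=(n, 1))
data = np.hstack([labels, features, mass])

y = data[:, 0]                                   # class label
X_low = data[:, 1:1 + N_LOW]                     # 22 low-level features
X_high = data[:, 1 + N_LOW:1 + N_LOW + N_HIGH]   # 5 high-level features
m = data[:, -1]                                  # mass (datasets 2 and 3)

print(X_low.shape, X_high.shape)  # → (10, 22) (10, 5)
```

For the '1000' dataset, the final mass column would simply be absent, and the last feature slice ends the row.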
Features are ordered as follows, with the 22 low-level (LL) features followed by the 5 high-level (HL) features:

LL: lepton pt, eta, phi [3]; MET magnitude, phi [2]; number of jets [1]; jet {1,2,3,4} pt, eta, phi, b-tag [16]
HL: m_jj, m_jjj, m_lv, m_jlv, m_wwbb [5]

Note that the data has been pre-processed by taking log(x + 10**-5) of the following features: 0, 3, 5, 6, 10, 14, 18, 22, 23, 24, 25, 26. Each feature was then centered and divided by the mean.

The unprocessed simulation output used to construct the dataset is contained in ./simulation_output/. For the signal (xttbar) events, the mass is given in the filename. Two identical simulation runs were performed on aug4 and on aug17/aug24 and are given in separate files, but the data from the two runs are combined.

Relevant Papers:

Pierre Baldi, Kyle Cranmer, Taylor Faucett, Peter Sadowski, and Daniel Whiteson. 'Parameterized Machine Learning for High-Energy Physics.' The European Physical Journal C, 76(5), 1-7, 2016. Also available from the UCI ML repository at: https://archive.ics.uci.edu/ml/datasets/HEPMASS
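The preprocessing described above (log(x + 10**-5) on selected feature indices, then normalization) can be reproduced approximately as below. This is a sketch under one assumption: "centered and divided by the mean" is read here as standard-score scaling (subtract the mean, divide by the standard deviation), since dividing already-centered data by its own mean would be ill-defined; that interpretation is not confirmed by the description.

```python
import numpy as np

# 0-based indices (within the 27 features) that the description says
# were transformed with log(x + 1e-5) before normalization.
LOG_COLS = [0, 3, 5, 6, 10, 14, 18, 22, 23, 24, 25, 26]

def preprocess(X_raw):
    """Approximate the described preprocessing for a (n_samples, 27)
    feature matrix. ASSUMPTION: normalization is interpreted as
    standardization (center, then divide by the standard deviation)."""
    X = X_raw.astype(float).copy()
    X[:, LOG_COLS] = np.log(X[:, LOG_COLS] + 1e-5)
    X -= X.mean(axis=0)   # center each feature
    X /= X.std(axis=0)    # scale each feature to unit spread
    return X

# Synthetic positive-valued features, standing in for raw simulation output.
rng = np.random.default_rng(1)
X_raw = rng.uniform(0.1, 5.0, size=(100, 27))
X = preprocess(X_raw)
print(np.allclose(X.mean(axis=0), 0.0))  # → True
```

When applying this to the files in ./simulation_output/, the transform statistics (means and scales) would be computed on the training split only and reused on the test split.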