advancement of the atlas deep learning based …dl1 c-jet probability 10−6 10−5 10−4 10−3...
TRANSCRIPT
Manuel Guth (Uni Freiburg, CEA Saclay), 27.02.2019[1] “Optimisation and performance studies of the ATLAS b-tagging algorithms for the 2017-18 LHC run” - ATL-PHYS-PUB-2017-013
[2] https://github.com/lwtnn/lwtnn
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1DL1 c-jet probability
6−10
5−10
4−10
3−10
2−10
1−10
1
Nor
mal
ised
b-jetsc-jetslight flavour jets
ATLAS Simulation Preliminaryt= 13 TeV, ts
|<2.5η>20 GeV and |T
p
Advancement of the ATLAS Deep Learning Based High Level Flavour
Tagger DL1 -LHCC Poster Session -
q
t̄ q0
g
b̄
W+
W�
q̄
q̄0
t
b
b
t̄q0
g`+
b
b̄
H0
W�
q̄
b̄
W+
tt
⌫
beyond SM
Why b-Tagging?
• Several interesting physics processes have b-quarks in their final state
• Or a veto on b-quarks can suppress the background
t̄
q
⌫̄e
g
b
b̄
b
b̄
W�
e�
q̄0
g
W+
H
t
• Heavy-flavour tagging important tool for physics analyses• Precision measurements• Search for new physics
High-level Taggers
d0
secondaryvertex
primary vertex
tertiaryvertex
b-jet
light-jets
b-hadron
impactparameter
tracks
c-hadron
Impact Parameter based
(IP2D,IP3D)
Inclusive secondary vertex (SV1)
NN based on track parameters (RNNIP)
Semi-leptonic decays to muons (SMT)
Baseline Taggers
Boosted Decision Tree
(MV2)
Deep Learning(DL1)
Topological secondary & tertiary vertices (JetFitter)
b-Tagging Structure in ATLAS
• Baseline taggers deploy specific heavy flavour jet properties• Long lifetime (∼1.5 ps ➔ ∼3mm track in detector)• High mass (∼5 GeV)• High decay product multiplicity• b-hadron decays to a c-hadron (|Vcb|>>|Vub|)
• High-level taggers (MV2 & DL1 [1]) combine these information (40-50 variables)
Network Output
input node
input node
input node
input node
input node
input node
…
bias
…
bias
…
bias
… …
bias
b node
c node
light- flavour node
hidden layers output layerinput layer
pb
pc
plight
Deep Neural Network Architecture
pre-processing
feature extraction/
feature selection
machine learning
evaluation & model
post processing
DL1score = lnpb
fc ⋅ pc + (1 − fc) ⋅ plight-flavour
• Deep neural network requires also• Preprocessing, feature selection
etc.• Network with fully connected layers• Multi-class output ➔ allows also c-
tagging
• Using Keras (2.2.4) framework with tensorflow backend
• Full training procedure relies on HDF5• Application can be run in ATLAS
reconstruction software, relying on the LWTNN C++ interface [2]
Training Samples
[GeV]T
jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000
T1/N
dN
/dp
6−10
5−10
4−10
3−10
2−10
1−10ATLAS Simulation Preliminary
=13 TeVslight−flavour jet
tt
Z’
(a)
[GeV]T
jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000
T1/N
dN
/dp
6−10
5−10
4−10
3−10
2−10
ATLAS Simulation Preliminary
=13 TeVsb-jet
tt
Z’
(b)
Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour
(a) and b-jets (b).
The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z
0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.
[GeV]T
b-hadron p
200 400 600 800 1000 1200 1400
[G
eV
]T
b-je
t p
200
400
600
800
1000
1200
1400
7−10
6−10
5−10
4−10
3−10
2−10
T = 0.8*b-jet pT b-hadron p
ATLAS Simulation Preliminary
t=13 TeV, ts
(a)
[GeV]T
b-hadron p
200 400 600 800 1000 1200 1400
[G
eV
]T
b-je
t p
200
400
600
800
1000
1200
1400
4−10
3−10
T = 0.8*b-jet pT b-hadron p
ATLAS Simulation Preliminary
=13 TeV, Z’s
(b)
Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0
(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.
A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z
0 events, to probe the high pT regime. This new sample, referred to as the
22
[GeV]T
jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000
T1/N
dN
/dp
6−10
5−10
4−10
3−10
2−10
1−10ATLAS Simulation Preliminary
=13 TeVslight−flavour jet
tt
Z’
(a)
[GeV]T
jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000
T1/N
dN
/dp
6−10
5−10
4−10
3−10
2−10
ATLAS Simulation Preliminary
=13 TeVsb-jet
tt
Z’
(b)
Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour
(a) and b-jets (b).
The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z
0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.
[GeV]T
b-hadron p
200 400 600 800 1000 1200 1400
[G
eV
]T
b-je
t p
200
400
600
800
1000
1200
1400
7−10
6−10
5−10
4−10
3−10
2−10
T = 0.8*b-jet pT b-hadron p
ATLAS Simulation Preliminary
t=13 TeV, ts
(a)
[GeV]T
b-hadron p
200 400 600 800 1000 1200 1400
[G
eV
]T
b-je
t p
200
400
600
800
1000
1200
1400
4−10
3−10
T = 0.8*b-jet pT b-hadron p
ATLAS Simulation Preliminary
=13 TeV, Z’s
(b)
Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0
(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.
A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z
0 events, to probe the high pT regime. This new sample, referred to as the
22
t t̄• Using hybrid sample composed of and Z’ events
➔More statistics in higher pT region• Downsampling approach applied to match pT
and |η| distributions for all 3 categories • Ensure independence of tagging from
kinematics• Different hybrid compositions tested
ATL-PHYS-PUB-2017-013
hybrid sample in the following, is obtained by including b-jets from tt̄ if the corresponding b-hadronpT <250 GeV and from the Z
0 sample if the b-hadron pT >250 GeV. For c- and light-flavour jets, the samemixing strategy is applied, moving from tt̄ events to Z
0 events for values of the jet pT above 250 GeV.The full statistics of the tt̄ (5M events) and the Z
0 (3M events) samples is used for the training of themultivariate algorithms. Figure 16 shows the resulting pT distribution of b-jets in this training sample.
[GeV]T
b−jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000
T1
/N d
N/d
p
7−10
6−10
5−10
4−10
3−10
2−10
ATLAS Simulation Preliminary
=13 TeVs
Figure 16: b-jet pT distribution of the hybrid sample used for the training of the multivariate-based taggers.
4.2 MV2 tagger
The first high-level tagger is a BDT discriminant that combines the output of the low-level taggersdescribed in Section 3. The BDT algorithm is trained using the ROOT Toolkit for Multivariate DataAnalysis (TMVA) [31] on the hybrid sample. b-jets are considered as signal, c- and light-flavour jets asbackground. The performance is evaluated separately on the tt̄ and Z
0 samples. The list of input variablesalso includes the pT and ⌘ of the jets, as they give useful information in interpreting the separation powerof the variables from the low-level taggers.
The b- and c-jets are reweighted in pT and ⌘ to match the spectrum of the light-flavour jets. This procedureavoids di�erences in the kinematic distributions of signal and background that can be interpreted as dis-criminating by the training. The complete list of variables used for the training, as well as the features ofthe algorithm and the optimization of the training parameters are discussed in Ref. [5]. The c-jet fractionin the training is set to 7% and that of light-flavoured jets to 93%. These values achieve a suitable balancebetween light-flavour and c-jet rejection.
Given the availability of new low-level taggers, several variants of the MV2 taggers have been developed:
• a reference option, based on the standard impact parameter (IP2D and IP3D) and secondary vertex-based (SV1 and JetFitter) input variables is adopted for the 2016 data-analysis in Ref. [5] (MV2);
• a new option, including in addition the SMT (MV2Mu);
• a full option with the SMT and the RNNIP inputs in addition to the standard variables (MV2MuRnn).
The use of the full set of input variables from the SMT algorithm and the output of the SMT multivariateBDT discriminant have both been tested in the MV2 training. The MV2 BDT configuration is significantlysimplified by using a single output variable from SMT instead of six inputs. Furthermore, it is found that
23
ATL-PHYS-PUB-2017-013
Hyper Parameter (HP) Optimisation with GRID GPUs• Using docker image (built by Gitlab CI)
for jobs• Configurable amount of HP
combinations (configs)• Using panda job splitting to distribute
HP configurations over sites / jobs• Interactive development on JupyterHub
deployed with Kubernetes
training & validationdataset (hdf5)
panda (prun)
training script(python)
provides Docker image
GRIDGPUs rucio
CONTAINER withconfig files (json)
containing HP information
training results(json)
Workflow0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
DL1 b-jet probability
4−10
3−10
2−10
1−10
1
Nor
mal
ised
b-jetsc-jetslight flavour jets
ATLAS Simulation Preliminaryt= 13 TeV, ts
|<2.5η>20 GeV and |T
p
4 4.5 5 5.5 6 6.5 7c-jet rejection
20
40
60
80
100
120
140
160
180
light
flav
our j
et re
ject
ion
ATLAS Simulation Preliminaryt= 13 TeV, ts
|<2.5η>20 GeV and |T
p
MV2c10rnn performanceHP optimisation, best resultHP optimisation, medium resultHP optimisation, worst result
77% b-tagging efficiency
Optimisation Results• 800 combinations over 5 HP dimensions tested• Optimisation provides promising results• Confirms relation between ROC curve and validation loss
units in 3rd layer
256
512
units in 4th layer
42
48
units in 5th layer
36
42
activation function
tanh
relu
batch size
1000
1800
2600
3400
4200
5000
learning rate validation loss
0.0001
0.0021
0.0041
0.006
0.008
0.01
0.66
0.68
0.69
0.7
0.72
0.73ATLAS Simulation Preliminary DL1 Hyper parameter optimisation
• Each jet gets probability for being a b-, c- or light flavour jet
interactive development
laptop laptop GRID