advancement of the atlas deep learning based …dl1 c-jet probability 10−6 10−5 10−4 10−3...

Manuel Guth (Uni Freiburg, CEA Saclay), 27.02.2019[1] “Optimisation and performance studies of the ATLAS b-tagging algorithms for the 2017-18 LHC run” - ATL-PHYS-PUB-2017-013

[2] https://github.com/lwtnn/lwtnn

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1DL1 c-jet probability

6−10

5−10

4−10

3−10

2−10

1−10

1

Nor

mal

ised

b-jetsc-jetslight flavour jets

ATLAS Simulation Preliminaryt= 13 TeV, ts

|<2.5η>20 GeV and |T

p

Advancement of the ATLAS Deep Learning Based High Level Flavour

Tagger DL1 -LHCC Poster Session -

q

t̄ q0

g

b̄

W+

W�

q̄

q̄0

t

b

b

t̄q0

g`+

b

b̄

H0

W�

q̄

b̄

W+

tt

⌫

beyond SM

Why b-Tagging?

• Several interesting physics processes have b-quarks in their final state

• Or a veto on b-quarks can suppress the background

t̄

q

⌫̄e

g

b

b̄

b

b̄

W�

e�

q̄0

g

W+

H

t

• Heavy-flavour tagging important tool for physics analyses• Precision measurements• Search for new physics

High-level Taggers

d0

secondaryvertex

primary vertex

tertiaryvertex

b-jet

light-jets

b-hadron

impactparameter

tracks

c-hadron

Impact Parameter based

(IP2D,IP3D)

Inclusive secondary vertex (SV1)

NN based on track parameters (RNNIP)

Semi-leptonic decays to muons (SMT)

Baseline Taggers

Boosted Decision Tree

(MV2)

Deep Learning(DL1)

Topological secondary & tertiary vertices (JetFitter)

b-Tagging Structure in ATLAS

• Baseline taggers deploy specific heavy flavour jet properties• Long lifetime (∼1.5 ps ➔ ∼3mm track in detector)• High mass (∼5 GeV)• High decay product multiplicity• b-hadron decays to a c-hadron (|Vcb|>>|Vub|)

• High-level taggers (MV2 & DL1 [1]) combine these information (40-50 variables)

Network Output

input node

input node

input node

input node

input node

input node

…

bias

…

bias

…

bias

… …

bias

b node

c node

light- flavour node

hidden layers output layerinput layer

pb

pc

plight

Deep Neural Network Architecture

pre-processing

feature extraction/

feature selection

machine learning

evaluation & model

post processing

DL1score = lnpb

fc ⋅ pc + (1 − fc) ⋅ plight-flavour

• Deep neural network requires also• Preprocessing, feature selection

etc.• Network with fully connected layers• Multi-class output ➔ allows also c-

tagging

• Using Keras (2.2.4) framework with tensorflow backend

• Full training procedure relies on HDF5• Application can be run in ATLAS

reconstruction software, relying on the LWTNN C++ interface [2]

Training Samples

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

1−10ATLAS Simulation Preliminary

=13 TeVslight−flavour jet

tt

Z’

(a)

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

ATLAS Simulation Preliminary

=13 TeVsb-jet

tt

Z’

(b)

Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour

(a) and b-jets (b).

The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z

0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

7−10

6−10

5−10

4−10

3−10

2−10

T = 0.8*b-jet pT b-hadron p


t=13 TeV, ts

(a)

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

4−10

3−10



=13 TeV, Z’s

(b)

Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0

(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.

A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z

0 events, to probe the high pT regime. This new sample, referred to as the

22

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

1−10ATLAS Simulation Preliminary

=13 TeVslight−flavour jet

tt

Z’

(a)

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10


=13 TeVsb-jet

tt

Z’

(b)

Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour

(a) and b-jets (b).

The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z

0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

7−10

6−10

5−10

4−10

3−10

2−10



t=13 TeV, ts

(a)

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

4−10

3−10



=13 TeV, Z’s

(b)

Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0

(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.

A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z

0 events, to probe the high pT regime. This new sample, referred to as the

22

t t̄• Using hybrid sample composed of and Z’ events

➔More statistics in higher pT region• Downsampling approach applied to match pT

and |η| distributions for all 3 categories • Ensure independence of tagging from

kinematics• Different hybrid compositions tested

ATL-PHYS-PUB-2017-013

hybrid sample in the following, is obtained by including b-jets from tt̄ if the corresponding b-hadronpT <250 GeV and from the Z

0 sample if the b-hadron pT >250 GeV. For c- and light-flavour jets, the samemixing strategy is applied, moving from tt̄ events to Z

0 events for values of the jet pT above 250 GeV.The full statistics of the tt̄ (5M events) and the Z

0 (3M events) samples is used for the training of themultivariate algorithms. Figure 16 shows the resulting pT distribution of b-jets in this training sample.

[GeV]T

b−jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1

/N d

N/d

p

7−10

6−10

5−10

4−10

3−10

2−10


=13 TeVs

Figure 16: b-jet pT distribution of the hybrid sample used for the training of the multivariate-based taggers.

4.2 MV2 tagger

The first high-level tagger is a BDT discriminant that combines the output of the low-level taggersdescribed in Section 3. The BDT algorithm is trained using the ROOT Toolkit for Multivariate DataAnalysis (TMVA) [31] on the hybrid sample. b-jets are considered as signal, c- and light-flavour jets asbackground. The performance is evaluated separately on the tt̄ and Z

0 samples. The list of input variablesalso includes the pT and ⌘ of the jets, as they give useful information in interpreting the separation powerof the variables from the low-level taggers.

The b- and c-jets are reweighted in pT and ⌘ to match the spectrum of the light-flavour jets. This procedureavoids di�erences in the kinematic distributions of signal and background that can be interpreted as dis-criminating by the training. The complete list of variables used for the training, as well as the features ofthe algorithm and the optimization of the training parameters are discussed in Ref. [5]. The c-jet fractionin the training is set to 7% and that of light-flavoured jets to 93%. These values achieve a suitable balancebetween light-flavour and c-jet rejection.

Given the availability of new low-level taggers, several variants of the MV2 taggers have been developed:

• a reference option, based on the standard impact parameter (IP2D and IP3D) and secondary vertex-based (SV1 and JetFitter) input variables is adopted for the 2016 data-analysis in Ref. [5] (MV2);

• a new option, including in addition the SMT (MV2Mu);

• a full option with the SMT and the RNNIP inputs in addition to the standard variables (MV2MuRnn).

The use of the full set of input variables from the SMT algorithm and the output of the SMT multivariateBDT discriminant have both been tested in the MV2 training. The MV2 BDT configuration is significantlysimplified by using a single output variable from SMT instead of six inputs. Furthermore, it is found that

23

ATL-PHYS-PUB-2017-013

Hyper Parameter (HP) Optimisation with GRID GPUs• Using docker image (built by Gitlab CI)

for jobs• Configurable amount of HP

combinations (configs)• Using panda job splitting to distribute

HP configurations over sites / jobs• Interactive development on JupyterHub

deployed with Kubernetes

training & validationdataset (hdf5)

panda (prun)

training script(python)

provides Docker image

GRIDGPUs rucio

CONTAINER withconfig files (json)

containing HP information

training results(json)

Workflow0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

DL1 b-jet probability

4−10

3−10

2−10

1−10

1

Nor

mal

ised

b-jetsc-jetslight flavour jets



p

4 4.5 5 5.5 6 6.5 7c-jet rejection

20

40

60

80

100

120

140

160

180

light

flav

our j

et re

ject

ion



p

MV2c10rnn performanceHP optimisation, best resultHP optimisation, medium resultHP optimisation, worst result

77% b-tagging efficiency

Optimisation Results• 800 combinations over 5 HP dimensions tested• Optimisation provides promising results• Confirms relation between ROC curve and validation loss

units in 3rd layer

256

512

units in 4th layer

42

48

units in 5th layer

36

42

activation function

tanh

relu

batch size

1000

1800

2600

3400

4200

5000

learning rate validation loss

0.0001

0.0021

0.0041

0.006

0.008

0.01

0.66

0.68

0.69

0.7

0.72

0.73ATLAS Simulation Preliminary DL1 Hyper parameter optimisation

• Each jet gets probability for being a b-, c- or light flavour jet

interactive development

laptop laptop GRID

advancement of the atlas deep learning based …dl1 c-jet probability 10−6 10−5 10−4 10−3...

Documents