advancement of the atlas deep learning based …dl1 c-jet probability 10−6 10−5 10−4 10−3...

1
Manuel Guth (Uni Freiburg, CEA Saclay), 27.02.2019 [1] “Optimisation and performance studies of the ATLAS b-tagging algorithms for the 2017-18 LHC run” - ATL-PHYS-PUB-2017-013 [2] https://github.com/lwtnn/lwtnn 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 DL1 c-jet probability 6 10 5 10 4 10 3 10 2 10 1 10 1 Normalised b-jets c-jets light flavour jets ATLAS Simulation Preliminary t = 13 TeV, t s |<2.5 η >20 GeV and | T p Advancement of the ATLAS Deep Learning Based High Level Flavour Tagger DL1 -LHCC Poster Session - q ¯ t q 0 g ¯ b W + W - ¯ q ¯ q 0 t b b ¯ t q 0 g ` + b ¯ b H 0 W - ¯ q ¯ b W + t t beyond SM Why b-Tagging? Several interesting physics processes have b- quarks in their final state Or a veto on b-quarks can suppress the background ¯ t q ¯ e g b ¯ b b ¯ b W - e - ¯ q 0 g W + H t Heavy-flavour tagging important tool for physics analyses • Precision measurements • Search for new physics High-level Taggers d0 secondary vertex primary vertex tertiary vertex b-jet light-jets b-hadron impact parameter tracks c-hadron Impact Parameter based (IP2D,IP3D) Inclusive secondary vertex (SV1) NN based on track parameters (RNNIP) Semi-leptonic decays to muons (SMT) Baseline Taggers Boosted Decision Tree (MV2) Deep Learning (DL1) Topological secondary & tertiary vertices (JetFitter) b-Tagging Structure in ATLAS Baseline taggers deploy specific heavy flavour jet properties • Long lifetime (1.5 ps 3mm track in detector) • High mass (5 GeV) •High decay product multiplicity • b-hadron decays to a c-hadron (|V cb |>>|V ub |) High-level taggers (MV2 & DL1 [1]) combine these information (40-50 variables) Network Output input node input node input node input node input node input node bias bias bias bias b node c node light- flavour node hidden layers output layer input layer pb pc plight Deep Neural Network Architecture pre- processing feature extraction/ feature selection machine learning evaluation & model post processing DL1 score = ln p b f c p c + (1 f c ) p light-flavour Deep neural network requires also Preprocessing, feature selection etc. Network with fully connected layers Multi-class output allows also c- tagging Using Keras (2.2.4) framework with tensorflow backend Full training procedure relies on HDF5 Application can be run in ATLAS reconstruction software, relying on the LWTNN C++ interface [2] Training Samples [GeV] T jet p 0 200 400 600 800 1000 1200 1400 1600 1800 2000 T 1/N dN/dp 6 10 5 10 4 10 3 10 2 10 1 10 ATLAS Simulation Preliminary =13 TeV s lightflavour jet t t Z’ [GeV] T jet p 0 200 400 600 800 1000 1200 1400 1600 1800 2000 T 1/N dN/dp 6 10 5 10 4 10 3 10 2 10 ATLAS Simulation Preliminary =13 TeV s b-jet t t Z’ t ¯ t Using hybrid sample composed of and Z’ events More statistics in higher p T region Downsampling approach applied to match p T and |η| distributions for all 3 categories Ensure independence of tagging from kinematics Different hybrid compositions tested ATL-PHYS-PUB-2017-013 [GeV] T bjet p 0 200 400 600 800 1000 1200 1400 1600 1800 2000 T 1/N dN/dp 7 10 6 10 5 10 4 10 3 10 2 10 ATLAS Simulation Preliminary =13 TeV s ATL-PHYS-PUB-2017-013 Hyper Parameter (HP) Optimisation with GRID GPUs Using docker image (built by Gitlab CI) for jobs Configurable amount of HP combinations (configs) Using panda job splitting to distribute HP configurations over sites / jobs Interactive development on JupyterHub deployed with Kubernetes training & validation dataset (hdf5) panda (prun) training script (python) provides Docker image GRID GPUs rucio CONTAINER with config files (json) containing HP information training results (json) Workflow 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 DL1 b-jet probability 4 10 3 10 2 10 1 10 1 Normalised b-jets c-jets light flavour jets ATLAS Simulation Preliminary t = 13 TeV, t s |<2.5 η >20 GeV and | T p 4 4.5 5 5.5 6 6.5 7 c-jet rejection 20 40 60 80 100 120 140 160 180 light flavour jet rejection ATLAS Simulation Preliminary t = 13 TeV, t s |<2.5 η >20 GeV and | T p MV2c10rnn performance HP optimisation, best result HP optimisation, medium result HP optimisation, worst result 77% b-tagging efficiency Optimisation Results 800 combinations over 5 HP dimensions tested Optimisation provides promising results Confirms relation between ROC curve and validation loss units in 3rd layer 256 512 units in 4th layer 42 48 units in 5th layer 36 42 activation function tanh relu batch size 1000 1800 2600 3400 4200 5000 learning rate validation loss 0.0001 0.0021 0.0041 0.006 0.008 0.01 0.66 0.68 0.69 0.7 0.72 0.73 ATLAS Simulation Preliminary DL1 Hyper parameter optimisation Each jet gets probability for being a b-, c- or light flavour jet interactive development laptop laptop GRID

Upload: others

Post on 05-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advancement of the ATLAS Deep Learning Based …DL1 c-jet probability 10−6 10−5 10−4 10−3 10−2 10−1 Normalised 1 b-jets c-jets light flavour jets ATLASSimulation Preliminary

Manuel Guth (Uni Freiburg, CEA Saclay), 27.02.2019[1] “Optimisation and performance studies of the ATLAS b-tagging algorithms for the 2017-18 LHC run” - ATL-PHYS-PUB-2017-013

[2] https://github.com/lwtnn/lwtnn

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1DL1 c-jet probability

6−10

5−10

4−10

3−10

2−10

1−10

1

Nor

mal

ised

b-jetsc-jetslight flavour jets

ATLAS Simulation Preliminaryt= 13 TeV, ts

|<2.5η>20 GeV and |T

p

Advancement of the ATLAS Deep Learning Based High Level Flavour

Tagger DL1 -LHCC Poster Session -

q

t̄ q0

g

W+

W�

q̄0

t

b

b

t̄q0

g`+

b

H0

W�

W+

tt

beyond SM

Why b-Tagging?

• Several interesting physics processes have b-quarks in their final state

• Or a veto on b-quarks can suppress the background

q

⌫̄e

g

b

b

W�

e�

q̄0

g

W+

H

t

• Heavy-flavour tagging important tool for physics analyses• Precision measurements• Search for new physics

High-level Taggers

d0

secondaryvertex

primary vertex

tertiaryvertex

b-jet

light-jets

b-hadron

impactparameter

tracks

c-hadron

Impact Parameter based

(IP2D,IP3D)

Inclusive secondary vertex (SV1)

NN based on track parameters (RNNIP)

Semi-leptonic decays to muons (SMT)

Baseline Taggers

Boosted Decision Tree

(MV2)

Deep Learning(DL1)

Topological secondary & tertiary vertices (JetFitter)

b-Tagging Structure in ATLAS

• Baseline taggers deploy specific heavy flavour jet properties• Long lifetime (∼1.5 ps ➔ ∼3mm track in detector)• High mass (∼5 GeV)• High decay product multiplicity• b-hadron decays to a c-hadron (|Vcb|>>|Vub|)

• High-level taggers (MV2 & DL1 [1]) combine these information (40-50 variables)

Network Output

input node

input node

input node

input node

input node

input node

bias

bias

bias

… …

bias

b node

c node

light- flavour node

hidden layers output layerinput layer

pb

pc

plight

Deep Neural Network Architecture

pre-processing

feature extraction/

feature selection

machine learning

evaluation & model

post processing

DL1score = lnpb

fc ⋅ pc + (1 − fc) ⋅ plight-flavour

• Deep neural network requires also• Preprocessing, feature selection

etc.• Network with fully connected layers• Multi-class output ➔ allows also c-

tagging

• Using Keras (2.2.4) framework with tensorflow backend

• Full training procedure relies on HDF5• Application can be run in ATLAS

reconstruction software, relying on the LWTNN C++ interface [2]

Training Samples

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

1−10ATLAS Simulation Preliminary

=13 TeVslight−flavour jet

tt

Z’

(a)

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

ATLAS Simulation Preliminary

=13 TeVsb-jet

tt

Z’

(b)

Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour

(a) and b-jets (b).

The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z

0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

7−10

6−10

5−10

4−10

3−10

2−10

T = 0.8*b-jet pT b-hadron p

ATLAS Simulation Preliminary

t=13 TeV, ts

(a)

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

4−10

3−10

T = 0.8*b-jet pT b-hadron p

ATLAS Simulation Preliminary

=13 TeV, Z’s

(b)

Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0

(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.

A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z

0 events, to probe the high pT regime. This new sample, referred to as the

22

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

1−10ATLAS Simulation Preliminary

=13 TeVslight−flavour jet

tt

Z’

(a)

[GeV]T

jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1/N

dN

/dp

6−10

5−10

4−10

3−10

2−10

ATLAS Simulation Preliminary

=13 TeVsb-jet

tt

Z’

(b)

Figure 14: Distribution of the jet transverse momentum of the Z0 sample compared to that of tt̄ events for light-flavour

(a) and b-jets (b).

The relation between the pT and energy of the jet to those of the original b-quark and hadron is key forunderstanding the flavour tagging response as a function of the jet pT. Two di�erent regimes can arise, asshown in Figure 15. In tt̄ events the b-jets originate from a relatively low-mass (mt ) state. This results insmall pT transferred to the heavy b-hadron. This leads to a correlation between jet pT and heavy hadronpT for pT . mt . For jet pT & mt , the jet transverse momentum is determined by nearby hadronic activityunrelated to the heavy hadron and the correlation is therefore reduced. Instead, b-hadrons produced in thebroad Z

0 decays with large pT yield a high degree of correlation with the jet pT. It is important to ensurethat the training of the flavour taggers covers both of these regimes.

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

7−10

6−10

5−10

4−10

3−10

2−10

T = 0.8*b-jet pT b-hadron p

ATLAS Simulation Preliminary

t=13 TeV, ts

(a)

[GeV]T

b-hadron p

200 400 600 800 1000 1200 1400

[G

eV

]T

b-je

t p

200

400

600

800

1000

1200

1400

4−10

3−10

T = 0.8*b-jet pT b-hadron p

ATLAS Simulation Preliminary

=13 TeV, Z’s

(b)

Figure 15: Two-dimensional correlations between the b-hadron and the b-jet transverse momenta for tt̄ (a) and Z0

(b) events. The continuous line represents the most probable value of the b quark fragmentation function used inthe simulation that controls the relation between the b quark and the b hadron energy.

A new training strategy has been employed using a sample made both of tt̄ events, to characterise thelow pT region, and broad Z

0 events, to probe the high pT regime. This new sample, referred to as the

22

t t̄• Using hybrid sample composed of and Z’ events

➔More statistics in higher pT region• Downsampling approach applied to match pT

and |η| distributions for all 3 categories • Ensure independence of tagging from

kinematics• Different hybrid compositions tested

ATL-PHYS-PUB-2017-013

hybrid sample in the following, is obtained by including b-jets from tt̄ if the corresponding b-hadronpT <250 GeV and from the Z

0 sample if the b-hadron pT >250 GeV. For c- and light-flavour jets, the samemixing strategy is applied, moving from tt̄ events to Z

0 events for values of the jet pT above 250 GeV.The full statistics of the tt̄ (5M events) and the Z

0 (3M events) samples is used for the training of themultivariate algorithms. Figure 16 shows the resulting pT distribution of b-jets in this training sample.

[GeV]T

b−jet p0 200 400 600 800 1000 1200 1400 1600 1800 2000

T1

/N d

N/d

p

7−10

6−10

5−10

4−10

3−10

2−10

ATLAS Simulation Preliminary

=13 TeVs

Figure 16: b-jet pT distribution of the hybrid sample used for the training of the multivariate-based taggers.

4.2 MV2 tagger

The first high-level tagger is a BDT discriminant that combines the output of the low-level taggersdescribed in Section 3. The BDT algorithm is trained using the ROOT Toolkit for Multivariate DataAnalysis (TMVA) [31] on the hybrid sample. b-jets are considered as signal, c- and light-flavour jets asbackground. The performance is evaluated separately on the tt̄ and Z

0 samples. The list of input variablesalso includes the pT and ⌘ of the jets, as they give useful information in interpreting the separation powerof the variables from the low-level taggers.

The b- and c-jets are reweighted in pT and ⌘ to match the spectrum of the light-flavour jets. This procedureavoids di�erences in the kinematic distributions of signal and background that can be interpreted as dis-criminating by the training. The complete list of variables used for the training, as well as the features ofthe algorithm and the optimization of the training parameters are discussed in Ref. [5]. The c-jet fractionin the training is set to 7% and that of light-flavoured jets to 93%. These values achieve a suitable balancebetween light-flavour and c-jet rejection.

Given the availability of new low-level taggers, several variants of the MV2 taggers have been developed:

• a reference option, based on the standard impact parameter (IP2D and IP3D) and secondary vertex-based (SV1 and JetFitter) input variables is adopted for the 2016 data-analysis in Ref. [5] (MV2);

• a new option, including in addition the SMT (MV2Mu);

• a full option with the SMT and the RNNIP inputs in addition to the standard variables (MV2MuRnn).

The use of the full set of input variables from the SMT algorithm and the output of the SMT multivariateBDT discriminant have both been tested in the MV2 training. The MV2 BDT configuration is significantlysimplified by using a single output variable from SMT instead of six inputs. Furthermore, it is found that

23

ATL-PHYS-PUB-2017-013

Hyper Parameter (HP) Optimisation with GRID GPUs• Using docker image (built by Gitlab CI)

for jobs• Configurable amount of HP

combinations (configs)• Using panda job splitting to distribute

HP configurations over sites / jobs• Interactive development on JupyterHub

deployed with Kubernetes

training & validationdataset (hdf5)

panda (prun)

training script(python)

provides Docker image

GRIDGPUs rucio

CONTAINER withconfig files (json)

containing HP information

training results(json)

Workflow0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

DL1 b-jet probability

4−10

3−10

2−10

1−10

1

Nor

mal

ised

b-jetsc-jetslight flavour jets

ATLAS Simulation Preliminaryt= 13 TeV, ts

|<2.5η>20 GeV and |T

p

4 4.5 5 5.5 6 6.5 7c-jet rejection

20

40

60

80

100

120

140

160

180

light

flav

our j

et re

ject

ion

ATLAS Simulation Preliminaryt= 13 TeV, ts

|<2.5η>20 GeV and |T

p

MV2c10rnn performanceHP optimisation, best resultHP optimisation, medium resultHP optimisation, worst result

77% b-tagging efficiency

Optimisation Results• 800 combinations over 5 HP dimensions tested• Optimisation provides promising results• Confirms relation between ROC curve and validation loss

units in 3rd layer

256

512

units in 4th layer

42

48

units in 5th layer

36

42

activation function

tanh

relu

batch size

1000

1800

2600

3400

4200

5000

learning rate validation loss

0.0001

0.0021

0.0041

0.006

0.008

0.01

0.66

0.68

0.69

0.7

0.72

0.73ATLAS Simulation Preliminary DL1 Hyper parameter optimisation

• Each jet gets probability for being a b-, c- or light flavour jet

interactive development

laptop laptop GRID