p2r - benchmarking code for portability studies

Giuseppe Cerati (FNAL)
CCE-PPS Meeting
Dec. 4th, 2020


2020/12/04

Background

• SciDAC-4 project “HEP Reconstruction with Cutting Edge Computing Architectures”
  - https://computing.fnal.gov/hepreco-scidac4/
  - Two areas of work: LArTPC reconstruction and CMS Tracking / mkFit
    • S. Berkman et al., Reconstruction for Liquid Argon TPC Neutrino Detectors Using Parallel Architectures, EPJ Web Conf. 245 (2020), 02012, arXiv:2002.06291
    • S. Lantz et al., Speeding up Particle Track Reconstruction using a Parallel Kalman Filter Algorithm, JINST 15 P09030, arXiv:2006.00071

• Portability studies are of primary importance for our project
  - work ongoing on portability studies for “p2z”, a benchmarking code extracted from mkFit
  - this talk: explore potential collaboration about “p2r”


https://computing.fnal.gov/hepreco-scidac4/

https://arxiv.org/abs/2002.06295

https://arxiv.org/abs/2006.00071

Parallelization for HEP Event Reconstruction - ICHEP2020, 2020/07/28

LHC Reconstruction Challenges


• Increasing pile-up (PU) drives exponential increase in event reconstruction processing time

• Tracking is the dominant contributor to the reconstruction time, and it’s vital for the physics output of HEP experiments
  - reducing the tracking phase space implies significantly worse physics sensitivity

• Kalman filter-based tracking is the “standard” tracking method in HEP
  - demonstrated high efficiency physics performance, robust handling of material effects
  - track building is a combinatorial algorithm traditionally implemented in serial fashion, challenging to parallelize

Recap from ICHEP2020 slides


Parallelizing Tracking: the mkFit Approach

• mkFit mission: parallelize Kalman filter track building

• Matriplex library designed for SIMD processing of track candidates
  - bunches of small matrices operating in lock-step, auto-generated vectorized code is aware of matrix sparsity

• mkFit is multithreaded at multiple levels with TBB tasks: events, detector regions, bunches of seeds

• Reduce memory footprint with lightweight detector description (geometry, material, magnetic field)

• Minimize memory operations (number and size) within combinatorial branching
  - bookkeeping of explored candidates, clone only best ranking ones at each layer (with per seed cap)


[Figures: Matriplex memory layout; mkFit geometry representation]
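The lock-step processing of bunches of small matrices can be illustrated with a short sketch. This is a minimal pure-Python illustration under stated assumptions, not the Matriplex API: the names `MatrixBunch` and `multiply` are hypothetical, and the sketch ignores the sparsity awareness that the auto-generated Matriplex code has. The point it shows is the memory layout: element (i, j) of all matrices in the bunch is stored contiguously, so the innermost loop runs over the bunch with unit stride.

```python
# Sketch of a Matriplex-style layout (hypothetical names, not the real API).
# Element (i, j) of all n_mat matrices is stored contiguously, so a single
# loop over n processes the whole bunch in lock-step.

class MatrixBunch:
    """A bunch of n_mat square dim x dim matrices in structure-of-arrays form."""

    def __init__(self, dim, n_mat):
        self.dim = dim
        self.n_mat = n_mat
        # data[(i * dim + j) * n_mat + n] holds element (i, j) of matrix n.
        self.data = [0.0] * (dim * dim * n_mat)

    def index(self, i, j, n):
        return (i * self.dim + j) * self.n_mat + n

    def get(self, i, j, n):
        return self.data[self.index(i, j, n)]

    def set(self, i, j, n, value):
        self.data[self.index(i, j, n)] = value


def multiply(a, b):
    """Return c with c_n = a_n * b_n for every matrix n in the bunch."""
    assert a.dim == b.dim and a.n_mat == b.n_mat
    c = MatrixBunch(a.dim, a.n_mat)
    for i in range(a.dim):
        for j in range(a.dim):
            for k in range(a.dim):
                # Innermost loop over the bunch: unit-stride access, the
                # pattern a compiled implementation can map onto SIMD lanes.
                for n in range(a.n_mat):
                    c.data[c.index(i, j, n)] += a.get(i, k, n) * b.get(k, j, n)
    return c
```

In a compiled language the loop over `n` is the one that vectorizes; choosing the bunch size as a multiple of the SIMD width is a natural design choice under this layout.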



mkFit Physics and Computing Results

• mkFit track building achieves physics performance comparable to standard CMS tracking (first iteration)

• Standalone application achieves speedups of up to 3x from vectorization and up to 35x from multithreading across multiple collision events
  - vectorization speedup computed on main track building function only, evaluated on Intel Skylake Gold processor

• mkFit is integrated in the CMS framework (CMSSW); in a single-threaded application it is 6x faster
  - mkFit compiled using icc and AVX-512 extensions
  - track building with mkFit is faster than CMSSW track fitting!
  - integration of mkFit in CMS workflows is currently under investigation


[Figure: Concurrent Event Scaling on SKL-SP. Average speedup per event vs. number of threads, for 1, 2, 4, 8, 16, 32, and 64 concurrent events, compared with ideal scaling.]



Next Steps: Portable Code Explorations

• NVIDIA and CUDA are the current leaders in the GPU market, but the future will bring GPUs from a variety of vendors and possibly different heterogeneous solutions

• Need to explore solutions for code portability across platforms to avoid re-writing software for each platform
  - many different options are emerging, mainly compiler directives or libraries

• Started an effort to test different portable implementations using a single function from the mkFit code
  - function is the propagation of track parameters to a given position along the z axis
  - representative of full code in terms of math operation types; main limitation is that it does not include combinatorics

• Early tests using OpenACC and Kokkos give promising results
  - on a Summit node, the OpenACC GPU application is almost 10x faster than the OpenMP application fully loading the CPU
  - performance of Kokkos CPU version is similar to nominal implementation with dedicated Matriplex library

• Ongoing work towards a more complete suite of measurements, including OpenMP, OpenACC, Kokkos, Alpaka, Eigen, as well as different compilers
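The propagation of track parameters to a given z position can be sketched as follows. This is a minimal illustration, not the mkFit or p2z code: the function name `propagate_to_z` is hypothetical, a uniform field along z is assumed, and the signed curvature `kappa` of the transverse circle (1 / radius) is passed in directly rather than derived from charge, transverse momentum, and field strength. The state layout (x, y, z, px, py, pz) follows the one shown later in the talk.

```python
import math


def propagate_to_z(state, kappa, z_target):
    """Propagate (x, y, z, px, py, pz) along a helix to the plane z = z_target.

    kappa is the signed curvature of the transverse circle (assumed input).
    pz must be non-zero, otherwise the track never reaches the target plane.
    """
    x, y, z, px, py, pz = state
    pt = math.hypot(px, py)
    # Transverse arc length needed to cover the requested step in z.
    s_t = (z_target - z) * pt / pz
    cos_phi, sin_phi = px / pt, py / pt
    # Rotation angle of the transverse momentum over this arc.
    alpha = kappa * s_t
    cos_a, sin_a = math.cos(alpha), math.sin(alpha)
    if abs(alpha) < 1e-12:
        # Straight-line limit (no field or negligible bending).
        dx, dy = cos_phi * s_t, sin_phi * s_t
    else:
        dx = (cos_phi * sin_a - sin_phi * (1.0 - cos_a)) / kappa
        dy = (sin_phi * sin_a + cos_phi * (1.0 - cos_a)) / kappa
    return (x + dx, y + dy, z_target,
            px * cos_a - py * sin_a,
            px * sin_a + py * cos_a,
            pz)
```

The operation mix (a couple of trigonometric calls plus a handful of multiply-adds per track) is what makes a function like this a compact stand-in for the arithmetic of the full code, while leaving out the combinatorics.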


[Figure: Execution time for p2z on an Intel Xeon processor. Seconds (log scale) vs. threads, comparing kokkos and matriplex AVX builds.]



From p2z to p2r

• Initial p2z code only performed propagation to z, recently upgraded to perform Kalman update operation as well
  - some improvement in arithmetic intensity, but still not ideal
  - standalone benchmarking code is now the backbone of Kalman filter fit for endcap disk-only geometry

• “p2r” benchmarking code would be similar in scope, but for barrel-only geometry
  - small benchmarking code [O(100-1k) lines], with tunable parameters for scaling studies
  - both propagation and Kalman update functions are actually different from p2z
  - independent piece of work, with large commonalities
  - together they can be the groundwork for a portable Kalman filter track fit implementation


Propagation to R

• Barrel layer representation uses only 3 parameters:
  - average radius R, radius spread ΔR, layer length ΔZ
  - avoid filling memory with complex geometry structures (position, rotation of each module)
  - almost detector independent

• Two-step propagation:
  - to average radius, select hits within search window (inflated to account for ΔR)
  - for each hit, propagate to the exact hit radius
  - work out Kalman Filter math on tangent plane

• Math driven by helix parametrization formulas
  - some trigonometric functions, a few small matrix operations
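The two-step propagation described above can be sketched as follows. This is a hedged illustration under several assumptions, not the p2r or mkFit code: `BarrelLayer`, `step`, `radial_slope`, `propagate_to_r`, and `select_hits` are hypothetical names, the signed curvature `kappa` is taken as an input, the radius is reached with a Newton-style iteration on the arc length (a choice made here for brevity), and the window inflation is a first-order estimate of how far the track moves in phi and z while crossing ΔR. Tracks that curl back before reaching the target radius are not protected against.

```python
import math
from dataclasses import dataclass


@dataclass
class BarrelLayer:
    r: float   # average radius R
    dr: float  # radius spread (Delta R)
    dz: float  # layer length (Delta Z); not used in this sketch


def step(state, kappa, s_t):
    """Advance (x, y, z, px, py, pz) by transverse arc length s_t on a helix."""
    x, y, z, px, py, pz = state
    pt = math.hypot(px, py)
    c, s = px / pt, py / pt
    a = kappa * s_t
    ca, sa = math.cos(a), math.sin(a)
    if abs(a) < 1e-12:  # straight-line limit
        dx, dy = c * s_t, s * s_t
    else:
        dx = (c * sa - s * (1.0 - ca)) / kappa
        dy = (s * sa + c * (1.0 - ca)) / kappa
    return (x + dx, y + dy, z + s_t * pz / pt,
            px * ca - py * sa, px * sa + py * ca, pz)


def radial_slope(state):
    """dr/ds: transverse unit direction projected on the radial direction."""
    x, y, _, px, py, _ = state
    r = math.hypot(x, y)
    return (x * px + y * py) / (r * math.hypot(px, py)) if r > 0.0 else 1.0


def propagate_to_r(state, kappa, r_target, max_iter=20, tol=1e-9):
    """Newton-style iteration on the arc length until the radius matches."""
    for _ in range(max_iter):
        miss = r_target - math.hypot(state[0], state[1])
        if abs(miss) < tol:
            break
        state = step(state, kappa, miss / radial_slope(state))
    return state


def select_hits(state, hits, layer, half_dphi, half_dz):
    """Keep hits (x, y, z) inside a phi-z window around a state at radius layer.r."""
    x, y, z, px, py, pz = state
    pt = math.hypot(px, py)
    # First-order path needed to cross the radius spread (assumed inflation rule).
    path = layer.dr / abs(radial_slope(state))
    tangential = abs(x * py - y * px) / (layer.r * pt)
    win_phi = half_dphi + path * tangential / layer.r
    win_z = half_dz + path * abs(pz) / pt
    phi = math.atan2(y, x)
    selected = []
    for hx, hy, hz in hits:
        dphi = math.atan2(hy, hx) - phi
        dphi = math.atan2(math.sin(dphi), math.cos(dphi))  # wrap to (-pi, pi]
        if abs(dphi) <= win_phi and abs(hz - z) <= win_z:
            selected.append((hx, hy, hz))
    return selected
```

A typical use would call `propagate_to_r` once with `layer.r`, pass the result to `select_hits`, and then call `propagate_to_r` again for each selected hit with that hit's own radius before the Kalman update.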


Kalman Update

• Kalman update consists of a few small-matrix operations
  - implementation heavily relies on Matriplex library


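[Diagram: Kalman filter step from layer N-1 to layer N]
  - updated state after N-1: $x_{N-1}^{N-1}$, with components $(x, y, z, p_x, p_y, p_z)$
  - propagation to N: $x_N^{N-1} = F_{N-1} \cdot x_{N-1}^{N-1}$
  - Nth measurement: $m_N$, with components $(x, y, z)$
  - updated state after N: $x_N^N = x_N^{N-1} + K_N \cdot (m_N - H_N \cdot x_N^{N-1})$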
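The update step can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the Matriplex-based implementation: the function names are hypothetical, the projection H is assumed to pick the first three components (x, y, z) of the six-component state, and the gain is computed with the textbook expression K = C Hᵀ (H C Hᵀ + V)⁻¹, where C is the predicted state covariance and V the measurement covariance (neither appears explicitly on the slide).

```python
def invert_3x3(m):
    """Invert a 3x3 matrix with the cofactor formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def kalman_update(x_pred, cov_pred, meas, meas_cov):
    """Return the updated state and covariance for a position measurement.

    x_pred: predicted state (x, y, z, px, py, pz); cov_pred: its 6x6 covariance.
    meas: measured (x, y, z); meas_cov: its 3x3 covariance.
    H is assumed to project the state onto its first three components.
    """
    # Residual m - H x.
    resid = [meas[i] - x_pred[i] for i in range(3)]
    # S = H C H^T + V: top-left 3x3 block of C plus V.
    s = [[cov_pred[i][j] + meas_cov[i][j] for j in range(3)] for i in range(3)]
    s_inv = invert_3x3(s)
    # Gain K = C H^T S^-1 (6x3); C H^T is the first three columns of C.
    gain = [[sum(cov_pred[i][k] * s_inv[k][j] for k in range(3))
             for j in range(3)] for i in range(6)]
    # x_upd = x_pred + K * residual.
    x_upd = [x_pred[i] + sum(gain[i][j] * resid[j] for j in range(3))
             for i in range(6)]
    # C_upd = C - K H C; H C is the first three rows of C.
    cov_upd = [[cov_pred[i][j] - sum(gain[i][k] * cov_pred[k][j] for k in range(3))
                for j in range(6)] for i in range(6)]
    return x_upd, cov_upd
```

With equal, uncorrelated uncertainties on prediction and measurement, the updated position lands halfway between the two, and momentum components change only through their correlation with position.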


Summary

• SciDAC project working on modernization of reconstruction algorithms is exploring portability studies with benchmark codes extracted from mkFit

• Opportunity for collaboration with CCE-PPS on “p2r” code
  - complementary to ongoing “p2z” studies: independent code but large room for cross-pollination of ideas and solutions
  - combined they may result in a portable track fit implementation for CMS
