"GPUs in NA62 trigger system"

Gianluca Lamanna (CERN)
NA62 Collaboration meeting in Brussels, 9.9.2010

TRANSCRIPT

Page 1: "GPUs in NA62 trigger system"

"GPUs in NA62 trigger system"

Gianluca Lamanna (CERN)

NA62 Collaboration meeting in Brussels 9.9.2010

Page 2: Outline

Short reminder
Use of GPUs in the NA62 trigger system
Study of algorithm performance: resolution and timing
Towards a working prototype
Conclusions

Page 3: GPU Idea: reminder

Nowadays GPUs (standard PC video-card processors) are very powerful processors, with computing power exceeding 1 Teraflops. GPUs are designed for digital imaging, video games and computer graphics. In this context the main problem is to apply the same operations to a large quantity of data (for instance move an object, transform a part of an image, etc.), so the architecture of the processor is highly parallel. In recent years there have been several efforts to use GPUs for high-performance computing in many fields (GPGPU).

Is it possible to use the GPUs for "hard real-time" computing? Is it possible to build a trigger system based on GPUs for high-performance online selection? Is it possible to select events with high efficiency using cheap off-the-shelf components?

Page 4: GPU characteristics: reminder

SIMD architecture. Large number of cores. The single cores are grouped in multiprocessors sharing a small quantity of on-chip memory (very fast). A huge quantity of external memory is accessible from each core. Particular care is needed in programming the chip to exploit the architecture. Big performance gains are guaranteed for parallelizable problems. Two logical levels of parallelization: parallel algorithm and parallel execution.

TESLA C1060:
• 240 cores
• 30 multiprocessors
• 4 GB RAM
• 102 GB/s memory bandwidth
• PCI-e gen2 connection

The GPU ("device") always has to be connected to a CPU ("host")!!!

Page 5: The "GPU trigger"

[Diagram: three readout architectures.]

Standard trigger system: FE digitization → trigger primitives → pipeline → trigger processor (L0, custom HW) → PCs (L1, commercial PCs).

"Triggerless": FE digitization → PCs → PCs (commercial PCs only).

"Quasi-triggerless" with GPUs: FE digitization + buffer (+ trigger primitives) → PCs+GPU (L0) → PCs+GPU (L1), i.e. custom HW only at the front end, commercial PCs and a commercial GPU system elsewhere.

Page 6: Benefits

The main advantage of using GPUs is to have huge computing power in a compact system → cheap, off-the-shelf, a large consumer sector in continuous development, easy high-level programming (C/C++), a fully reconfigurable system, minimum custom hardware, very innovative (nobody in the world is using GPUs for triggering at the moment!).

The software trigger levels are the natural place for the GPUs → reduction of farm dimension.

At L0 the GPUs could be used to design more flexible and efficient trigger algorithms based on high-quality, fast reconstructed primitives → more physics channels collected, lower bandwidth employed, higher purity for the main channels (e.g. the HyperCP events, sgoldstino searches).

[See other physics examples in my talks in Anacapri (2.9.2009) and at the NA62 Physics Handbook Workshop (12.12.2009)]

Page 7: L1 "GPU trigger"

[Diagram: TEL62 → L0TP → L1 PC (+GPU) → L1TP → L2 PC (+GPU); 1 MHz into L1, 100 kHz into L2.]

The use of GPUs in the software trigger levels is straightforward: the GPUs act as "co-processors" that increase the power of the PCs, allowing faster decisions with a smaller number of CPU cores involved.

The RICH L1 dimension is dominated by the computing power: the 4 Gigabit links from the 4 RICH TELL1s should, in principle, be managed by a single PC. Assuming 10 s of total time for L1 and 1 MHz of input rate, the time budget per event is 1 µs.

The LKr is read at L1 at 100 kHz; about 173 Gb/s are produced. At least 20 PCs have to be dedicated to the event building (assuming 10 GbE links after a switch). Each PC sees 5 kHz, resulting in 200 µs of maximum processing time. Also in this case the GPUs could be employed to meet this time budget while avoiding a big increase in farm cost (in the TD we assumed 4 ms per event).

Assuming 200 µs for ring reconstruction (and other small tasks), at 1 MHz there are 1 MHz × 200 µs = 200 events in flight at any moment, so we need 200 cores (25 PCs) to produce the L1 trigger primitives. Using GPUs, a reduction factor of 200 is not impossible (see later).

Page 8: L0 "GPU trigger"

[Diagram: TEL62 → L0 GPU (10 MHz in, 10 MHz out) → L0TP → 1 MHz; max 1 ms total L0 latency.]

In the L0 GPU one event has to be processed in 100 ns (10 MHz input rate) and the total latency should be less than 100 µs!!!

The processing chain, with a budget of max 100 µs overall:

Data arrive → the protocol stack is possibly managed in the receiver card.
Transfer into the RAM → the non-deterministic behavior of the CPU should be avoided (real-time OS).
Transfer of a packet of data into the GRAM on the video card → the PCI-e gen2 is fast enough; transfers can run concurrently with processing.
Processing → as fast as possible!!!
Send the results back to the PC → done!

Page 9: Example: RICH

[Diagram: two TEL62 boards, each with four TDCBs and a GBE card, sending R/O ~50 MB/s, standard primitives ~22 MB/s and GPU primitives ~1.2 Gb/s to the L0 GPU.]

2 TEL62 for one spot (1000 PMs). The GPU needs the positions of the PMs → 10 bits per hit, so 3 hits can be packed in a single word (as sketched below). 20 hits × 5 MHz / 3 ≈ 1.2 Gb/s. 1 link for R/O, 1 link for standard primitives (histograms), 2 links for the hit positions for the GPU.
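As a concrete sketch of the packing just described (assumption: the three 10-bit PM IDs sit in the low 30 bits of a 32-bit word; the real field order on the link may differ):

    // Pack three 10-bit PM hit IDs into one 32-bit word (field order assumed).
    #include <stdint.h>

    static inline uint32_t pack_hits(uint32_t h0, uint32_t h1, uint32_t h2)
    {
        return (h0 & 0x3FFu) | ((h1 & 0x3FFu) << 10) | ((h2 & 0x3FFu) << 20);
    }

    static inline void unpack_hits(uint32_t w, uint32_t *h0, uint32_t *h1, uint32_t *h2)
    {
        *h0 = w & 0x3FFu;
        *h1 = (w >> 10) & 0x3FFu;
        *h2 = (w >> 20) & 0x3FFu;
    }

    // Bandwidth check: 20 hits x 5 MHz / 3 hits-per-word ~ 33e6 words/s,
    // i.e. ~1.1 Gb/s of 32-bit words, consistent with the ~1.2 Gb/s quoted above.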

It would be very useful to have rings (center and radius) at L0: measurement of the particle velocity; first ingredient for PID (we also need the spectrometer); possibility to have missing-mass information (assuming the particle mass and the Pt kick); very useful for rare decays (K→πγγ, K→πγ, Ke2γ, …).

Page 10: Ring finding with GPU: generalities

The parallelization offered by the GPU architecture can be exploited in two different ways:

In the algorithm: parts of the same algorithm are executed on different cores in order to speed up single-event processing. Processing many events at the same time: each core processes a single event.

The two "ways" are usually mixed.

The "threads" running in a multiprocessor (8 cores each) communicate through the very fast shared memory (1 TB/s bandwidth). The data are stored in the huge global memory (a "packet" of N events is periodically copied into the global memory).

In general we have two approaches: each thread performs very simple operations → heavy use of shared memory; each thread performs harder computations → light use of shared memory. A minimal sketch of the two mappings follows.
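A minimal CUDA sketch of the two mappings (Event, Result, reconstruct and the launch shapes are placeholders, not the actual NA62 code):

    struct Event  { int nhits; float x[32], y[32]; };     // packed hit data
    struct Result { float xc, yc, R; };

    __device__ Result reconstruct(const Event &e);        // some ring finder

    // Mapping B: one event per thread; a "packet" of n events has been
    // copied to global memory beforehand.
    __global__ void per_event_kernel(const Event *packet, Result *out, int n)
    {
        int ev = blockIdx.x * blockDim.x + threadIdx.x;
        if (ev < n)
            out[ev] = reconstruct(packet[ev]);            // one core, one event
    }
    // per_event_kernel<<<(n + 255) / 256, 256>>>(d_packet, d_out, n);
    //
    // Mapping A would instead launch many threads on a SINGLE event, with the
    // threads cooperating on parts of the same algorithm (POMH below is an
    // example of this style).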

Page 11: On the CPU

Minimization of the log-likelihood

L = -2 Σ_i log f(r_i - r_0)

with dr_0 = 4.95, σ = 2.9755 and A = -2aσe^(0.5)(dr_0 + σ), where f describes the radial residuals inside the PM acceptance and is constant outside it (from the RICH test-beam analysis, Antonino).

The minimization is done using the MINUIT package with ROOT.

Page 12: POMH

Each PM (1000 in total) is considered as the center of a circle. For each center a histogram is built with the distances between that center and the hits (<32). The whole processor is used for a single event (huge number of centers); each single thread computes only a few distances, and several histograms are computed in different shared-memory spaces. This is not natural for the processor: it isn't possible to process more than one event at the same time (the parallelism is fully exploited to speed up the computation of one event).

• Very important: particular care has to be taken when writing to shared memory to avoid "conflicts" → in this algorithm it isn't easy to avoid conflicts! (In case of conflicts the writes to memory are serialized → loss of time.) A shared-memory histogram sketch is given below.
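A minimal CUDA sketch of this scheme, with illustrative sizes and binning (not the actual NA62 code): the whole device works on one event, each block handles 8 candidate centers, and threads vote into shared-memory histograms with conflicts serialized by atomicAdd.

    #define NBINS 32
    #define BINW  1.0f   // assumed bin width

    __global__ void pomh_kernel(const float2 *pm_pos, int npm,    // ~1000 PMs
                                const float2 *hits, int nhits,    // ONE event
                                unsigned int *histos)             // [npm][NBINS]
    {
        __shared__ unsigned int h[8][NBINS];           // 8 centers per block
        int c = blockIdx.x * 8 + threadIdx.y;          // candidate center index
        if (threadIdx.x < NBINS) h[threadIdx.y][threadIdx.x] = 0;
        __syncthreads();
        if (c < npm) {
            float2 ctr = pm_pos[c];
            for (int i = threadIdx.x; i < nhits; i += blockDim.x) {
                float dx = hits[i].x - ctr.x, dy = hits[i].y - ctr.y;
                int bin = (int)(sqrtf(dx * dx + dy * dy) / BINW);
                if (bin < NBINS)
                    atomicAdd(&h[threadIdx.y][bin], 1u);   // conflict -> serialized
            }
        }
        __syncthreads();
        if (c < npm && threadIdx.x < NBINS)
            histos[c * NBINS + threadIdx.x] = h[threadIdx.y][threadIdx.x];
    }
    // Launch, one event at a time: pomh_kernel<<<npm / 8, dim3(32, 8)>>>(...);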

Page 13: DOMH

[Diagram: global memory feeding several blocks; each block handles one event with M threads (each thread covering N PMs) and its own shared-memory histogram space.]

Exactly the same algorithm as POMH, but with a different resource assignment. The system is exploited in a more natural way: each block is dedicated to a single event, using the shared memory for its histograms. Several events are processed in parallel at the same time, and it is easier to avoid conflicts in shared and global memory (see the sketch below).
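A hedged sketch of the DOMH resource assignment (names, binning and the per-thread histogram placement are illustrative): blockIdx.x selects the event, each thread scans its share of the candidate centers, and the block reduces to the best ring.

    __global__ void domh_kernel(const float2 *pm_pos, int npm,
                                const float2 *hits, const int *nhits,
                                int maxhits, float4 *best)   // (xc, yc, R, votes)
    {
        int ev = blockIdx.x;                        // one block = one event
        const float2 *h = &hits[ev * maxhits];
        int nh = nhits[ev];
        __shared__ float4 cand[256];                // blockDim.x <= 256 assumed

        float4 b = make_float4(0.f, 0.f, 0.f, -1.f);
        for (int c = threadIdx.x; c < npm; c += blockDim.x) {
            unsigned int histo[32] = {0};           // distance histogram, unit bins
            for (int i = 0; i < nh; ++i) {
                float dx = h[i].x - pm_pos[c].x, dy = h[i].y - pm_pos[c].y;
                int bin = (int)sqrtf(dx * dx + dy * dy);
                if (bin < 32) ++histo[bin];
            }
            for (int r = 0; r < 32; ++r)            // most-voted radius wins
                if ((float)histo[r] > b.w)
                    b = make_float4(pm_pos[c].x, pm_pos[c].y, (float)r, (float)histo[r]);
        }
        cand[threadIdx.x] = b;
        __syncthreads();
        if (threadIdx.x == 0) {                     // pick the best center
            float4 w = cand[0];
            for (int t = 1; t < blockDim.x; ++t)
                if (cand[t].w > w.w) w = cand[t];
            best[ev] = w;
        }
    }
    // Launch: domh_kernel<<<nEvents, 128>>>(...);  // many events in parallel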

Page 14: HOUGH

[Diagram: 3D Hough accumulator, axes X, Y and radius.]

Each hit is taken as the center of a test circle with a given radius; the ring center is the best common point of the test circles. PM positions → constant memory; hits → global memory; test-circle prototypes → constant memory.

3D space of histograms in shared memory (2D grid vs test-circle radius). The limitation comes from the total shared-memory budget (16 KB). One thread per center (hit) → 32 threads (in one thread block) per event; a sketch of the voting step follows.
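A heavily hedged sketch of the voting step under those constraints (the grid, the radius binning and the circle sampling are all assumptions):

    #define NX 16
    #define NY 16
    #define NR 8

    __global__ void hough_kernel(const float2 *hits, const int *nhits,
                                 int maxhits, unsigned int *acc_out)
    {
        __shared__ unsigned int acc[NR * NY * NX];      // 8 KB, within 16 KB limit
        const float x0 = -32.f, y0 = -32.f, cell = 4.f; // coarse 2D grid (assumed)
        const float r0 = 8.f, rstep = 1.f;              // radius hypotheses (assumed)
        int ev = blockIdx.x;                            // one block = one event
        for (int k = threadIdx.x; k < NR * NY * NX; k += blockDim.x) acc[k] = 0;
        __syncthreads();
        if (threadIdx.x < nhits[ev]) {                  // one thread per hit
            float2 h = hits[ev * maxhits + threadIdx.x];
            for (int r = 0; r < NR; ++r)
                for (int a = 0; a < 32; ++a) {          // sample the test circle
                    float phi = a * 0.19634954f;        // 2*pi/32
                    float R = r0 + r * rstep;
                    int ix = (int)((h.x + R * cosf(phi) - x0) / cell);
                    int iy = (int)((h.y + R * sinf(phi) - y0) / cell);
                    if (ix >= 0 && ix < NX && iy >= 0 && iy < NY)
                        atomicAdd(&acc[(r * NY + iy) * NX + ix], 1u);
                }
        }
        __syncthreads();
        for (int k = threadIdx.x; k < NR * NY * NX; k += blockDim.x)
            acc_out[ev * NR * NY * NX + k] = acc[k];    // peak search done elsewhere
    }
    // Launch: hough_kernel<<<nEvents, 32>>>(...);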

Page 15: TRIPLS

[Diagram: three hits (1, 2, 3) defining a candidate ring.]

In each thread the center of the ring is computed using three points (a "triplet"). For the same event, several triplets are examined at the same time. Not all possible triplets are considered: a fixed number is used, depending on the number of hits. Each thread fills a 2D histogram in order to decide the final center. The radius is then obtained from the center by averaging the distances to the hits.

The vector with the indices of the triplet combinations is loaded once into the constant memory at the beginning. The "noise" can induce "fake" centers, but the procedure has been demonstrated to converge for a sufficient number of combinations (at least for a not too small number of hits). A circumcenter sketch is given below.
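The geometric core of each thread is the circumcenter of three hits; a sketch (the triplet selection from constant memory and the 2D histogram vote are omitted):

    // Circumcenter of three hits a, b, c; returns false for a degenerate
    // (nearly collinear) triplet.
    __device__ bool circumcenter(float2 a, float2 b, float2 c, float2 *ctr)
    {
        float d = 2.f * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y));
        if (fabsf(d) < 1e-6f) return false;
        float a2 = a.x * a.x + a.y * a.y;
        float b2 = b.x * b.x + b.y * b.y;
        float c2 = c.x * c.x + c.y * c.y;
        ctr->x = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d;
        ctr->y = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d;
        return true;
    }
    // Each thread would take one precomputed triplet (i, j, k) from constant
    // memory, call circumcenter(), and vote in the 2D histogram; the radius
    // then follows as the mean distance of the hits from the chosen center.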

Page 16: MATH

It is a pure mathematical approach: conformal mapping. It is based on the inversion of the ring equation after the transformation

u_i = x_i / (x_i² + y_i²),  v_i = y_i / (x_i² + y_i²)

After a second-order approximation, the center and the radius are obtained by solving a linear system. The shared memory is not used at all. The solution can be obtained exploiting many cores for the same event or just one core per event: the latter has been proved to be more efficient by at least a factor of two.
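A hedged sketch of an algebraic least-squares circle fit of this family (a Kåsa-style fit in centered coordinates, one thread per event; the actual MATH kernel, based on the conformal transformation above, may differ in detail):

    __global__ void math_fit(const float2 *hits, const int *nhits,
                             int maxhits, float4 *out, int nev)  // out: (xc, yc, R, ok)
    {
        int ev = blockIdx.x * blockDim.x + threadIdx.x;
        if (ev >= nev) return;
        const float2 *h = &hits[ev * maxhits];
        int n = nhits[ev];

        float mx = 0.f, my = 0.f;                      // hit centroid
        for (int i = 0; i < n; ++i) { mx += h[i].x; my += h[i].y; }
        mx /= n; my /= n;

        float suu = 0, svv = 0, suv = 0, suuu = 0, svvv = 0, suvv = 0, svuu = 0;
        for (int i = 0; i < n; ++i) {
            float u = h[i].x - mx, v = h[i].y - my;    // centered coordinates
            suu += u * u;  svv += v * v;  suv += u * v;
            suuu += u * u * u;  svvv += v * v * v;
            suvv += u * v * v;  svuu += v * u * u;
        }
        // 2x2 linear system for the center (uc, vc), then the radius.
        float det = suu * svv - suv * suv;
        float uc = 0.5f * ((suuu + suvv) * svv - (svvv + svuu) * suv) / det;
        float vc = 0.5f * ((svvv + svuu) * suu - (suuu + suvv) * suv) / det;
        float R  = sqrtf(uc * uc + vc * vc + (suu + svv) / n);
        out[ev] = make_float4(uc + mx, vc + my, R, det != 0.f ? 1.f : 0.f);
    }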

Page 17: Resolution plots

Toy MC: 100000 rings generated with random center and random radius but with a fixed number of hits (17, for these plots), in 100 packets of 1000 events. POMH and DOMH give, as expected, very similar results. HOUGH shows secondary peaks due to convergence on a wrong radius. The HOUGH radius result is very strange: probably a bug in the code.

[Plots: R − R_gen residual distributions.]

Page 18: Resolution plots

TRIPL and MATH show better resolution with respect to the histogram-based algorithms. The CPU shows strange tails, mainly due to rings not fully included in the acceptance (the likelihood doesn't converge).

Page 19: Resolution vs Nhits

The resolution depends only slightly on the number of hits. The difference between X and Y is due to the different packing of the PMs in X and Y. In the last plot the HOUGH result is out of scale. The MATH resolution is better than the CPU resolution!

[Plots: X, Y and R resolution (cm) vs Nhits.]

Algo  | R resol. (cm)
POMH  | 0.7
DOMH  | 0.7
HOUGH | 2.6
TRIPL | 0.28
MATH  | 0.15
CPU   | 0.27

Page 20: Resolution plots (with noise)

In order to study the stability of the results against possible noise, random hits are added to the generated ring in the MC. A variable percentage of noise is considered (0%, 5%, 10%, 13%, 18%).

POMH, DOMH and HOUGH are only marginally influenced by the noise. For noise > 10%, non-Gaussian tails become predominant in the resolution of TRIPL and MATH, and shifts of the central value are observed at large noise levels.

Page 21: Noise plots

[Plots: resolution (cm) as a function of the noise percentage.]

The noise in the RICH is expected to be quite low, according to tests on the prototype.

Page 22: Time plots

The execution time is evaluated using an internal timestamp counter in the GPU (see the sketch after the table); the resolution of the counter is 1 µs. The time per event is obtained as an average over packets of 1000 events. The plots are obtained for events with 20 hits.

Algo  | Time (µs)
POMH  | 82.1 ± 0.7
DOMH  | 3.54 ± 0.01
HOUGH | 169 ± 4
TRIPL | 0.648 ± 0.005
MATH  | 0.050 ± 0.004
CPU   | 723 ± 5

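A minimal sketch of this kind of in-kernel timing (the work section and the cycles-to-µs conversion are placeholders; clock() is the CUDA device-side cycle counter):

    __global__ void timed_kernel(long long *d_cycles)
    {
        clock_t t0 = clock();                 // per-multiprocessor cycle counter
        // ... the ring-finding code under test goes here ...
        clock_t t1 = clock();
        if (threadIdx.x == 0)
            d_cycles[blockIdx.x] = (long long)(t1 - t0);
    }
    // Host side: time = cycles / core clock frequency; averaging over a packet
    // of 1000 events gives the per-event figures in the table above.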

Page 23: Time vs Nhits plots

The execution time depends on the number of hits. This dependence is quite small on the GPU (at least for the 4 faster algorithms) and larger on the CPU. The best result at the moment is 50 ns per ring with MATH, but…

[Plots: time per event (µs) vs Nhits.]

Page 24: MATH optimization

The time result in MATH depends on the total occupancy of concurrent threads in the GPU. In particular it depends on the dimension of a logical block (a multiprocessor internally schedules the execution of a block of N threads, but the real parallelization is done 16 threads at a time). It isn't a priori obvious which number of threads optimizes the execution: it depends on memory occupancy, single-thread time, divergence between threads in the block, … The earlier result was for 256 threads per block; the optimum is 16 threads per block (the plot above is for 17-hit events, the earlier result was for 20-hit events). A sketch of such a block-size scan follows below.

Ring computed in 4.4 ns !!!
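A sketch of how such a block-size scan can be timed with CUDA events (math_fit as sketched earlier; the data pointers and nEvents are assumed to exist):

    for (int threads = 16; threads <= 512; threads *= 2) {
        int blocks = (nEvents + threads - 1) / threads;
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        math_fit<<<blocks, threads>>>(d_hits, d_nhits, maxHits, d_out, nEvents);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);       // milliseconds
        printf("%3d threads/block: %.1f ns/event\n", threads, 1e6f * ms / nEvents);
        cudaEventDestroy(start); cudaEventDestroy(stop);
    }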

Page 25: Data transfer time

The other time we need to know is the time to transfer the data into the video card and the results back out. The transfer times are studied as a function of the packet size; the dependence is more or less linear. Obviously, to first approximation, they are independent of the algorithm.

For packets of 1000 events: RAM → GRAM = 120 µs, GRAM → RAM = 41 µs.

Some algorithms require extra time to transfer data into the constant memory (for example TRIPLS needs the index vector to choose among the various hit combinations). A measurement sketch is given below.
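A sketch of how the RAM → GRAM leg can be measured (Event as sketched earlier; pinned host memory via cudaMallocHost is assumed, since it makes PCIe transfers fast and reproducible):

    size_t bytes = 1000 * sizeof(Event);                      // one packet
    Event *h_buf; Event *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);                   // pinned host buffer
    cudaMalloc((void **)&d_buf, bytes);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // RAM -> GRAM
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.0f us per 1000-event packet\n", 1000.f * ms);
    // ... same pattern with cudaMemcpyDeviceToHost for GRAM -> RAM ...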

Page 26: Comparison between MATH and CPUMATH

It is quite unfair to compare the MATH result with the CPU result (in that case we get an improvement of a factor of 180000!!!). The CPUMATH (the MATH algorithm running on the CPU) gives good results in terms of resolution and time. In any case the timing result isn't very stable, depending on the activity of the CPU.

In this case the improvement is "only" a factor of 25. For instance, the L1 RICH farm would need a factor of 25 fewer processors to do the same job (essentially we need only one PC).

Page 27: Time budget in MATH

[Plot: time budget for packets of 1000 events; measured total ≈ 400 µs.]

Page 28: Time budget in MATH

For packets of 1000 events we have (real time for the operations): 120 µs to transfer the data into the video card, 4.4 µs to process all the events, 41 µs to transfer the results back to the PC.

The measured elapsed time to have the final results for packets of 1000 events is 400 µs! Actually, on the TESLA card it's possible to transfer data (a different packet) during the kernel execution in the processor; the new FERMI card has two separate transfer streams, for data and results, as sketched below. The "wrong" time budget for the kernel execution in the plots (1/3 of the total time) is due to the double call of the kernel for the "warm up" → it can probably be avoided for each packet. Other "tricks" can be used to decrease the total time → a further factor of two is probably still possible: 200 µs!
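A hedged sketch of the overlap with two CUDA streams (double buffering; the buffer arrays and sizes are placeholders). On the TESLA one copy engine can overlap the kernel; FERMI's second copy engine lets the results stream back concurrently as well:

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    for (int p = 0; p < nPackets; ++p) {
        int b = p & 1;                                     // alternate buffers
        cudaMemcpyAsync(d_hits[b], h_hits[p], packetBytes,
                        cudaMemcpyHostToDevice, s[b]);     // next packet in
        math_fit<<<nBlocks, nThreads, 0, s[b]>>>(d_hits[b], d_nhits[b],
                                                 maxHits, d_out[b], 1000);
        cudaMemcpyAsync(h_out[p], d_out[b], resultBytes,
                        cudaMemcpyDeviceToHost, s[b]);     // results out
    }
    cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
    // Host buffers must be pinned (cudaMallocHost) for the async copies to
    // actually overlap with the kernel execution.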

Page 29: Triggering with MATH

A single PC with a TESLA video card could process events at 10 MHz with 200 µs of latency. The L0 trigger level requires an overall latency below 1 ms → the GPU can be used to generate high-quality primitives! To complete the system we need three more elements:

X CARD: a smart gigabit receiver card in which the Ethernet protocol and the decoding are handled in the FPGA (the link brings headers, trailers, words with 3 hits of 10 bits each, … while the video card wants only hit coordinates), to avoid extra work in the CPU → an Altera Stratix IV development board has been bought in Pisa.

Real-time OS: any non-deterministic behavior should be avoided; with 200 µs of execution it's very easy to get fluctuations of several factors due to extra CPU activity → a group in Ferrara is studying the transfer time with standard and real-time OSes.

Transmitter card: in principle a standard Ethernet card, but the X CARD should be adapted to rebuild the Ethernet packet, adding the timestamp after a fixed latency (in order to avoid extra work in the CPU).

Page 30: Conclusions & ToDo

Several algorithms have been tested on the GPU, using a toy MC to study the resolution with and without noise. The processing time per event has been measured for each algorithm: the best result is 4.4 ns per ring! Including the transfer time, the latency for packets of 1000 events is around 200 µs (thanks to the linearity of the transfer time, a packet of 500 events is processed in 100 µs). This latency makes it possible to imagine a system that builds high-quality primitives for the RICH at L0.

ToDo: test (and modify) the algorithms for the 2-ring case; test the GPU with a real transfer from another PC (everything is ready in Pisa); test the new FERMI card (already bought in Pisa); final refinement of the procedure; start thinking about the X CARD; …