“GPUs in NA62 trigger system”
Gianluca Lamanna(CERN)
NA62 Collaboration meeting in Brussels 9.9.2010
NA62 in Brussels – Gianluca Lamanna – 9.9.2010

Outline
• Short reminder
• Use of GPUs in the NA62 trigger system
• Study of algorithm performance: resolution and timing
• Towards a working prototype
• Conclusions
GPU Idea: reminder
Nowadays GPUs (standard PC video card processors) are very powerful, with computing power exceeding 1 Teraflops. GPUs are designed for digital imaging, video games and computer graphics, where the main problem is applying the same operations to a large quantity of data (for instance moving an object, transforming part of an image, etc.), so the processor architecture is highly parallel. In recent years there have been several efforts to use GPUs for high performance computing in many fields (GPGPU).
Is it possible to use GPUs for “hard real-time” computing? Is it possible to build a trigger system based on GPUs for high performance online selection? Is it possible to select events with high efficiency using cheap off-the-shelf components?
GPU characteristics: reminder
• SIMD architecture
• Large number of cores
• The cores are grouped in multiprocessors sharing a small quantity of very fast on-chip memory
• Huge quantity of external memory accessible from every core
• Particular care is needed in programming the chip to exploit the architecture
• Big performance gains for parallelizable problems
• Two logical levels of parallelization: parallel algorithm and parallel execution

TESLA C1060:
• 240 cores
• 30 multiprocessors
• 4 GB RAM
• 102 GB/s memory bandwidth
• PCI-ex gen2 connection

The GPU (“device”) always has to be connected to a CPU (“host”)!
The “GPU trigger”
Standard trigger system: FE digitization → trigger primitives (pipeline) → trigger processor (L0) → PCs (L1). Custom HW up to L0, commercial PCs afterwards.
“Triggerless”: FE digitization → PCs → PCs. Commercial PCs only.
“Quasi-triggerless” with GPUs: FE digitization + buffer (+ trigger primitives) → PCs+GPU (L0) → PCs+GPU (L1). Commercial GPU system.
Benefits
The main advantage of using GPUs is a huge computing power in a compact system → cheap, off-the-shelf, a large consumer sector in continuous development, easy high-level programming (C/C++), a fully reconfigurable system, minimum custom hardware, and very innovative (nobody in the world is using GPUs for triggering at the moment!).
The software trigger levels are the natural place for the GPUs → reduction of farm size.
At L0 the GPUs could be used to design more flexible and efficient trigger algorithms based on high quality, fast reconstructed primitives → more physics channels collected, lower bandwidth used, higher purity for the main channels.
(Example plots: HyperCP, sgoldstino)
[See other physics example in my talks in Anacapri (2.9.2009) and at the NA62 Physics Handbook Workshop (12.12.2009)]
L1 “GPU trigger”
Chain: TEL62 → L0TP → (1 MHz) L1 PCs with GPUs → L1TP → (100 kHz) L2 PCs with GPUs.
The use of GPUs in the software trigger levels is straightforward: the GPUs act as “co-processors” that increase the power of the PCs, allowing faster decisions with a smaller number of CPU cores.
The RICH L1 size is dominated by the computing power: 4 Gigabit links from the 4 RICH TELL1s should, in principle, be managed by a single PC. Assuming 10 s of total time for L1 and 1 MHz of input rate, the time budget per event is 1 us.
The LKr is read at L1 at 100 kHz; about 173 Gb/s are produced. At least 20 PCs have to be dedicated to event building (assuming 10 GbE links after a switch). Each PC sees 5 kHz, resulting in 200 us of maximum processing time. Here too the GPUs could be employed to meet this time budget while avoiding a big increase in farm cost (in the TD we assumed 4 ms per event).
Assuming 200 us for ring reconstruction (and other overheads), we need 200 cores (25 PCs) to produce the L1 trigger primitives. Using GPUs, a reduction by a factor of 200 is not impossible (see later).
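The farm-sizing arithmetic above can be redone in a few lines (a sketch: the figure of 8 cores per PC is inferred from the "200 cores (25 PCs)" quoted in the text, the rest comes directly from the numbers on this slide):

```python
# Back-of-the-envelope check of the L1 farm sizing quoted above.
# Figures (1 MHz input, 200 us/event, 173 Gb/s, 10 GbE links, 100 kHz)
# come from the text; 8 cores per PC is an inferred assumption.

def cores_needed(rate_hz, time_per_event_s):
    """Cores needed for the processing to keep up with the input rate."""
    return rate_hz * time_per_event_s

rich_cores = cores_needed(1e6, 200e-6)   # RICH L1: 200 cores
rich_pcs = rich_cores / 8                # -> 25 PCs at 8 cores each

lkr_pcs = 173 / 10                       # LKr event building: ~18, i.e. "at least 20" PCs
lkr_rate_per_pc = 100e3 / 20             # 5 kHz per PC
lkr_budget_s = 1.0 / lkr_rate_per_pc     # 200 us max processing time per event

print(rich_cores, rich_pcs, lkr_rate_per_pc, lkr_budget_s)
```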
L0 “GPU trigger”
Chain: TEL62 (10 MHz) → L0 GPU (10 MHz) → L0TP (1 MHz), with a maximum latency of 1 ms.
In the L0 GPU one event has to be processed every 100 ns and the total latency should be less than 100 us!
• Data arrive → protocol stack possibly managed in the receiver card
• Transfer into the RAM → the non-deterministic behavior of the CPU should be avoided (real-time OS)
• Transfer of a packet of data into the GRAM on the video card → the PCI-ex gen2 is fast enough; concurrent transfer during processing
• Processing → as fast as possible!
• Send the results back to the PC → done!
Total: max 100 us
Example: RICH
Two TEL62 boards (each with 4 TDCBs and a GbE card) send data to the L0 GPU: readout ~50 MB/s, standard primitives ~22 MB/s, GPU primitives ~1.2 Gb/s.
2 TEL62 for one spot (1000 PMs). The GPU needs the positions of the PMs → 10 bits each, so 3 hits can be packed in a single word. 20 hits × 5 MHz / 3 ≈ 1.2 Gb/s. 1 link for readout, 1 link for standard primitives (histograms), 2 links for the hit positions for the GPU.
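The link-rate estimate can be cross-checked quickly (a sketch: the 32-bit word size is an assumption implied by "3 hits of 10 bits per word"; the quoted ~1.2 Gb/s presumably includes protocol overhead on top of the payload):

```python
# Redo the estimate: 20 hits/event at 5 MHz, 3 hits packed per word.
hits_per_event = 20
event_rate_hz = 5e6
hits_per_word = 3
bits_per_word = 32          # assumed word size (3 x 10-bit addresses + padding)

words_per_s = hits_per_event * event_rate_hz / hits_per_word
gbps = words_per_s * bits_per_word / 1e9
print(f"{gbps:.2f} Gb/s payload")   # just over 1 Gb/s before protocol overhead
```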
It would be very useful to have rings (center and radius) at L0: measurement of the particle velocity, first ingredient for PID (we also need the spectrometer), and the possibility of missing mass information (assuming the particle mass and the Pt kick). Very useful for rare decays (K→πγγ, K→πγ, Ke2γ, …).
Ring finding with GPU: generality
The parallelization offered by the GPU architecture can be exploited in two different ways:
• In the algorithm: parts of the same algorithm are executed on different cores in order to speed up the processing of a single event.
• Processing many events at the same time: each core processes a single event.
The two “ways” are usually mixed.
The “threads” running in a multiprocessor (8 cores each) communicate through the very fast shared memory (1 TB/s bandwidth). The data are stored in the huge global memory (a “packet” of N events is periodically copied into the global memory).
In general there are two approaches: each thread performs very simple operations → heavy use of shared memory; or each thread performs heavier computations → light use of shared memory.
On the CPU
Minimization of the log-likelihood

L = −2 Σ_i log f(d_i)

where d_i is the distance of hit i from the ring and f is the single-hit response, with dr0 = 4.95, σ = 2.9755 and A = −2aσ e^0.5 (dr0 + σ), inside the PM acceptance and constant outside (from the RICH test beam analysis by Antonino).
The minimization is done using the MINUIT package with ROOT.
POMH
Each PM (1000 of them) is considered as the center of a circle. For each center a histogram is filled with the distances between the center and the hits (<32). The whole processor is used for a single event (huge number of centers): each thread computes a few distances, and several histograms are computed in different shared memory spaces. This mapping is not natural for the processor, and it is not possible to process more than one event at a time (the parallelism is fully exploited to speed up the computation).
• Very important: particular care has to be taken when writing to shared memory to avoid “conflicts” → in this algorithm it is not easy to avoid them (in case of conflicts the memory writes are serialized → loss of time).
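The POMH logic can be sketched on the CPU in a few lines (a toy version with hypothetical names; in the real algorithm the candidate centers are the PM positions and each histogram lives in a separate shared memory space):

```python
import math

def pomh_ring_find(centers, hits, bin_cm=1.0, max_r=32.0):
    """Toy sketch of the POMH idea: every candidate center gets a
    histogram of center-hit distances; the center whose histogram has
    the tallest bin wins, and that bin gives the radius."""
    best = None
    nbins = int(max_r / bin_cm)
    for cx, cy in centers:
        hist = [0] * nbins
        for hx, hy in hits:
            d = math.hypot(hx - cx, hy - cy)
            if d < max_r:
                hist[int(d / bin_cm)] += 1
        peak = max(range(nbins), key=hist.__getitem__)
        if best is None or hist[peak] > best[0]:
            best = (hist[peak], (cx, cy), (peak + 0.5) * bin_cm)
    return best[1], best[2]   # (center, radius estimate)
```

With a perfect ring all distances from the true center fall in one bin, so that center wins outright; off-center candidates spread their distances over several bins.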
DOMH
Exactly the same algorithm as POMH, but with a different assignment of resources: each block is dedicated to a single event (1 event → M threads, each thread handling N PMs), using the shared memory for one histogram while the events sit in global memory. The system is exploited in a more natural way: several events are processed in parallel at the same time, and it is easier to avoid conflicts in shared and global memory.
HOUGH
Each hit is taken as the center of a test circle with a given radius; the ring center is the best common point of the test circles. PM positions → constant memory; hits → global memory; test circle prototypes → constant memory.
A 3D space of histograms in shared memory (2D X-Y grid vs test circle radius); limitations come from the total amount of shared memory (16K). One thread per center (hit) → 32 threads (in one thread block) per event.
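A CPU sketch of the voting step (toy grid and hypothetical names; on the GPU the (x, y, r) histogram has to fit in the 16K of shared memory, which is what limits the grid size):

```python
import math

def hough_ring(hits, radii, cell=1.0, samples=64):
    """Toy Hough sketch: every hit is the center of a test circle of
    trial radius r, and the ring center is where many test circles
    overlap.  Votes go into a coarse (x, y, r) histogram; the peak
    cell gives center and radius."""
    votes = {}
    for r in radii:
        for hx, hy in hits:
            # sample the test circle and vote for the cells it crosses
            for k in range(samples):
                t = 2.0 * math.pi * k / samples
                key = (int((hx + r * math.cos(t)) // cell),
                       int((hy + r * math.sin(t)) // cell), r)
                votes[key] = votes.get(key, 0) + 1
    (ix, iy, r), _ = max(votes.items(), key=lambda kv: kv[1])
    return (ix + 0.5) * cell, (iy + 0.5) * cell, r
```

At the true radius all test circles pass through the true center, so its cell piles up votes; at wrong radii the votes smear out, which is also why secondary peaks at wrong radii can appear.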
TRIPLS
In each thread the center of the ring is computed using three points (a “triplet”). For the same event, several triplets are examined at the same time. Not all possible triplets are considered: a fixed number, depending on the number of hits. Each thread fills a 2D histogram from which the final center is decided; the radius is then obtained by averaging the distances between the center and the hits.
The vector with the indices of the triplet combinations is loaded once into constant memory at the beginning. Noise can induce “fake” centers, but the procedure has been demonstrated to converge for a sufficient number of combinations (at least for a not too small number of hits).
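The triplet step can be sketched on the CPU (hypothetical names; for brevity the 2D histogram of centers is replaced by a plain average, and the triplets are just the first few combinations instead of a precomputed index vector):

```python
import math
from itertools import combinations

def circumcenter(p1, p2, p3):
    """Center of the circle through three points (None if collinear)."""
    ax, ay = p1; bx, by = p2; cx, cy = p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return ux, uy

def tripl_ring(hits, max_triplets=64):
    """Toy TRIPLS sketch: each (GPU) thread computes the circumcenter
    of one triplet of hits; the centers are then combined (here by an
    average) and the radius is the mean hit-center distance."""
    centers = []
    for tri in combinations(hits, 3):
        c = circumcenter(*tri)
        if c:
            centers.append(c)
        if len(centers) >= max_triplets:
            break
    cx = sum(x for x, _ in centers) / len(centers)
    cy = sum(y for _, y in centers) / len(centers)
    r = sum(math.hypot(x - cx, y - cy) for x, y in hits) / len(hits)
    return (cx, cy), r
```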
MATH
A purely mathematical approach: conformal mapping. It is based on the inversion of the ring equation after the transformation

u_i = x_i − x̄,  v_i = y_i − ȳ

After a second order approximation, the center and radius are obtained by solving a linear system. The shared memory is not used at all. The solution can be obtained either by exploiting many cores for the same event or just one core per event: the latter has been proved to be more efficient by at least a factor of two.
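The linear-system step can be illustrated with a standard algebraic circle fit (a sketch: this is the textbook least-squares “Kåsa” fit, in which the ring equation becomes linear in the parameters; the exact NA62 formulation may differ in detail):

```python
def algebraic_ring_fit(hits):
    """Least-squares circle fit: write the ring equation as
    x^2 + y^2 = 2a x + 2b y + c, which is linear in (a, b, c), and
    solve the 3x3 normal equations by Cramer's rule.  No iteration,
    no shared memory - one thread can do one event."""
    n = len(hits)
    sx = sum(x for x, _ in hits); sy = sum(y for _, y in hits)
    sxx = sum(x * x for x, _ in hits); syy = sum(y * y for _, y in hits)
    sxy = sum(x * y for x, y in hits)
    sxz = sum(x * (x * x + y * y) for x, y in hits)
    syz = sum(y * (x * x + y * y) for x, y in hits)
    sz = sum(x * x + y * y for x, y in hits)
    A = [[2 * sxx, 2 * sxy, sx], [2 * sxy, 2 * syy, sy], [2 * sx, 2 * sy, n]]
    rhs = [sxz, syz, sz]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    sol = []
    for j in range(3):                       # Cramer: replace column j with rhs
        m = [row[:] for row in A]
        for i in range(3):
            m[i][j] = rhs[i]
        sol.append(det3(m) / d)
    a, b, c = sol
    return (a, b), (c + a * a + b * b) ** 0.5   # center, radius
```

Since (x−a)² + (y−b)² = r² expands to x² + y² = 2ax + 2by + (r² − a² − b²), the radius is recovered as sqrt(c + a² + b²).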
Resolution plots
Toy MC: 100000 rings generated with random center and random radius but with a fixed number of hits (17 for these plots), in 100 packets of 1000 events. POMH and DOMH give, as expected, very similar results. HOUGH shows secondary peaks due to convergence at a wrong radius. The HOUGH radius result is very strange: probably a bug in the code.
(Plots: R − R_gen)
Resolution plots
TRIPL and MATH show better resolution than the “memory” algorithms. The CPU shows strange tails, mainly due to rings not fully contained in the acceptance (the likelihood does not converge).
Resolution vs Nhits
The resolution depends only slightly on the number of hits. The difference between X and Y is due to the different packing of the PMs in X and Y. In the last plot the HOUGH result is off scale. The MATH resolution is better than the CPU resolution!
Algo R resol. (cm)
POMH 0.7
DOMH 0.7
HOUGH 2.6
TRIPL 0.28
MATH 0.15
CPU 0.27
Resolution plots (with noise)
In order to study the stability of the results against possible noise, random hits are added to the generated ring in the MC; a variable percentage of noise is considered (0%, 5%, 10%, 13%, 18%).
POMH, DOMH and HOUGH are only marginally influenced by the noise. For noise >10%, non-Gaussian tails dominate the resolution of TRIPL and MATH, and shifts of the central value are observed at large noise values.
Noise plots
(Plots: resolution in cm as a function of the noise percentage.) The noise in the RICH is expected to be quite low, according to tests on the prototype.
Time plots
The execution time is evaluated using an internal timestamp counter in the GPU; the resolution of the counter is 1 us. The time per event is obtained as an average over packets of 1000 events. The plots are obtained for events with 20 hits.
Algo Time (us)
POMH 82.1±0.7
DOMH 3.54±0.01
HOUGH 169±4
TRIPL 0.648±0.005
MATH 0.050±0.004
CPU 723±5
Time vs Nhits plots
The execution time depends on the number of hits. This dependence is quite small on the GPU (at least for the 4 fastest algorithms) and larger on the CPU. The best result at the moment is 50 ns per ring with MATH, but…
MATH optimization
The MATH timing depends on the total occupancy of concurrent threads in the GPU, and in particular on the size of a logic block (a multiprocessor internally schedules the execution of a block of N threads, but the real parallelization is over 16 threads). The correct number of threads to optimize the execution is not obvious a priori: it depends on memory occupancy, single-thread time, divergence between the threads in a block, … The earlier result was for 256 threads per block; the optimum is 16 threads per block (the plot above is for 17-hit events, while the earlier result was for 20-hit events).
Ring computed in 4.4 ns!
Data transfer time
The other time we need to know is the time to transfer the data into the video card and the results out of it. The transfer times are studied as a function of the packet size: they are roughly linear in the packet size and, to first approximation, independent of the algorithm.
For packets of 1000 events: RAM → GRAM = 120 us, GRAM → RAM = 41 us.
Some algorithms require extra time to transfer data into constant memory (for example TRIPLS needs the index vector to choose among the various combinations of hits).
Comparison between MATH and CPUMATH
It is quite unfair to compare the MATH result with the CPU result (in that case we get an improvement of a factor of 180000!). CPUMATH (the MATH algorithm running on the CPU) gives good results in terms of both resolution and time, although its timing is not very stable, depending on the activity of the CPU.
In this case the improvement is “only” a factor of 25. For instance, the L1 RICH farm would need a factor of 25 fewer processors to do the same job (essentially we need only one PC).
Time budget in MATH

(Plot: elapsed-time breakdown; total ≈ 400 us per 1000-event packet)
Time budget in MATH
For a packet of 1000 events the real time for the individual operations is: 120 us to transfer the data into the video card, 4.4 us to process all the events, 41 us to transfer the results back to the PC.
The measured elapsed time to have the final results for a packet of 1000 events, however, is 400 us! Actually, on the TESLA card it is possible to transfer data (for a different packet) during the kernel execution; on the new FERMI cards it is possible to have two separate transfer streams, for data and results. The “wrong” time budget for the kernel execution in the plots (1/3 of the total time) is due to the double call of the kernel for the “warm up” → this can probably be avoided for each packet. Other “tricks” can be used to decrease the total time → a further factor of two is probably still possible: 200 us!
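The overlap argument can be put in numbers (a sketch using the figures quoted above; the fully-overlapped case is an idealization of concurrent transfer and of the FERMI dual-stream setup):

```python
# Per-event cost for a 1000-event packet, from the measured pieces:
# 120 us transfer in, 4.4 us kernel, 41 us transfer out.
events = 1000
t_in, t_kernel, t_out = 120e-6, 4.4e-6, 41e-6

sequential = t_in + t_kernel + t_out     # no overlap at all
overlapped = max(t_in, t_kernel, t_out)  # ideal pipelining: limited by the slowest stage

print(round(sequential / events * 1e9, 1), "ns/event without overlap")
print(round(overlapped / events * 1e9, 1), "ns/event fully overlapped")
```

Even without any overlap the three measured pieces add up to well under 200 us per packet; the measured 400 us shows that other overheads (such as the warm-up kernel call) currently dominate.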
Triggering with MATH
A single PC with a TESLA video card could process events at 10 MHz with 200 us of latency. The L0 trigger level allows an overall latency of 1 ms → the GPU can be used to generate high quality primitives! To complete the system we need three more elements:
• X CARD: a smart gigabit receiver card in which the ethernet protocol and the decoding are handled in the FPGA (the link carries headers, trailers and words with 3 hits of 10 bits each, while the video card wants only hit coordinates), to avoid extra work in the CPU → an Altera Stratix IV development board has been bought in Pisa.
• Real-time OS: any non-deterministic behavior should be avoided; with 200 us of execution time it is very easy to get fluctuations of several factors due to extra CPU activity → a group in Ferrara is studying the transfer time with standard and real-time OSes.
• Transmitter card: in principle a standard ethernet card, but the X CARD should be adapted to rebuild the ethernet packet, adding the timestamp after a fixed latency (in order to avoid extra work in the CPU).
Conclusions & ToDo
Several algorithms have been tested on the GPU, using a toy MC to study the resolution with and without noise. The processing time per event has been measured for each algorithm: the best result is 4.4 ns per ring! Including the transfer time, the latency for a packet of 1000 events is around 200 us (thanks to the linearity of the transfer time, a packet of 500 events is processed in 100 us). This latency makes it possible to imagine a system building high quality primitives for the RICH at L0.
ToDo: test (and modify) the algorithms for the 2-ring case; test the GPU with a real transfer from another PC (everything is ready in Pisa); test the new FERMI card (already bought in Pisa); final refinement of the procedure; start thinking about the X Card; …