
Page 1: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 1

www.cmgindia.org

HPC Tutorial
Manoj Nambiar, Performance Engineering Innovation Labs
Parallelization and Optimization CoE

Page 2: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 2

A Common Expectation

Our ERP application has slowed down. All the departments are complaining.

Let's use HPC!

Page 3: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 3

Agenda

• Part I
– A sample domain problem
– Hardware & Software

• Part II – Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation
– OpenFOAM - CFD application (if time permits)

Page 4: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 4


Designing an Airplane for performance ……

Problem: Calculate Total Lift and Drag on the plane for a wind-speed of 150 m/s

Page 5: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 5

Performance Assurance – Airplanes vs Software

Assurance Approach    Airplane               Software
Testing               Wind tunnel testing    Load testing with virtual users
Simulation            CFD simulation         Discrete event simulation
Analytical            None                   MVA, BCMP, M/M/k, etc.

(Slide annotation: accuracy and cost increase toward the testing approaches.)

Page 6: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 6


CFD Example – Problem Decomposition

Methodology

1. Partition the volume into cells
2. For a number of time steps:
   2.a For each cell:
       2.a.1 Calculate velocities
       2.a.2 Calculate pressure
       2.a.3 Calculate turbulence

All cells have to be in equilibrium with each other. This becomes a large Ax = b problem. The problem is partitioned into groups of cells which are assigned to CPUs. Each CPU can compute in parallel, but the CPUs also have to communicate with each other.

Page 7: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 7

A serial algorithm for Ax = b

Compute complexity – O(n²)
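The slide's serial solver did not survive the transcript; below is a minimal sketch of one Gauss-Seidel sweep for Ax = b in C (the function name, array layout and convergence check are illustrative assumptions, not the original code). Each sweep touches every element of the n×n matrix, which is where the O(n²) per-sweep cost comes from.

    #include <math.h>

    /* One Gauss-Seidel sweep for a dense n x n system A*x = b.
       Returns the largest change made to any x[i] so the caller
       can test for convergence. (Illustrative sketch.) */
    double gauss_seidel_sweep(int n, const double *A, const double *b, double *x)
    {
        double max_delta = 0.0;
        for (int i = 0; i < n; i++) {
            double sigma = 0.0;
            for (int j = 0; j < n; j++)      /* O(n) work per row ...            */
                if (j != i)
                    sigma += A[i * n + j] * x[j];
            double x_new = (b[i] - sigma) / A[i * n + i];
            double delta = fabs(x_new - x[i]);
            if (delta > max_delta)
                max_delta = delta;
            x[i] = x_new;                    /* updated value reused immediately */
        }
        return max_delta;                    /* ... O(n^2) per sweep overall     */
    }

A caller would repeat sweeps until the returned maximum change drops below a tolerance.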

Page 8: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 8

What kind of H/W and S/W do we need?

• Take the example Ax = b solver
– The order of computational complexity is n²
– where n is the number of cells into which the domain is divided

• The higher the number of cells, the higher the accuracy

• Typical number of cells: in the tens of millions

• Prohibitively expensive to run sequentially

• The increase in memory requirements will need a proportionally higher number of servers

A parallel implementation is needed on a large cluster of servers.

Page 9: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 9

Software

• Let's look at the software aspects first

– Then we will look at the hardware

Page 10: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 10

Workload Balancing

• After solving Ax = b
– Some elements of x need to be exchanged with neighbor groups
– Every group (process) has to send values to and receive values from its neighbors
• For the next Gauss-Seidel iteration

• Also need to check that all values of x have converged

Should this be done using TCP/IP or a 3-tier web/app/database architecture?

Page 11: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 11

Why TCP/IP won't suffice

• Philosophically – NO
– These parallel programs are peers
– No one process is a client or a server

• Technically – NO
– There can be as many as 10,000 parallel processes
• You would need to keep a directory of the public IP and port of each process
– TCP is a stream-oriented protocol
• Applications need to pass messages

• Changing the size of the cluster is tedious

Page 12: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 12

Why a 3-tier application will not suffice

• 3-tier applications are meant to serve end-user transactions
– This application is not transactional

• A database is not needed for these applications
– No need to first persist and then read data
• This kind of I/O impacts performance significantly
• Better to store the data in RAM
– ACID properties of a database are not required
• The applications are not transactional in nature
– SQL is a major overhead considering the data-velocity requirements

• Managed frameworks like J2EE and .NET are not optimal for such requirements

Page 13: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 13

MPI to the rescue

• A message oriented interface

• Has an API spanning around 300 functions
– Supports complex messaging requirements

• A very simple interface for parallel programming

• Also portable regardless of the size of the deployment cluster

Page 14: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 14

MPI_Functions

• MPI_Send

• MPI_Recv

• MPI_Wait

• MPI_Reduce
– SUM
– MIN
– MAX
– …

Page 15: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 15

Not so intuitive MPI calls

• MPI_Allgather(v)

• MPI_Scatter(v)

• MPI_Gather(v)

• MPI_Alltoall(v)
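These collectives are easier to grasp with a tiny example. The sketch below is illustrative (not from the slides): the root scatters an array across all ranks with MPI_Scatter, each rank works on its chunk, and MPI_Gather collects the results back in rank order.

    #include <mpi.h>
    #include <stdio.h>

    #define CHUNK 4                      /* elements handled per rank (assumed) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double full[1024];               /* only meaningful on the root          */
        double part[CHUNK];
        if (rank == 0)
            for (int i = 0; i < nprocs * CHUNK; i++) full[i] = i;

        /* Root distributes CHUNK elements to every rank (including itself). */
        MPI_Scatter(full, CHUNK, MPI_DOUBLE, part, CHUNK, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        for (int i = 0; i < CHUNK; i++)  /* each rank works on its own chunk     */
            part[i] *= 2.0;

        /* Root collects the processed chunks back in rank order. */
        MPI_Gather(part, CHUNK, MPI_DOUBLE, full, CHUNK, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("full[0..3] = %g %g %g %g\n", full[0], full[1], full[2], full[3]);
        MPI_Finalize();
        return 0;
    }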


Page 19: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 19

Sample MPI program – parallel addition of a large array
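The code figure for this slide did not survive the transcript. Here is a minimal sketch assuming the usual approach (each rank sums its own slice and MPI_Reduce combines the partial sums on rank 0); the array size and stand-in data are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Parallel addition of a large array: every rank sums its own slice,
       then MPI_Reduce adds the partial sums on rank 0. (Sketch only.) */
    int main(int argc, char **argv)
    {
        const long N = 100000000L;       /* total number of elements (assumed)  */
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long chunk = N / nprocs;         /* assume N divides evenly for brevity */
        double *slice = malloc(chunk * sizeof(double));
        for (long i = 0; i < chunk; i++)
            slice[i] = 1.0;              /* stand-in data; real code would load it */

        double local = 0.0, total = 0.0;
        for (long i = 0; i < chunk; i++)
            local += slice[i];

        /* Combine the partial sums; only rank 0 receives the result. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);

        free(slice);
        MPI_Finalize();
        return 0;
    }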

Page 20: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 20

MPI – Send, Recv and Wait

If you have some computation that can be done while waiting to receive a message from a peer, this is the place to do it.
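A minimal sketch of the pattern the slide describes, using the non-blocking MPI_Isend/MPI_Irecv with a wait at the end (here MPI_Waitall; the neighbour rank, tags and buffer sizes are illustrative):

    #include <mpi.h>

    /* Overlap computation with communication: post the non-blocking
       send/receive, do useful work, then wait for completion. */
    void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                              int neighbour, double *work, int n)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, count, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Computation that does not depend on recvbuf goes here,
           hiding the communication latency. */
        for (int i = 0; i < n; i++)
            work[i] *= 2.0;

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* now recvbuf is safe to use */
    }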

Page 21: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 21

Hardware

• Let's look at the hardware

– Clusters
– Servers
– Coprocessors
– Parallel file system

Page 22: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 22

HPC Cluster

Not very different from regular data center clusters

Page 24: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 24

Parallelism in Hardware

• Multi-server/Multi-node

• Multi-sockets

• Multi-core

• Co-processors
– Many-core
– GPU

• Vector Processing

Multi-socket server board

Multi-core CPU

Page 25: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 25

Coprocessor - GPU

• SM – Streaming Multiprocessor
• Device RAM – high-speed GDDR5 RAM
• Extreme multi-threading – thousands of threads

PCIE Card

Page 26: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 26

Inside a GPU streaming multiprocessor (SM)

• An SM can be compared to a CPU core

• A GPU core is essentially an ALU

• All cores execute the same instruction at a time
– What happens to “if-then-else”?

• A warp is the software equivalent of a CPU thread
– Scheduled independently
– A warp instruction is executed by all cores at a time

• Many warps can be scheduled on an SM
– Just like many threads on a CPU
– While one warp is scheduled to run, other warps are moving data

• A collection of warps running concurrently on an SM makes a block
– Conversely, an SM can run only one block at a time

Efficiency is achieved when there is one warp in each stage of the execution pipeline.

Page 27: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 27

How S/W runs on the GPU

1. A CPU process/thread initiates data transfer from CPU memory to GPU memory

2. The CPU invokes a function (a kernel) that runs on the GPU
– The CPU specifies the number of blocks and the number of threads per block
– Each block is scheduled on one SM
– After all blocks complete execution, the CPU is woken up

3. The CPU fetches the kernel output from the GPU memory

This is known as the offload mode of execution.
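A minimal CUDA sketch of the three offload steps above (the array size, kernel name and launch configuration are illustrative, not from the slides):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *x, int n)          /* the kernel runs on the GPU */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* step 1: CPU -> GPU    */

        scale<<<(n + 255) / 256, 256>>>(d, n);            /* step 2: launch kernel */
        cudaDeviceSynchronize();                          /* CPU waits for the GPU */

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* step 3: GPU -> CPU    */
        printf("h[0] = %f\n", h[0]);

        cudaFree(d); free(h);
        return 0;
    }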

Page 28: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 28

Co-Processor – Many Integrated Core (MIC)

• Cores are similar to Intel Pentium CPUs
– With vector-processing instructions

• The L2 cache is accessible by all the cores

Execution modes
• Native
• Offload
• Symmetric

Page 29: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 29

What is vector processing?

[Figure: an ordinary CPU-core ALU takes one pair of operands A, B and produces one result C per instruction, while a vector-capable ALU loads A1–A8 and B1–B8 into vector registers and produces C1–C8 in one instruction.]

ALU in an ordinary CPU core – 1 arithmetic operation per instruction cycle:

    for (i = 0; i < 8; i++) c[i] = a[i] + b[i];   /* eight scalar ADD C, A, B instructions */

ALU in a CPU core with vector processing – 8 arithmetic operations per instruction cycle:

    VADD C, A, B
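For concreteness, here is a hedged sketch of the same eight-wide addition using x86 AVX intrinsics in C (8 single-precision lanes per 256-bit register); the function and array names are illustrative:

    #include <immintrin.h>

    /* c[i] = a[i] + b[i] for 8 floats in a single vector instruction. */
    void vec_add8(const float *a, const float *b, float *c)
    {
        __m256 va = _mm256_loadu_ps(a);      /* load 8 floats     */
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_add_ps(va, vb);   /* one 8-wide add    */
        _mm256_storeu_ps(c, vc);             /* store 8 results   */
    }

    /* Scalar reference: eight separate additions. */
    void scalar_add8(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < 8; i++)
            c[i] = a[i] + b[i];
    }

Compile with an AVX-enabled flag (e.g. -mavx); in practice the compiler will often auto-vectorize the scalar loop into exactly this form.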

Page 30: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 30

HPC Networks – Bandwidth and Latency

Page 31: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 31

Hierarchical network

• The most intuitive design of a network
– Not uncommon in data centers

• What happens when the first 8 nodes need to communicate with the next 8?
– Remember that all links have the same bandwidth

Top of rack

End of row switch

Page 32: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 32

Clos Network

• Can be likened to a replicated hierarchical network
– All nodes can talk to all other nodes
– Dynamic routing capability is essential in the switches

Page 33: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 33

Common HPC Network Technology - Infiniband

• Technology used for building high-throughput, low-latency networks
– Competes with Ethernet

• To use InfiniBand you need
– A separate NIC on each server
– An InfiniBand switch
– InfiniBand cables

• Messaging supported in InfiniBand
– A direct memory access read from, or write to, a remote node (RDMA)
– A channel send or receive
– A transaction-based operation (that can be reversed)
– A multicast transmission
– An atomic operation

Page 34: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 34

Parallel File Systems - Lustre

• Parallel file systems give the same file-system interface to legacy applications
• Can be built out of commodity hardware and storage

Page 35: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 35

HPC Applications – Modeling and Simulation

• Aerodynamics
– Vehicular design

• Energy and Resources
– Seismic Analysis
– Geo-Physics
– Mining

• Molecular Dynamics
– Drug Discovery
– Structural Biology

• Weather Forecasting

[Figure: the design flow "Simulation OR Physical Experimentation → Prototype → Lab Verification → Final Design", with Accuracy, Speed, Power and Cost as the trade-offs behind the question "HPC or no HPC?" – from natural science to software.]

Page 36: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 36

Relatively Newer & Upcoming Applications

• Finance
– Risk Computations
– Options Pricing
– Fraud Detection
– Low-Latency Trading

• Image Processing
– Medical Imaging
– Image Analysis
– Enhancement and Restoration

• Bio-Informatics
– Genomics

• Video Analytics
– Face Detection
– Surveillance

• Internet of Things
– Smart City
– Smart Water
– eHealth

Knowledge of core algorithms is key.

Page 37: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 37

Technology Trends Impacting Performance & Availability

• Multi-core – clock speeds are not increasing

• Memory evolution
– Lower memory per core
– Relatively low memory bandwidth
– Deep cache & memory hierarchies

• Heterogeneous computing
– Coprocessors

• Vector processing

• Temperature-fluctuation-induced slowdowns

• Memory-error-induced slowdowns

• Network communication errors

• Large clusters – increased failure probability

Algorithms need to be re-engineered to make the best use of these trends.

Page 38: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 38

Knowing Performance Bounds

• Amdahl's Law
– Maximum achievable speedup on p processors: Sp = 1 / (s + (1 − s)/p)
– where s is the fraction of the code that has to run sequentially and p is the number of processors

It is also important to take problem size into account when estimating speedups.

The compute-to-communication ratio is key. Typically, the higher the problem size, the higher the ratio, and the better the speedup.
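As a quick worked example (numbers chosen purely for illustration): if s = 0.05 (5% of the work is inherently serial) and p = 16 processors, then Sp = 1 / (0.05 + 0.95/16) = 1 / 0.109375 ≈ 9.1, and even with p → ∞ the speedup can never exceed 1/s = 20.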

Page 40: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 40

FLOPS and Bandwidth dependencies

• FLOPS – floating-point operations per second, determined by
– Frequency
– Number of CPU sockets
– Number of cores per socket
– Number of hyper-threads per core
– Number of vector units per core / hyper-thread

• Bandwidths (bytes/sec), determined by
– Level in the hierarchy – registers, L1, L2, L3, DRAM
– Serial / parallel access
– Whether the memory is attached to the same CPU socket or another CPU

Why are we not talking about memory latencies?

Page 41: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 41

Know your performance bounds

• The above information can also be obtained from product data sheets
• What do you gain by knowing performance bounds?


Page 42: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 42

Other ways to gauge performance

• CPU speed
– SPEC integer and floating-point benchmarks

• Memory bandwidth
– STREAM benchmark

Page 43: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 43

Basic Problem

• Consider the following code:

    double a[N], b[N], c[N], d[N];
    int i;
    for (i = 0; i < N-1; i++)
        a[i] = b[i] + c[i]*d[i];

• If N = 10^12

• And the code has to complete in 1 second
– How many Xeon E5-2670 CPU sockets would you need?
– Is this memory bound or CPU bound?
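A back-of-the-envelope answer, under the usual assumptions (double precision, data streamed from DRAM): each iteration reads three 8-byte values and writes one (32 bytes) and performs 2 floating-point operations (one multiply, one add). For N ≈ 10^12 that is about 2×10^12 FLOPs and 3.2×10^13 bytes of memory traffic in one second. A Xeon E5-2670 socket peaks at roughly 166 GFLOPS (8 cores × 2.6 GHz × 8 DP FLOPs/cycle) and roughly 51.2 GB/s of memory bandwidth, so the compute side needs only ~12 sockets while the bandwidth side needs ~625 sockets – the kernel is firmly memory bound.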

Page 44: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 44

General guiding principles for performance optimization

• Minimize communication requirements between parallel processes / threads

• If communication is essential, then
– Hide communication delays by overlapping compute and communication

• Maximize data locality
– Helps caching
– Good NUMA page placement

• Do not forget to use compiler optimization flags

• Implement a weighted decomposition of the workload
– In a cluster with heterogeneous compute capabilities

Let your profiling results guide you on the next steps

Page 45: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 45

Optimization Guidelines for GPU Platforms

• Minimize use of “if-then-else” or any other branching
– They cause divergence

• Tune the number of threads per block
– Too many will exhaust caches and registers in the SM
– Too few will underutilize GPU capacity

• Use device memory for constants

• Use shared memory for frequently accessed data

• Use sequential memory access instead of strided

• Coalesce memory accesses

• Use streams to overlap compute and communications

Page 46: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 46

Steps in designing parallel programs

• Partitioning

• Communication

• Agglomeration

• Mapping

[Figure: a data structure being partitioned into primitive tasks.]

Page 47: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 47

Steps in designing parallel programs

• Partitioning

• Communication

• Agglomeration

• Mapping

• Combine sender and receiver
• Eliminate communication
• Increase locality

• Combine senders and receivers
• Reduces the number of message transmissions

Page 48: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 48

Steps in designing parallel programs

• Partitioning

• Communication

• Agglomeration

• Mapping

[Figure: agglomerated tasks mapped onto NODE 1, NODE 2 and NODE 3.]

Page 49: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 49

Agenda

• Part I
– A sample domain problem
– Hardware & Software

• Part II – Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation
– OpenFOAM - CFD application on Xeon Phi (if time permits)

Page 50: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 50

Multi-core Performance Enhancement: Case Study

Page 51: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 51

Background

• Risk Management in a commodities exchange

• Risk computed post-trade
– Clearing and settlement at T+2

• Risk details updated on screen
– Alerting is controlled by human operators

Page 52: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 52

Commodities Exchange: Online Risk Management

[Figure: the Trading System feeds online trades to the Risk Management System, which raises alerts and can prevent a client or clearing member from trading when collateral falls short. Inputs driving the risk computation: the initial deposit of collateral, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day. A clearing member serves clients Client1 … ClientK.]

Page 53: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 53

Will a standard architecture on commodity servers suffice?

[Figure: a 2-CPU application server and a 2-CPU database server, with the question of whether the Risk Management System fits this template.]

Page 54: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 54Computer Measurement Group, India

Commodities Exchange: Online Risk Management

Computations:

• Position Monitoring, Mark to Market, P&L, Open Interest, Exposure Margins

• SPAN: Initial Margin (Scanning Risk), Inter-Commodity Spread Charge, Inter-Month Spread Charge, Short Option Margin, Net Option Value

• Collateral Management

The functionality is complex. Let's look at a simpler problem that reflects the same computational challenge and come back to the full system later.

Page 55: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 55Computer Measurement Group, India

Workload Requirements

• Trades/Day : 10 Million

• Peak Trades/Sec : 300

• Traders : 1 Million

Page 56: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 56Computer Measurement Group, India

P&L Computation

Trader A

Time   Txn    Stock   Quantity   Price   Total Amount
t1     BUY    Cisco   100        950     95,000
t2     BUY    IBM     200        30      6,000
t3     SELL   Cisco   40         975     39,000
t4     SELL   IBM     200        31      6,200

With the current Cisco price at 970:

Profit(Cisco, t4) = −95,000 + 39,000 + (100 − 40) × 970 = −56,000 + 58,200 = 2,200

In general, the profit on a given stock S at time t is the sum of the transaction values up to time t plus (net position on the stock at time t) × (price of the stock at time t), where buy transactions take a negative value and sell transactions a positive value.

Biggest culprit: the price-dependent term, which must be recomputed on every price change.

Page 57: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 57Computer Measurement Group, India

P&L Computation

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];  // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];   // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock]
                                   + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop
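For readers who want to run the baseline, here is a compact C rendering of the pseudocode above; the Trade struct, the constants and the absence of a trade source are illustrative assumptions, not the project's actual code.

    #define MAXTRADERS 100000
    #define MAXSTOCKS  100

    typedef struct { int buyer, seller, stock, quantity, price; } Trade;

    static int profit[MAXTRADERS];
    static int netpositions[MAXTRADERS][MAXSTOCKS];
    static int sumtxnvalue[MAXTRADERS][MAXSTOCKS];
    static int profitperstock[MAXTRADERS][MAXSTOCKS];

    /* Process one trade and refresh every trader's P&L for the traded stock. */
    void process_trade(const Trade *t)
    {
        sumtxnvalue[t->buyer][t->stock]  -= t->quantity * t->price;
        sumtxnvalue[t->seller][t->stock] += t->quantity * t->price;
        netpositions[t->buyer][t->stock]  += t->quantity;
        netpositions[t->seller][t->stock] -= t->quantity;

        for (int r = 0; r < MAXTRADERS; r++) {
            profit[r] -= profitperstock[r][t->stock];
            profitperstock[r][t->stock] =
                sumtxnvalue[r][t->stock] + netpositions[r][t->stock] * t->price;
            profit[r] += profitperstock[r][t->stock];
        }
    }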

Page 58: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 58Computer Measurement Group, India

• Profit has to be kept updated for every price change– For all traders

• Inner loop: 8 computations per trader
– 4 arithmetic computations (+ + * +)
– 1 loop-counter update
– 3 assignments

• Actual computational complexity
– About 20 times as complex as the displayed algorithm

• Number of traders: 1 million

P&L Computational Analysis

Page 59: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 59Computer Measurement Group, India

• SLA expectation: 300 trades/sec

• Computations per trade
– 8 computations × 1 million traders × 20 = 160 million

• Computations per second = 160 million × 300 trades/sec
– 48 billion computations/sec!

• Out of reach of contemporary servers at that time!

Can we deliver within an IT budget?

P&L Computational Analysis

Page 60: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 60Computer Measurement Group, India

Test Environment

• Server
– 8 Xeon 5560 cores
– 2.8 GHz
– 8 GB RAM

• OS: CentOS 5.3
– Linux kernel 2.6.18

• Programming language: C
– Compilers: gcc and icc

Page 61: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 61Computer Measurement Group, India

Test Inputs

Number of Trades    1 Million
Number of Traders   100,000
Number of Stocks    100
Trade File Size     20 MB

Trade Distribution
Trades %   Stock %
20%        30%
20%        60%
60%        10%

Page 62: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 62Computer Measurement Group, India

P&L Computation: Baselining

                            Trades/sec   Overall Gain
Baseline performance, gcc   190          –
gcc –O3                     323          70%

Page 63: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 63Computer Measurement Group, India 63

P&L Computation: Transpose

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];  // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];   // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock]
                                   + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop

[Figure: with the trader × stock layout, the inner loop for trade t reads one stock column, touching a different trader row (r1, r2, r3, …) on every access – Very Poor Caching.]

Page 64: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 64Computer Measurement Group, India

Matrix Layout

[Figure: the logical trader × stock matrix (rows r1, r2, r3, …; columns Stock s1 … si) and its memory layout – all of trader r1's per-stock entries (S1, S2, … Si) are contiguous, followed by trader r2's, trader r3's and trader r4's, so accessing one stock across all traders strides through memory.]

Page 65: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 65Computer Measurement Group, India

Matrix Layout - Optimized

[Figure: the transposed stock × trader matrix (rows S1, S2, S3; columns Trader r1 … rn) and its optimized memory layout – all traders' entries for Stock S1 (r1, r2, … rn) are contiguous, followed by Stock S2's, and so on, so the per-trade inner loop over traders walks sequential memory.]

Page 66: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 66Computer Measurement Group, India

P&L Computation: Transpose

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];  // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];   // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r]
                                   + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

[Figure: with the stock × trader layout, the inner loop for trade t walks one stock row across traders r1 … ri sequentially – Very Good Caching.]

Page 67: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 67Computer Measurement Group, India

P&L Computation: Transpose

                            Trades/sec   Overall Gain   Immediate Gain
Baseline performance, gcc   190          –              –
gcc –O3                     323          1.7X           1.7X
Transpose of Trader/Stock   4750         25X            14.7X

Intel Compiler
                            Trades/sec   Overall Gain   Immediate Gain
icc –fast (not –O3)         6850         36X            37%

Page 68: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 68Computer Measurement Group, India

P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];  // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];   // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r]
                                   + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

This can be maintained cumulatively for the trader. It need not be kept per stock.

Page 69: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 69Computer Measurement Group, India 69

P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];  // net positions per stock
int sumtxnvalue[MAXTRADERS];              // net transaction values per trader
int sumposvalue[MAXTRADERS];              // sum of netpositions * stock price
int ltp[MAXSTOCKS];                       // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

(sumposvalue[r] holds the monetary value of all of trader r's stock positions, valued at the price at the time of the trade.)

                      Trades/sec   Overall Gain   Immediate Gain
Use of Partial Sums   9650         50X            41%

Page 70: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 70Computer Measurement Group, India

P&L Computation: Skip Zero Values

int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop

The majority of the values in this matrix are 0, thanks to hot stocks.

                   Trades/sec   Overall Gain   Immediate Gain
Skip Zero Values   10,800       56X            12%

Page 71: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 71Computer Measurement Group, India

• There is a large percentage of cold stocks
– Those which are held by very few traders

• In the last optimization an “if” check was added to avoid computation
– When the trader does not hold the traded stock

• Is there any benefit if the trader record is not accessed at all?
– We are looping over 100,000 traders for every trade

P&L Computation: Cold Stocks

Page 72: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 72Computer Measurement Group, India

P&L Computation: Sparse Matrix Representation

Flags Table – which traders own this stock? (updated in the outer loop)

Stock   A   B   C   D   E
s1      1   1   0   0   0
s2      1   1   1   0   0
s3      1   0   0   1   1

Trader indexes per stock (traversed in the outer loop)

Stock   Count   T0   T1   T2   .   .
s1      2       A    B
s2      3       A    C    B
s3      3       A    E    D

Page 73: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 73Computer Measurement Group, India 73

P&L Computation: Sparse Matrix Representation

int profit[MAXTRADERS];                   // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];  // net positions per stock
int sumtxnvalue[MAXTRADERS];              // net transaction values per trader
int sumposvalue[MAXTRADERS];              // sum of netpositions * stock price
int ltp[MAXSTOCKS];                       // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r      // traverse the trader index list when the trader
                                // count for the stock is less than a threshold
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

                Trades/sec   Overall Gain   Immediate Gain
Sparse Matrix   36,000       189X           3.24X

Page 74: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 74Computer Measurement Group, India

P&L Computation: Clustering

Before – separate arrays (poor caching for the sparse-matrix lists):

int profit[MAXTRADERS];
int sumtxnvalue[MAXTRADERS];
int sumposvalue[MAXTRADERS];

After – clustered per-trader record (better caching performance!):

struct TraderRecord {
    int profit;
    int sumtxnvalue;
    int sumposvalue;
};

             Trades/sec   Overall Gain   Immediate Gain
Clustering   70,000       368X           94%

Page 75: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 75Computer Measurement Group, India

P&L Computation: Precompute Price Difference

                        Trades/sec   Overall Gain   Immediate Gain
Precompute Price Diff   75,000       394X           7%

int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

(t.price - ltp[t.stock]) is a loop invariant: compute it once and move it outside the loop.

Page 76: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 76Computer Measurement Group, India

P&L Computation: Loop Unrolling

                 Trades/sec   Overall Gain   Immediate Gain
Loop Unrolling   80,000       421X           7%

#pragma unroll
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop
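As an aside, this is roughly what unrolling by four looks like when written out by hand (a sketch under assumed names, not the project's code); #pragma unroll asks the compiler to perform the same transformation automatically.

    /* Inner loop unrolled by 4 (ntraders assumed to be a multiple of 4). */
    void update_profits_unrolled(int ntraders, int dprice, const int *netpos_row,
                                 const int *sumtxnvalue, int *sumposvalue, int *profit)
    {
        for (int r = 0; r < ntraders; r += 4) {
            if (netpos_row[r])     { sumposvalue[r]     += netpos_row[r]     * dprice;
                                     profit[r]     = sumtxnvalue[r]     + sumposvalue[r]; }
            if (netpos_row[r + 1]) { sumposvalue[r + 1] += netpos_row[r + 1] * dprice;
                                     profit[r + 1] = sumtxnvalue[r + 1] + sumposvalue[r + 1]; }
            if (netpos_row[r + 2]) { sumposvalue[r + 2] += netpos_row[r + 2] * dprice;
                                     profit[r + 2] = sumtxnvalue[r + 2] + sumposvalue[r + 2]; }
            if (netpos_row[r + 3]) { sumposvalue[r + 3] += netpos_row[r + 3] * dprice;
                                     profit[r + 3] = sumtxnvalue[r + 3] + sumposvalue[r + 3]; }
        }
    }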

Page 77: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 77Computer Measurement Group, India

Commodities Exchange: Online Risk Management

[Figure (recap of the earlier diagram): the Trading System feeds online trades to the Risk Management System, which raises alerts and can prevent a client or clearing member from trading when collateral falls short; inputs include the initial collateral deposit, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day. A clearing member serves clients Client1 … ClientK.]

Page 78: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 78Computer Measurement Group, India

P&L Computation: Batching of Trades

                          Trades/sec   Overall Gain   Immediate Gain
Batching of 100 trades    150,000      789X           1.88X
Batching of 1000 trades   400,000      2105X          2.67X

Batch n trades and use the ltp of the last trade   // increases risk by a small delay

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

So far all this is with only one thread!!!

Page 79: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 79Computer Measurement Group, India

P&L Computation: Use of Parallel Processing

          Trades/sec    Overall Gain   Immediate Gain
OpenMP    1.2 million   5368X          2.55X

#pragma omp parallel for with chunked scheduling (32 threads on an 8-core Intel server)
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop
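A hedged C sketch of what that parallelization might look like (the schedule, chunk size and function signature are illustrative; the real code may differ):

    #include <omp.h>

    /* Parallel per-trader update for one trade on the traded stock's row. */
    void update_profits_omp(int ntraders, int dprice,
                            const int *netpos_row, const int *sumtxnvalue,
                            int *sumposvalue, int *profit)
    {
        #pragma omp parallel for schedule(static, 1024) num_threads(32)
        for (int r = 0; r < ntraders; r++) {
            if (netpos_row[r] != 0) {
                sumposvalue[r] += netpos_row[r] * dprice;  /* each r is touched by one thread only */
                profit[r] = sumtxnvalue[r] + sumposvalue[r];
            }
        }
    }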

Page 80: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 80Computer Measurement Group, India

P&L Computation: Summary of Optimizations

Optimization                Trades/sec   Immediate Gain   Overall Gain
Baseline gcc                190          –                –
gcc –O3                     320          1.70X            1.7X
Transpose of Trader/Stock   4,750        14.70X           25X
Intel Compiler icc –fast    6,850        1.37X            36X
Use of Partial Sums         9,650        1.41X            50X
Skip Zero Values            10,800       1.12X            56X
Sparse Matrix               36,000       3.24X            189X
Clustering of Arrays        70,000       1.94X            368X
Precompute Price Diff       75,000       1.07X            394X
Loop Unrolling              80,000       1.07X            421X
Batching of 100 Trades      150,000      1.88X            789X
Batching of 1000 Trades     400,000      2.67X            2105X
OpenMP                      1,020,000    2.55X            5368X

(Everything up to the batching rows runs on a single thread; the OpenMP row uses 8 CPU cores and 32 threads.)

Page 81: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 81

Background: Lattice Boltzmann on GPU

Page 82: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 82Computer Measurement Group, India

2-D Square Lid Driven Cavity Problem

[Figure: a square cavity of side L with x and y axes; the fluid fills the cavity and the top lid moves with velocity U.]

Flow is generated by continuously moving the top lid at a constant velocity.

Page 83: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 83Computer Measurement Group, India

Level 1

Time (ms)   MGUPS      Remarks
520727.1    5.034192   Simply ported the CPU code to the GPU. Structures Node & Lattice in GPU global memory.

/* CPU code */
for (y = 0; y < (ny-2); y++) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

/* GPU code */
/* for (int y = 0; y < (ny-2); y++) { */
if (tid < (ny-2)) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

Replace the outer loop iterations with threads. Total threads = (ny-2); each thread works on (nx-2) grid points.

MGUPS = (GridSize × TimeIterations) / (Time × 1,000,000)

Page 84: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 84Computer Measurement Group, India

Level 1 (Cont.)

Page 85: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 85Computer Measurement Group, India

Level 2

Time (ms)   MGUPS      Remarks
115742      22.64899   Loop collapsing

/* GPU code, Level 1 */
if (tid < (ny-2)) {
    for (x = 0; x < (nx-2); x++) {
        --
    }
}

/* GPU code after loop collapsing (loop fusion) */
if (tid < ((ny-2)*(nx-2))) {
    y = (tid / (nx-2)) + 1;
    x = (tid % (nx-2)) + 1;
    --
}

Collapsing the two nested loops into one exposes massive parallelism. Total threads = (ny-2)*(nx-2); now each thread works on one grid point.

Page 86: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 86Computer Measurement Group, India

About GPU Constant Memory

Can be used for data that will not change over the course of kernel execution.

Define constant memory using __constant__. cudaMemcpyToSymbol copies data to constant memory. Constant memory is cached. Constant memory is read-only. Just 64 KB.

[Figure: on a Tesla C2075, the 14 SMs share the global memory and a separate, cached constant memory.]

Page 87: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 87Computer Measurement Group, India

Level 3

Time (ms)   MGUPS    Remarks
113061.8    23.186   Copied the Lattice structure into GPU constant memory

__constant__ Lattice lattice_dev_const[1];
cudaMemcpyToSymbol(lattice_dev_const, lattice, sizeof(Lattice));

typedef struct Lattice {
    int     Cs[9];
    int     Lattice_velocities[9][2];
    real_dt Lattice_constants[9][4];
    real_dt ek_i[9][9];
    real_dt w_k[9];
    real_dt ac_i[9];
    real_dt gamma9[9];
} Lattice;

Page 88: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 88Computer Measurement Group, India

Level 4

Time (ms)   MGUPS   Remarks
40044.5     65.5    Coalesced memory access pattern for the Node structure

typedef struct Node {    /* AoS, (ny*nx) elements */
    int     Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;

[Figure: the array-of-structures memory layout – Grid Point 0's fields (Type, Vel[2], Density, F[9], Ftmp[9]) are followed by Grid Point 1's, and so on.]

Page 89: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 89Computer Measurement Group, India

Level 4 (Cont.)

Page 90: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 90Computer Measurement Group, India

Level 4 (Cont.)

[Figure: with the AoS layout, when all threads (T-0, T-1, …) simultaneously access Density, consecutive threads hit addresses separated by a full Node-sized stride.]

Page 91: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 91Computer Measurement Group, India

Level 4 (Cont.)

[Figure: the same access shown for both layouts. With the AoS layout, all threads simultaneously accessing Density produce strided, inefficient accesses to global memory. With the restructured layout, the Density values of neighbouring grid points sit next to each other, so the threads' simultaneous accesses form a coalesced access pattern – efficient access of global memory.]

Page 92: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 92Computer Measurement Group, India

Level 4 (Cont.)

typedef struct Type    { int     *val; } Type;
typedef struct Vel     { real_dt *val; } Vel;
typedef struct Density { real_dt *val; } Density;
typedef struct F       { real_dt *val; } F;
typedef struct Ftmp    { real_dt *val; } Ftmp;

typedef struct Node_map {
    Type    type;
    Vel     vel[2];
    Density density;
    F       f[9];
    Ftmp    ftmp[9];
} Node_dev;

typedef struct Node {    /* AoS, (ny*nx) elements */
    int     Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;

Page 93: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 93Computer Measurement Group, India

Level 5

Time (ms)   MGUPS   Remarks
14492.6     180.9   Arithmetic optimizations

for (int k = 3; k < SPEEDS; k++) {
    // mk[k] = lattice_dev_const->gamma9[k] * mk[k];
    // mk[k] = lattice_dev_const->gamma9[k] * mk[k] / lattice_dev_const->w_k[k];
    mk[k] = lattice_dev_const->gamma9_div_wk[k] * mk[k];
}

for (int i = 0; i < SPEEDS; i++) {
    f_neq = 0.0;
    for (int k = 0; k < SPEEDS; k++) {
        // f_neq += ((lattice_dev_const->ek_i[k][i] * mk[k]) / lattice_dev_const->w_k[k]);
        f_neq += lattice_dev_const->ek_i[k][i] * mk[k];
    }
}

Page 94: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 94Computer Measurement Group, India

Level 5 (Cont.)

Page 95: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 95Computer Measurement Group, India

Level 6

Time (ms)     MGUPS        Remarks
8309.662109   315.468903   Algorithmic optimization

[Figure: the collision kernel reads the distributions and macroscopic fields and stores Ftmp to GPU global memory; after a global barrier, the streaming kernel loads Ftmp back from global memory and produces the inputs for the next step.]

Collision stores Ftmp to GPU global memory. Streaming loads Ftmp from GPU global memory. Global-memory load/store operations are expensive.

Page 96: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 96Computer Measurement Group, India

Level 6 (Cont.)

Collision → Streaming

Pulling Ftmp from the neighbours needs synchronization.

Page 97: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 97Computer Measurement Group, India

Level 6 (Cont.)

Collision → Streaming

Instead, push Ftmp to the neighbours – no need for synchronization.

Page 98: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 98Computer Measurement Group, India

Level 6 (Cont.)

Collision & streaming can be one kernel. This saves one load/store from/to global memory.

[Figure: the fused kernel reads the distributions and macroscopic fields and writes Ftmp directly, without the intermediate global-memory round trip between collision and streaming.]

Page 99: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 99Computer Measurement Group, India

Optimizations Achieved on GPU using CUDA

Level   Time (ms)     MGUPS (Million Grid Updates Per Second)   Remarks
1       520727.1      5.034192     Simply ported the CPU code to the GPU; Node & Lattice structures in GPU global memory
2       115742        22.64899     Loop collapsing
3       113061.8      23.186       Copied the Lattice structure into GPU constant memory
4       40044.5       65.5         Coalesced memory access pattern for the Node structure
5       14492.6       180.9        Arithmetic optimizations
6       8309.662109   315.468903   Algorithmic optimization

CUDA card: Tesla C2075 (448 cores, 14 SMs, Fermi, Compute Capability 2.0)

Page 100: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 100

Recap

• Part I
– A sample domain problem
– Hardware & Software

• Part II – Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation

Page 101: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 101

Closing Comments

• OLTP applications seldom require HPC technologies
– Unless the application needs to respond in microseconds
• Algorithmic trading, etc.

• Can HPC technologies be used to speed up my data-transformation (ETL/ELT) and reporting workloads?
– Sure – but you have to let go of the ease of using 3rd-party products and databases
• If you don't want to, customizing a specific bottleneck process could help
– Stay tuned to companies innovating in this space
• e.g. SQREAM, which implements database operations on GPUs

• Investing in an HPC cluster and technologies is not enough
– Also invest in people who understand
• The underlying technologies
• The applications

Page 102: Computer Measurement Group, India 0 0  HPC Tutorial Manoj Nambiar, Performance Engineering Innovation Labs Parallelization and Optimization

Computer Measurement Group, India 102

www.cmgindia.org

Q&A

[email protected]