TRANSCRIPT
Computer Measurement Group, India
www.cmgindia.org
HPC Tutorial
Manoj Nambiar, Performance Engineering Innovation Labs
Parallelization and Optimization CoE
A Common Expectation
“Our ERP application has slowed down. All the departments are complaining.”
“Let’s use HPC!”
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM – CFD application (if time permits)
Designing an Airplane for Performance …
Problem: Calculate the total lift and drag on the plane for a wind speed of 150 m/s.
Performance Assurance – Airplanes vs Software

Assurance Approach | Airplane            | Software
Testing            | Wind tunnel testing | Load testing with virtual users
Simulation         | CFD simulation      | Discrete event simulation
Analytical         | None                | MVA, BCMP, M/M/k etc.

(Accuracy and cost are highest for testing and lowest for analytical modeling.)
CFD Example – Problem Decomposition

Methodology
1. Partition the volume into cells
2. For a number of time steps:
   2.a For each cell:
      2.a.1 Calculate velocities
      2.a.2 Calculate pressure
      2.a.3 Calculate turbulence

All cells have to be in equilibrium with each other. This becomes a large Ax = b problem. The problem is partitioned into groups of cells, which are assigned to CPUs. Each CPU can compute in parallel, but the CPUs also have to communicate with each other.
A serial algorithm for Ax = b
Compute complexity: O(n²)
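The slide does not reproduce the algorithm itself; a minimal sketch of one such serial solver (Gauss-Seidel iteration, assuming a dense, row-major, diagonally dominant A, which is an assumption, not something the slide specifies) could look like this. Each sweep touches all n rows and all n columns, i.e. O(n²) work per iteration, matching the complexity above.

```c
#include <assert.h>
#include <math.h>

/* One Gauss-Seidel sweep over Ax = b, repeated `sweeps` times.
   A is n x n, row-major; the iteration converges for diagonally
   dominant matrices. Each sweep is O(n^2). */
void gauss_seidel(const double *A, const double *b, double *x,
                  int n, int sweeps)
{
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    sum -= A[i*n + j] * x[j];  /* x[j] is already updated for j < i */
            x[i] = sum / A[i*n + i];
        }
    }
}
```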
What kind of H/W and S/W do we need?
• Take the example Ax = b solver
  – The order of computational complexity is n²
  – where n is the number of cells into which the domain is divided
• The higher the number of cells, the higher the accuracy
• Typical number of cells: in the tens of millions
• Prohibitively expensive to run sequentially
• The increase in memory requirements will need a proportionally higher number of servers

A parallel implementation is needed on a large cluster of servers.
Software
• Let’s look at the software aspect first
  – Then we look at the hardware
Work Load Balancing
• After solving Ax = b
  – Some elements of x need to be exchanged with neighbor groups
  – Every group (process) has to send and receive values with its neighbors
    • For the next Gauss-Seidel iteration
• Also need to check that all values of x have converged

Should this use TCP/IP, or a 3-tier web/app/database architecture?
Why TCP/IP won’t suffice
• Philosophically – NO
  – These parallel programs are peers
  – No one process is client or server
• Technically – NO
  – There can be as many as 10,000 parallel processes
    • Need to keep a directory of public server IP and port for each process
  – TCP is a stream-oriented protocol
    • Applications need to pass messages
  – Changing the size of the cluster is tedious
Why a 3-tier application will not suffice
• 3-tier applications are meant to serve end-user transactions
  – This application is not transactional
• A database is not needed for these applications
  – No need to first persist and then read data
    • This kind of I/O will impact performance significantly
    • Better to store data in RAM
  – ACID properties of the database are not required
    • The application is not transactional in nature
  – SQL is a major overhead considering the data-velocity requirements
• Managed frameworks like J2EE and .NET are not optimal for such requirements
MPI to the rescue
• A message-oriented interface
• Has an API spanning some 300 functions
  – Supports complex messaging requirements
• A very simple interface for parallel programming
• Also portable regardless of the size of the deployment cluster
MPI Functions
• MPI_Send
• MPI_Recv
• MPI_Wait
• MPI_Reduce
  – SUM
  – MIN
  – MAX
  – …
Not-so-intuitive MPI calls
• MPI_Allgather(v)
• MPI_Scatter(v)
• MPI_Gather(v)
• MPI_Alltoall(v)
Sample MPI program – parallel addition of a large array
MPI – Send, Recv and Wait
If you have some computation to be done while waiting to receive a message from a peer, this is the place to do it.
Hardware
• Let’s look at the hardware
  – Clusters
  – Servers
  – Coprocessors
  – Parallel file system
HPC Cluster
Not very different from regular data center clusters
Now let’s look inside a server
Coprocessors go here
NUMA
Parallelism in Hardware
• Multi-server / multi-node
• Multi-socket
• Multi-core
• Coprocessors
  – Many-core
  – GPU
• Vector processing

Multi-socket server board
Multi-core CPU
Coprocessor – GPU
• SM – Streaming Multiprocessor
• Device RAM – high-speed GDDR5 RAM
• Extreme multi-threading – thousands of threads
PCIe card
Inside a GPU streaming multiprocessor (SM)
• An SM can be compared to a CPU core
• A GPU core is essentially an ALU
• All cores execute the same instruction at a time
  – What happens to “if-then-else”?
• A warp is the software equivalent of a CPU thread
  – Scheduled independently
  – A warp instruction is executed by all cores at a time
• Many warps can be scheduled on an SM
  – Just like many threads on a CPU
  – When one warp is scheduled to run, other warps are moving data
• A collection of warps concurrently running on an SM makes a block
  – Conversely, an SM can run only one block at a time

Efficiency is achieved when there is one warp in each stage of the execution pipeline.
How S/W runs on the GPU
1. A CPU process/thread initiates a data transfer from CPU memory to GPU memory
2. The CPU invokes a function (kernel) that runs on the GPU
   – The CPU specifies the number of blocks and threads per block
   – Each block is scheduled on one SM
   – After all blocks complete execution, the CPU is woken up
3. The CPU fetches the kernel output from GPU memory

This is known as the offload mode of execution.
Coprocessor – Many Integrated Core (MIC)
• Cores are the same as Intel Pentium CPUs
  – With vector-processing instructions
• The L2 cache is accessible by all the cores

Execution modes
• Native
• Offload
• Symmetric
What is vector processing?

[Diagram: an ALU in an ordinary CPU core takes scalar operands A and B and produces C – 1 arithmetic operation per instruction cycle (ADD C, A, B). An ALU in a CPU core with vector processing takes vector registers A1…A8 and B1…B8 and produces C1…C8 – 8 arithmetic operations per instruction cycle (VADD C, A, B).]

The vector instruction is the equivalent of:
for(i=0; i<8; i++) c[i] = a[i]+b[i];
HPC Networks – Bandwidth and Latency
Hierarchical network
• The most intuitive design of a network
  – Not uncommon in data centers
• What happens when the first 8 nodes need to communicate with the next 8?
  – Remember that all links have the same bandwidth

[Diagram: top-of-rack switches feeding an end-of-row switch.]
Clos Network
• Can be likened to a replicated hierarchical network
  – All nodes can talk to all other nodes
  – Dynamic routing capability is essential in the switches
Common HPC Network Technology – InfiniBand
• Technology used for building high-throughput, low-latency networks
  – Competes with Ethernet
• To use InfiniBand you need
  – A separate NIC on the server
  – An InfiniBand switch
  – An InfiniBand cable
• Messaging supported in InfiniBand:
  – a direct memory access (RDMA) read from, or write to, a remote node
  – a channel send or receive
  – a transaction-based operation (that can be reversed)
  – a multicast transmission
  – an atomic operation
Parallel File Systems – Lustre
• Parallel file systems give the same file-system interface to legacy applications
• Can be built out of commodity hardware and storage
HPC Applications – Modeling and Simulation
• Aerodynamics
  – Vehicular design
• Energy and resources
  – Seismic analysis
  – Geophysics
  – Mining
• Molecular dynamics
  – Drug discovery
  – Structural biology
• Weather forecasting

[Diagram: simulation vs physical experimentation in the design cycle – prototype, lab verification, final design – and the HPC-or-no-HPC trade-offs: accuracy, speed, power, cost. From natural science to software.]
Relatively Newer & Upcoming Applications
• Finance
  – Risk computations
  – Options pricing
  – Fraud detection
  – Low-latency trading
• Image processing
  – Medical imaging
  – Image analysis
  – Enhancement and restoration
• Bio-informatics
  – Genomics
• Video analytics
  – Face detection
  – Surveillance
• Internet of Things
  – Smart city
  – Smart water
  – eHealth

Knowledge of core algorithms is key.
Technology Trends Impacting Performance & Availability
• Multi-core; clock speeds not increasing
• Memory evolution
  – Lower memory per core
  – Relatively low memory bandwidth
  – Deep cache & memory hierarchies
• Heterogeneous computing
  – Coprocessors
• Vector processing

Availability concerns:
• Temperature-fluctuation-induced slowdowns
• Memory-error-induced slowdowns
• Network communication errors
• Large clusters – increased failure probability

Algorithms need to be re-engineered to make the best use of these trends.
Knowing Performance Bounds
• Amdahl’s Law
  – Maximum speedup achievable: S(p) = 1 / (s + (1-s)/p)
  – where s is the fraction of the code that has to run sequentially and p is the number of processors
• Also important to take problem size into account when estimating speedups
  – The compute-to-communication ratio is key
  – Typically, the higher the problem size, the higher the ratio, and the better the speedup
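The law above can be made concrete with a small helper; the numbers in the note below are plain consequences of the formula, not measurements.

```c
#include <assert.h>

/* Amdahl's law as stated on the slide: maximum speedup
   S(p) = 1 / (s + (1-s)/p), where s is the serial fraction
   and p the number of processors. */
double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / (double)p);
}
```

With s = 0.1 the speedup on 8 processors is only about 4.7x, and no processor count can push it past 1/s = 10x; this is why the serial fraction, and the problem size that shrinks it, matters so much.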
Quick Hardware Recap
FLOPS Bound
Bandwidth Bound
What about server clusters?
FLOPS and Bandwidth dependencies
• FLOPS – floating-point operations per second – depends on:
  – Frequency
  – Number of CPU sockets
  – Number of cores per socket
  – Number of hyper-threads per core
  – Number of vector units per core / hyper-thread
• Bandwidth (bytes/sec) depends on:
  – Level in the hierarchy – registers, L1, L2, L3, DRAM
  – Serial / parallel access
  – Whether memory is attached to the same CPU socket or another CPU

Why are we not talking about memory latencies?
Know your performance bounds
• The above information can also be obtained from product data sheets
• What do you gain by knowing performance bounds?
Other ways to gauge performance
• CPU speed
  – SPEC integer and floating-point benchmarks
• Memory bandwidth
  – STREAM benchmark
Basic Problem
• Consider the following code:
  double a[N], b[N], c[N], d[N];
  int i;
  for (i = 0; i < N; i++)
      a[i] = b[i] + c[i]*d[i];
• If N = 10¹² and the code has to complete in 1 second:
  – How many Xeon E5-2670 CPU sockets would you need?
  – Is this memory bound or CPU bound?
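A back-of-envelope answer, using publicly quoted data-sheet figures for the E5-2670 (about 51.2 GB/s of memory bandwidth and roughly 166 DP GFLOP/s per socket; both figures are assumptions taken from data sheets, not measurements):

```c
#include <assert.h>

/* Sizing a[i] = b[i] + c[i]*d[i] for N = 1e12 elements in 1 second.
   Per element: read b, c, d and write a (4 x 8 bytes), and do 2 FLOPs. */
double sockets_for_bandwidth(void)
{
    double bytes_per_sec = 1e12 * 4.0 * 8.0;   /* 32 TB/s of memory traffic */
    return bytes_per_sec / 51.2e9;             /* / per-socket bandwidth    */
}

double sockets_for_flops(void)
{
    double flops_per_sec = 1e12 * 2.0;         /* one multiply + one add    */
    return flops_per_sec / 166.4e9;            /* / per-socket peak DP rate */
}
```

Roughly 625 sockets to supply the bandwidth versus about 12 to supply the FLOPs: under these assumptions the loop is firmly memory bound.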
General guiding principles for performance optimization
• Minimize communication requirements between parallel processes / threads
• If communication is essential:
  – Hide communication delays by overlapping compute and communication
• Maximize data locality
  – Helps caching
  – Good NUMA page placement
• Do not forget to use compiler optimization flags
• Implement weighted decomposition of the workload
  – In a cluster with heterogeneous compute capabilities
Let your profiling results guide you on the next steps
Optimization Guidelines for GPU platforms
• Minimize use of “if-then-else” or any other branching
  – Branches cause divergence
• Tune the number of threads per block
  – Too many will exhaust caches and registers in the SM
  – Too few will under-utilize GPU capacity
• Use constant memory for constants
• Use shared memory for frequently accessed data
• Use sequential memory access instead of strided
• Coalesce memory accesses
• Use streams to overlap compute and communication
Steps in designing parallel programs
• Partitioning
• Communication
• Agglomeration
• Mapping

[Diagram: a data structure partitioned into primitive tasks.]
Steps in designing parallel programs (cont.) – Agglomeration
• Combine sender and receiver
  – Eliminates communication
  – Increases locality
• Combine senders and receivers
  – Reduces the number of message transmissions
Steps in designing parallel programs (cont.) – Mapping
[Diagram: agglomerated tasks mapped to NODE 1, NODE 2, NODE 3.]
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM – CFD application on Xeon Phi (if time permits)
Multi-core Performance Enhancement: Case Study
Background
• Risk management in a commodities exchange
• Risk computed post-trade
  – Clearing and settlement – T+2
• Risk details updated on screen
  – Alerting is controlled by human operators
Commodities Exchange: Online Risk Management

[Diagram: online trades flow from the Trading System to the Risk Management System, which raises alerts when collateral falls short and can prevent a client or clearing member from trading. Inputs: initial deposit of collateral, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day. A clearing member serves Client1 … ClientK.]
Will a standard architecture on commodity servers suffice?
[Diagram: an application server (2 CPUs) and a database server (2 CPUs) as the candidate Risk Management System.]
Commodities Exchange: Online Risk Management
Computations:
• Position monitoring, mark to market, P&L, open interest, exposure margins
• SPAN: initial margin (scanning risk), inter-commodity spread charge, inter-month spread charge, short option margin, net option value
• Collateral management

The functionality is complex. Let’s look at a simpler problem that reflects the same computational challenge, and come back to it later.
Workload Requirements
• Trades/day: 10 million
• Peak trades/sec: 300
• Traders: 1 million
P&L Computation

Trader A:
Time | Txn  | Stock | Quantity | Price | Total Amount
t1   | BUY  | Cisco | 100      | 950   | 95,000
t2   | BUY  | IBM   | 200      | 30    | 6,000
t3   | SELL | Cisco | 40       | 975   | 39,000
t4   | SELL | IBM   | 200      | 31    | 6,200

With the current Cisco price at 970:
Profit(Cisco, t4) = -95,000 + 39,000 + (100-40)*970 = -56,000 + 58,200 = 2,200

In general, the profit on a given stock S at time t
  = sum of txn values up to time t + (net position on the stock at time t) * price of the stock at time t
Buy txns take a negative value, sell txns a positive value.

Biggest culprit
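The worked example above can be checked mechanically; the values come straight from the table, and nothing new is assumed.

```c
#include <assert.h>

/* Trader A's Cisco P&L at t4, with the current Cisco price at 970.
   Buys contribute negative transaction value, sells positive; the
   remaining net position is valued at the current price. */
int profit_cisco_example(void)
{
    int sum_txn_value = -100*950 + 40*975;    /* BUY 100 @ 950, SELL 40 @ 975 */
    int net_position  = 100 - 40;             /* shares still held            */
    return sum_txn_value + net_position*970;  /* -95000 + 39000 + 58200       */
}
```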
P&L Computation

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];     // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];      // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop
P&L Computational Analysis
• Profit has to be kept updated for every price change
  – For all traders
• Inner loop: 8 computations
  – 4 computations (+ + * +)
  – Loop counter
  – 3 assignments
• Actual computational complexity
  – 20 times as complex as the displayed algorithm
• Number of traders: 1 million
P&L Computational Analysis
• SLA expectation: 300 trades/sec
• Computations/trade
  – 8 computations × 1 million traders × 20 = 160 million
• Computations/sec = 160 million × 300 trades/sec
  – 48 billion computations/sec!
• Out of reach of contemporary servers at that time!

Can we deliver within an IT budget?
Test Environment
• Server: 8 Xeon 5560 cores, 2.8 GHz, 8 GB RAM
• OS: CentOS 5.3, Linux kernel 2.6.18
• Programming language: C
• Compilers: gcc and icc
Test Inputs
Number of trades:  1 million
Number of traders: 100,000
Number of stocks:  100
Trade file size:   20 MB

Trade distribution:
Trades % | Stock %
20%      | 30%
20%      | 60%
60%      | 10%
P&L Computation: Baselining

Optimization    | Trades/sec | Overall Gain
Baseline (gcc)  | 190        | –
gcc –O3         | 323        | 70%
P&L Computation: Transpose

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];     // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];      // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop

[Diagram: trader × stock matrix (rows r1, r2, r3; columns s1 … si). A trade t touches one stock column across all trader rows – very poor caching.]
Matrix Layout

[Diagram: trader × stock matrix (rows r1, r2, r3; columns s1 … si). The memory layout is trader-major: all of trader r1’s stocks (S1, S2 … Si), then trader r2’s, and so on.]
Matrix Layout – Optimized

[Diagram: transposed, stock-major memory layout: all traders’ entries for stock S1 (r1, r2 … rn), then stock S2, and so on.]
P&L Computation: Transpose

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];      // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

[Diagram: stock × trader matrix (rows s1 … si; columns r1 … ri). A trade t now touches one contiguous stock row – very good caching.]
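The effect of the transpose can be illustrated with a toy pair of layouts (the sizes are made up; the point is the access stride, not the numbers):

```c
#include <assert.h>

/* In the original trader-major layout, visiting one stock across all
   traders jumps NSTOCKS ints per step; in the stock-major layout the
   same visit walks consecutive addresses, which caches and prefetchers
   handle far better. Both layouts hold identical data. */
#define NTRADERS 4
#define NSTOCKS  3

int trader_major[NTRADERS][NSTOCKS];   /* original layout   */
int stock_major[NSTOCKS][NTRADERS];    /* transposed layout */

long sum_stock_trader_major(int stock)   /* strided walk */
{
    long s = 0;
    for (int r = 0; r < NTRADERS; r++)
        s += trader_major[r][stock];
    return s;
}

long sum_stock_stock_major(int stock)    /* contiguous walk */
{
    long s = 0;
    for (int r = 0; r < NTRADERS; r++)
        s += stock_major[stock][r];
    return s;
}
```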
P&L Computation: Transpose

Optimization               | Trades/sec | Overall Gain | Immediate Gain
Baseline (gcc)             | 190        | –            | –
gcc –O3                    | 323        | 1.7X         | 1.7X
Transpose of trader/stock  | 4750       | 25X          | 14.7X

Intel compiler:
Optimization               | Trades/sec | Overall Gain | Immediate Gain
icc –fast (not –O3)        | 6850       | 36X          | 37%
P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];      // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];   // can be maintained cumulatively per trader – need not be per stock

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop
P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXTRADERS];                 // net transaction values
int sumposvalue[MAXTRADERS];                 // sum of netpositions * stock price (monetary value of all stock positions at the time of trade)
int ltp[MAXSTOCKS];                          // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

Optimization        | Trades/sec | Overall Gain | Immediate Gain
Use of partial sums | 9650       | 50X          | 41%
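The identity behind this optimization, profit = sumtxnvalue + sumposvalue with sumposvalue kept equal to netpositions × ltp, can be verified on the single-trader Cisco example from earlier. A zero-quantity "trade" stands in for the final price update, and the update ordering is an equivalent rearrangement of the slide's, not a copy of it.

```c
#include <assert.h>

/* Single-trader, single-stock version of the partial-sum scheme.
   qty > 0 is a buy, qty < 0 a sell; qty == 0 just moves the price. */
typedef struct { int qty; int price; } Trade;

int profit_partial_sums(const Trade *trades, int n)
{
    int sumtxn = 0, sumpos = 0, netpos = 0, ltp = 0;
    for (int i = 0; i < n; i++) {
        int q = trades[i].qty, p = trades[i].price;
        sumtxn -= q * p;               /* buys negative, sells positive   */
        sumpos += netpos * (p - ltp);  /* revalue old position at price p */
        netpos += q;
        sumpos += q * p;               /* new shares valued at p          */
        ltp = p;                       /* invariant: sumpos == netpos*ltp */
    }
    return sumtxn + sumpos;
}
```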
P&L Computation: Skip Zero Values

int netpositions[MAXSTOCKS][MAXTRADERS];   // the majority of these values are 0, thanks to hot stocks

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop

Optimization     | Trades/sec | Overall Gain | Immediate Gain
Skip zero values | 10800      | 56X          | 12%
P&L Computation: Cold Stocks
• There is a large percentage of cold stocks
  – Those which are held by very few traders
• In the last optimization, an “if” check was added to avoid computation
  – If the trader does not hold the traded stock
• Is there any benefit if the trader record is not accessed at all?
  – We are computing for 100,000 traders
P&L Computation: Sparse Matrix Representation

Flags table – which traders own this stock? (updated in the outer loop)
Stock | A | B | C | D | E
s1    | 1 | 1 | 0 | 0 | 0
s2    | 1 | 1 | 1 | 0 | 0
s3    | 1 | 0 | 0 | 1 | 1

Trader indexes per stock (traversed in the outer loop)
Stock | Count | T0 | T1 | T2 | …
s1    | 2     | A  | B  | 0  | 0
s2    | 3     | A  | C  | B  | 0
s3    | 3     | A  | E  | D  | 0
P&L Computation: Sparse Matrix Representation

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXTRADERS];                 // net transaction values
int sumposvalue[MAXTRADERS];                 // sum of netpositions * stock price
int ltp[MAXSTOCKS];                          // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r    // traverse the per-stock trader list when the trader count for the stock is below a threshold
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

Optimization  | Trades/sec | Overall Gain | Immediate Gain
Sparse matrix | 36000      | 189X         | 3.24X
P&L Computation: Clustering

Separate arrays – poor caching for sparse-matrix lists:
int profit[MAXTRADERS];
int sumtxnvalue[MAXTRADERS];
int sumposvalue[MAXTRADERS];

Clustered into one record – better caching performance!
struct TraderRecord {
    int profit;
    int sumtxnvalue;
    int sumposvalue;
};

Optimization | Trades/sec | Overall Gain | Immediate Gain
Clustering   | 70000      | 368X         | 94%
P&L Computation: Precompute Price Difference

int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);   // (t.price - ltp[t.stock]) is loop-invariant: move it outside the loop
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization          | Trades/sec | Overall Gain | Immediate Gain
Precompute price diff | 75000      | 394X         | 7%
P&L Computation: Loop Unrolling

#pragma unroll
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization   | Trades/sec | Overall Gain | Immediate Gain
Loop unrolling | 80000      | 421X         | 7%
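What the unroll pragma asks the compiler to do can be written out by hand on a simpler loop; this is illustrative only (the production loop also carries the zero check):

```c
#include <assert.h>

/* 4-way manual unroll of a reduction: one loop-bound test per 4 elements
   instead of per element, and 4 independent accumulators the CPU can
   execute in parallel. Assumes n is a multiple of 4, for brevity. */
long sum_unrolled(const int *v, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i+1];
        s2 += v[i+2];
        s3 += v[i+3];
    }
    return s0 + s1 + s2 + s3;
}
```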
Commodities Exchange: Online Risk Management

[Diagram, revisited: online trades flow from the Trading System to the Risk Management System, which raises alerts when collateral falls short and can prevent a client or clearing member from trading. Inputs: initial deposit of collateral, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day.]
P&L Computation: Batching of Trades

Batch n trades and use the ltp of the last trade.   // increases risk by a small delay

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization            | Trades/sec | Overall Gain | Immediate Gain
Batching of 100 trades  | 150000     | 789X         | 1.88X
Batching of 1000 trades | 400000     | 2105X        | 2.67X

So far, all of this is with only one thread!
P&L Computation: Use of Parallel Processing

#pragma omp parallel for (chunked schedule; 32 threads on an 8-core Intel server)
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization | Trades/sec  | Overall Gain | Immediate Gain
OpenMP       | 1.2 million | 5368X        | 2.55X
P&L Computation: Summary of Optimizations

Optimization               | Trades/sec | Immediate Gain | Overall Gain
Baseline (gcc)             | 190        | –              | –
gcc –O3                    | 320        | 1.70X          | 1.7X
Transpose of trader/stock  | 4750       | 14.70X         | 25X
Intel compiler (icc –fast) | 6850       | 1.37X          | 36X
Use of partial sums        | 9650       | 1.41X          | 50X
Skip zero values           | 10,800     | 1.12X          | 56X
Sparse matrix              | 36,000     | 3.24X          | 189X
Clustering of arrays       | 70,000     | 1.94X          | 368X
Precompute price diff      | 75,000     | 1.07X          | 394X
Loop unrolling             | 80,000     | 1.07X          | 421X
Batching of 100 trades     | 150,000    | 1.88X          | 789X
Batching of 1000 trades    | 400,000    | 2.67X          | 2105X
OpenMP                     | 1,020,000  | 2.55X          | 5368X

(All rows up to OpenMP are single-threaded; the OpenMP row uses 8 CPUs, 32 threads.)
Background: Lattice Boltzmann on GPU
2-D Square Lid-Driven Cavity Problem

[Diagram: square cavity of side L with x–y axes; fluid inside; the top lid moves with velocity U.]

Flow is generated by continuously moving the top lid at a constant velocity.
Level 1

Time (ms): 520727.1   MGUPS: 5.034192
Remarks: Simply ported the CPU code to the GPU; the Node & Lattice structures are in GPU global memory.

/* CPU code */
for(y=0; y<(ny-2); y++){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

/* GPU code */
/* for(int y=0; y<(ny-2); y++){ */
if(tid < (ny-2)){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

Replace the outer-loop iterations with threads. Total threads = (ny-2); each thread works on (nx-2) grid points.
MGUPS = (GridSize × TimeIterations) / (Time × 1,000,000)
Level 2

Time (ms): 115742   MGUPS: 22.64899   Remarks: Loop collapsing

/* GPU code, Level 1 */
if(tid < (ny-2)){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

/* GPU code with loop collapsing */
if(tid < ((ny-2)*(nx-2))){
    y = (tid/(nx-2))+1;
    x = (tid%(nx-2))+1;
    ...
}

Collapsing the 2 nested loops into one exposes massive parallelism. Total threads = (ny-2)*(nx-2); now each thread works on 1 grid point.
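The tid → (x, y) mapping is pure integer arithmetic and can be checked on the host; the grid sizes below are made up for the check.

```c
#include <assert.h>

/* The Level-2 mapping from a flat thread id to an interior grid point:
   interior points are x in [1, nx-2] and y in [1, ny-2], so tid runs
   over (nx-2)*(ny-2) values. */
void tid_to_xy(int tid, int nx, int *x, int *y)
{
    *y = (tid / (nx - 2)) + 1;
    *x = (tid % (nx - 2)) + 1;
}
```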
About GPU Constant Memory
• Can be used for data that will not change over the course of kernel execution
• Declared with the __constant__ qualifier; cudaMemcpyToSymbol copies data to constant memory
• Constant memory is cached
• Constant memory is read-only
• Just 64 KB

[Diagram: Tesla C2075 – SM 1 … SM 14 above global memory and constant memory.]
Level 3

Time (ms): 113061.8   MGUPS: 23.186   Remarks: Copied the Lattice structure into GPU constant memory

__constant__ Lattice lattice_dev_const[1];
cudaMemcpyToSymbol(lattice_dev_const, lattice, sizeof(Lattice));

typedef struct Lattice {
    int Cs[9];
    int Lattice_velocities[9][2];
    real_dt Lattice_constants[9][4];
    real_dt ek_i[9][9];
    real_dt w_k[9];
    real_dt ac_i[9];
    real_dt gamma9[9];
} Lattice;
Level 4

Time (ms): 40044.5   MGUPS: 65.5   Remarks: Coalesced memory-access pattern for the Node structure

typedef struct Node {   /* AoS, (ny*nx) elements */
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;

[Diagram: AoS memory layout – Type, Vel[2], Density, F[9], Ftmp[9] for grid point 0, then the same fields for grid point 1, and so on.]
Level 4 (Cont.)

[Diagram: in the AoS layout, threads T0, T1, … simultaneously accessing Density hit addresses separated by a large stride.]
Level 4 (Cont.)

[Diagram: strided AoS access – inefficient access of global memory – versus the coalesced access pattern, where consecutive threads read consecutive Density values – efficient access of global memory.]
Level 4 (Cont.)

typedef struct Type    { int *val;     } Type;
typedef struct Vel     { real_dt *val; } Vel;
typedef struct Density { real_dt *val; } Density;
typedef struct F       { real_dt *val; } F;
typedef struct Ftmp    { real_dt *val; } Ftmp;

typedef struct Node_map {
    Type type;
    Vel vel[2];
    Density density;
    F f[9];
    Ftmp ftmp[9];
} Node_dev;

/* versus the original AoS layout: */
typedef struct Node {   /* AoS, (ny*nx) elements */
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;
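A host-side sketch of the AoS → SoA reshuffle; the struct is trimmed to two of the slide's five fields to keep the example short, and the helper name is ours, not from the original code.

```c
#include <assert.h>

typedef double real_dt;

/* Trimmed AoS node: two of the slide's fields. */
typedef struct NodeAoS { int Type; real_dt Density; } NodeAoS;

/* After this transform, density[i] and density[i+1] are adjacent in
   memory, so consecutive GPU threads reading Density would generate
   one coalesced transaction instead of a strided gather. */
void aos_to_soa(const NodeAoS *aos, int *type, real_dt *density, int n)
{
    for (int i = 0; i < n; i++) {
        type[i]    = aos[i].Type;      /* all Type values contiguous    */
        density[i] = aos[i].Density;   /* all Density values contiguous */
    }
}
```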
Level 5

Time (ms): 14492.6   MGUPS: 180.9   Remarks: Arithmetic optimizations

for(int k=3; k<SPEEDS; k++){
    //mk[k] = lattice_dev_const->gamma9[k]*mk[k];
    //mk[k] = lattice_dev_const->gamma9[k]*mk[k] / lattice_dev_const->w_k[k];
    mk[k] = lattice_dev_const->gamma9_div_wk[k]*mk[k];
}

for(int i=0; i<SPEEDS; i++){
    f_neq = 0.0;
    for(int k=0; k<SPEEDS; k++){
        //f_neq += (lattice_dev_const->ek_i[k][i]*mk[k]) / lattice_dev_const->w_k[k];
        f_neq += lattice_dev_const->ek_i[k][i]*mk[k];
    }
}
Level 6

Time (ms): 8309.66   MGUPS: 315.47   Remarks: Algorithmic optimization

[Diagram: the Collision kernel reads the node state and stores Ftmp to GPU global memory; after a global barrier, the Streaming kernel loads Ftmp and produces the new F.]

Collision stores Ftmp to GPU global memory. Streaming loads Ftmp from GPU global memory. Global-memory load/store operations are expensive.
Level 6 (Cont.)
[Diagram: Collision and Streaming across neighboring cells.]
Pulling Ftmp from neighbors needs synchronization.
Level 6 (Cont.)
[Diagram: Collision and Streaming across neighboring cells.]
Instead, push Ftmp to the neighbors – no need for synchronization.
Level 6 (Cont.)
Collision & streaming can be one kernel. This saves one load/store from/to global memory.
[Diagram: the fused kernel goes from the node state directly to the new F without a round trip of Ftmp through global memory.]
Optimizations Achieved on GPU using CUDA

Level | Time (ms) | MGUPS (Million Grid Updates Per Second) | Remarks
1     | 520727.1  | 5.03   | Simply ported CPU code to GPU; Node & Lattice structures in GPU global memory
2     | 115742.0  | 22.65  | Loop collapsing
3     | 113061.8  | 23.19  | Copied the Lattice structure into GPU constant memory
4     | 40044.5   | 65.5   | Coalesced memory-access pattern for the Node structure
5     | 14492.6   | 180.9  | Arithmetic optimizations
6     | 8309.7    | 315.5  | Algorithmic optimization

CUDA card: Tesla C2075 (448 cores, 14 SMs, Fermi, compute capability 2.0)
Recap
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
Closing Comments
• OLTP applications seldom require HPC technologies
  – Unless it is an application that needs to respond in microseconds
    • Algo trading etc.
• Can HPC technologies be used to speed up my data-transformation (ETL/ELT) and reporting workloads?
  – Sure, but you have to let go of the ease of using 3rd-party products & databases
    • If you don’t want to, customizing a specific bottleneck process could help
  – Stay tuned to companies innovating in this space
    • e.g. SQream – implements database operations on GPUs
• Investing in an HPC cluster and technologies is not enough
  – Also invest in people who understand
    • The underlying technologies
    • The applications
www.cmgindia.org
Q&A