TRANSCRIPT
Computer Measurement Group, India
www.cmgindia.org
HPC Tutorial
Manoj Nambiar, Performance Engineering Innovation Labs
Parallelization and Optimization CoE
A Common Expectation
“Our ERP application has slowed down. All the departments are complaining.”
“Let’s use HPC!”
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM – CFD application (if time permits)
Designing an Airplane for Performance …
Problem: Calculate the total lift and drag on the plane for a wind speed of 150 m/s.
Performance Assurance – Airplanes vs Software

Assurance Approach | Airplane            | Software
Testing            | Wind tunnel testing | Load testing with virtual users
Simulation         | CFD simulation      | Discrete event simulation
Analytical         | None                | MVA, BCMP, M/M/k etc.

(Accuracy and cost are highest for testing and lowest for analytical modeling.)
CFD Example – Problem Decomposition

Methodology
1. Partition the volume into cells
2. For a number of time steps:
   2.a For each cell:
      2.a.1 Calculate velocities
      2.a.2 Calculate pressure
      2.a.3 Calculate turbulence

All cells have to be in equilibrium with each other. This becomes a large Ax = b problem. The problem is partitioned into groups of cells, which are assigned to CPUs. Each CPU can compute in parallel, but the CPUs also have to communicate with each other.
A serial algorithm for Ax = b
Compute complexity: O(n²)
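The slide does not reproduce the algorithm itself; a minimal sketch of one such serial solver (Gauss-Seidel iteration, assuming a dense, row-major, diagonally dominant A, which is an assumption, not something the slide specifies) could look like this. Each sweep touches all n rows and all n columns, i.e. O(n²) work per iteration, matching the complexity above.

```c
#include <assert.h>
#include <math.h>

/* One Gauss-Seidel sweep over Ax = b, repeated `sweeps` times.
   A is n x n, row-major; the iteration converges for diagonally
   dominant matrices. Each sweep is O(n^2). */
void gauss_seidel(const double *A, const double *b, double *x,
                  int n, int sweeps)
{
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    sum -= A[i*n + j] * x[j];  /* x[j] is already updated for j < i */
            x[i] = sum / A[i*n + i];
        }
    }
}
```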
What kind of H/W and S/W do we need?
• Take the example Ax = b solver
  – The order of computational complexity is n²
  – where n is the number of cells into which the domain is divided
• The higher the number of cells, the higher the accuracy
• Typical number of cells: in the tens of millions
• Prohibitively expensive to run sequentially
• The increase in memory requirements will need a proportionally higher number of servers

A parallel implementation is needed on a large cluster of servers.
Software
• Let’s look at the software aspect first
  – Then we look at the hardware
Work Load Balancing
• After solving Ax = b
  – Some elements of x need to be exchanged with neighbor groups
  – Every group (process) has to send and receive values with its neighbors
    • For the next Gauss-Seidel iteration
• Also need to check that all values of x have converged

Should this use TCP/IP, or a 3-tier web/app/database architecture?
Why TCP/IP won’t suffice
• Philosophically – NO
  – These parallel programs are peers
  – No one process is client or server
• Technically – NO
  – There can be as many as 10,000 parallel processes
    • Need to keep a directory of public server IP and port for each process
  – TCP is a stream-oriented protocol
    • Applications need to pass messages
  – Changing the size of the cluster is tedious
Why a 3-tier application will not suffice
• 3-tier applications are meant to serve end-user transactions
  – This application is not transactional
• A database is not needed for these applications
  – No need to first persist and then read data
    • This kind of I/O will impact performance significantly
    • Better to store data in RAM
  – ACID properties of the database are not required
    • The application is not transactional in nature
  – SQL is a major overhead considering the data-velocity requirements
• Managed frameworks like J2EE and .NET are not optimal for such requirements
MPI to the rescue
• A message-oriented interface
• Has an API spanning some 300 functions
  – Supports complex messaging requirements
• A very simple interface for parallel programming
• Also portable regardless of the size of the deployment cluster
MPI Functions
• MPI_Send
• MPI_Recv
• MPI_Wait
• MPI_Reduce
  – SUM
  – MIN
  – MAX
  – …
Not-so-intuitive MPI calls
• MPI_Allgather(v)
• MPI_Scatter(v)
• MPI_Gather(v)
• MPI_Alltoall(v)
Sample MPI program – parallel addition of a large array
MPI – Send, Recv and Wait
If you have some computation to be done while waiting to receive a message from a peer, this is the place to do it.
Hardware
• Let’s look at the hardware
  – Clusters
  – Servers
  – Coprocessors
  – Parallel file system
HPC Cluster
Not very different from regular data center clusters
Now let’s look inside a server
Coprocessors go here
NUMA
Parallelism in Hardware
• Multi-server / multi-node
• Multi-socket
• Multi-core
• Coprocessors
  – Many-core
  – GPU
• Vector processing

Multi-socket server board
Multi-core CPU
Coprocessor – GPU
• SM – Streaming Multiprocessor
• Device RAM – high-speed GDDR5 RAM
• Extreme multi-threading – thousands of threads
PCIe card
Inside a GPU streaming multiprocessor (SM)
• An SM can be compared to a CPU core
• A GPU core is essentially an ALU
• All cores execute the same instruction at a time
  – What happens to “if-then-else”?
• A warp is the software equivalent of a CPU thread
  – Scheduled independently
  – A warp instruction is executed by all cores at a time
• Many warps can be scheduled on an SM
  – Just like many threads on a CPU
  – When one warp is scheduled to run, other warps are moving data
• A collection of warps concurrently running on an SM makes a block
  – Conversely, an SM can run only one block at a time

Efficiency is achieved when there is one warp in each stage of the execution pipeline.
How S/W runs on the GPU
1. A CPU process/thread initiates a data transfer from CPU memory to GPU memory
2. The CPU invokes a function (kernel) that runs on the GPU
   – The CPU specifies the number of blocks and threads per block
   – Each block is scheduled on one SM
   – After all blocks complete execution, the CPU is woken up
3. The CPU fetches the kernel output from GPU memory

This is known as the offload mode of execution.
Coprocessor – Many Integrated Core (MIC)
• Cores are the same as Intel Pentium CPUs
  – With vector-processing instructions
• The L2 cache is accessible by all the cores

Execution modes
• Native
• Offload
• Symmetric
What is vector processing?

[Diagram: an ALU in an ordinary CPU core takes scalar operands A and B and produces C – 1 arithmetic operation per instruction cycle (ADD C, A, B). An ALU in a CPU core with vector processing takes vector registers A1…A8 and B1…B8 and produces C1…C8 – 8 arithmetic operations per instruction cycle (VADD C, A, B).]

The vector instruction is the equivalent of:
for(i=0; i<8; i++) c[i] = a[i]+b[i];
HPC Networks – Bandwidth and Latency
Hierarchical network
• The most intuitive design of a network
  – Not uncommon in data centers
• What happens when the first 8 nodes need to communicate with the next 8?
  – Remember that all links have the same bandwidth

[Diagram: top-of-rack switches feeding an end-of-row switch.]
Clos Network
• Can be likened to a replicated hierarchical network
  – All nodes can talk to all other nodes
  – Dynamic routing capability is essential in the switches
Common HPC Network Technology – InfiniBand
• Technology used for building high-throughput, low-latency networks
  – Competes with Ethernet
• To use InfiniBand you need
  – A separate NIC on the server
  – An InfiniBand switch
  – An InfiniBand cable
• Messaging supported in InfiniBand:
  – a direct memory access (RDMA) read from, or write to, a remote node
  – a channel send or receive
  – a transaction-based operation (that can be reversed)
  – a multicast transmission
  – an atomic operation
Parallel File Systems – Lustre
• Parallel file systems give the same file-system interface to legacy applications
• Can be built out of commodity hardware and storage
HPC Applications – Modeling and Simulation
• Aerodynamics
  – Vehicular design
• Energy and resources
  – Seismic analysis
  – Geophysics
  – Mining
• Molecular dynamics
  – Drug discovery
  – Structural biology
• Weather forecasting

[Diagram: simulation vs physical experimentation in the design cycle – prototype, lab verification, final design – and the HPC-or-no-HPC trade-offs: accuracy, speed, power, cost. From natural science to software.]
Relatively Newer & Upcoming Applications
• Finance
  – Risk computations
  – Options pricing
  – Fraud detection
  – Low-latency trading
• Image processing
  – Medical imaging
  – Image analysis
  – Enhancement and restoration
• Bio-informatics
  – Genomics
• Video analytics
  – Face detection
  – Surveillance
• Internet of Things
  – Smart city
  – Smart water
  – eHealth

Knowledge of core algorithms is key.
Technology Trends Impacting Performance & Availability
• Multi-core; clock speeds not increasing
• Memory evolution
  – Lower memory per core
  – Relatively low memory bandwidth
  – Deep cache & memory hierarchies
• Heterogeneous computing
  – Coprocessors
• Vector processing

Availability concerns:
• Temperature-fluctuation-induced slowdowns
• Memory-error-induced slowdowns
• Network communication errors
• Large clusters – increased failure probability

Algorithms need to be re-engineered to make the best use of these trends.
Knowing Performance Bounds
• Amdahl’s Law
  – Maximum speedup achievable: S(p) = 1 / (s + (1-s)/p)
  – where s is the fraction of the code that has to run sequentially and p is the number of processors
• Also important to take problem size into account when estimating speedups
  – The compute-to-communication ratio is key
  – Typically, the higher the problem size, the higher the ratio, and the better the speedup
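The law above can be made concrete with a small helper; the numbers in the note below are plain consequences of the formula, not measurements.

```c
#include <assert.h>

/* Amdahl's law as stated on the slide: maximum speedup
   S(p) = 1 / (s + (1-s)/p), where s is the serial fraction
   and p the number of processors. */
double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / (double)p);
}
```

With s = 0.1 the speedup on 8 processors is only about 4.7x, and no processor count can push it past 1/s = 10x; this is why the serial fraction, and the problem size that shrinks it, matters so much.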
Quick Hardware Recap
FLOPS Bound
Bandwidth Bound
What about server clusters?
FLOPS and Bandwidth dependencies
• FLOPS – floating-point operations per second – depends on:
  – Frequency
  – Number of CPU sockets
  – Number of cores per socket
  – Number of hyper-threads per core
  – Number of vector units per core / hyper-thread
• Bandwidth (bytes/sec) depends on:
  – Level in the hierarchy – registers, L1, L2, L3, DRAM
  – Serial / parallel access
  – Whether memory is attached to the same CPU socket or another CPU

Why are we not talking about memory latencies?
Know your performance bounds
• The above information can also be obtained from product data sheets
• What do you gain by knowing performance bounds?
Other ways to gauge performance
• CPU speed
  – SPEC integer and floating-point benchmarks
• Memory bandwidth
  – STREAM benchmark
Basic Problem
• Consider the following code:
  double a[N], b[N], c[N], d[N];
  int i;
  for (i = 0; i < N; i++)
      a[i] = b[i] + c[i]*d[i];
• If N = 10¹² and the code has to complete in 1 second:
  – How many Xeon E5-2670 CPU sockets would you need?
  – Is this memory bound or CPU bound?
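A back-of-envelope answer, using publicly quoted data-sheet figures for the E5-2670 (about 51.2 GB/s of memory bandwidth and roughly 166 DP GFLOP/s per socket; both figures are assumptions taken from data sheets, not measurements):

```c
#include <assert.h>

/* Sizing a[i] = b[i] + c[i]*d[i] for N = 1e12 elements in 1 second.
   Per element: read b, c, d and write a (4 x 8 bytes), and do 2 FLOPs. */
double sockets_for_bandwidth(void)
{
    double bytes_per_sec = 1e12 * 4.0 * 8.0;   /* 32 TB/s of memory traffic */
    return bytes_per_sec / 51.2e9;             /* / per-socket bandwidth    */
}

double sockets_for_flops(void)
{
    double flops_per_sec = 1e12 * 2.0;         /* one multiply + one add    */
    return flops_per_sec / 166.4e9;            /* / per-socket peak DP rate */
}
```

Roughly 625 sockets to supply the bandwidth versus about 12 to supply the FLOPs: under these assumptions the loop is firmly memory bound.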
General guiding principles for performance optimization
• Minimize communication requirements between parallel processes / threads
• If communication is essential:
  – Hide communication delays by overlapping compute and communication
• Maximize data locality
  – Helps caching
  – Good NUMA page placement
• Do not forget to use compiler optimization flags
• Implement weighted decomposition of the workload
  – In a cluster with heterogeneous compute capabilities
Let your profiling results guide you on the next steps
Optimization Guidelines for GPU platforms
• Minimize use of “if-then-else” or any other branching
  – Branches cause divergence
• Tune the number of threads per block
  – Too many will exhaust caches and registers in the SM
  – Too few will under-utilize GPU capacity
• Use constant memory for constants
• Use shared memory for frequently accessed data
• Use sequential memory access instead of strided
• Coalesce memory accesses
• Use streams to overlap compute and communication
Steps in designing parallel programs
• Partitioning
• Communication
• Agglomeration
• Mapping

[Diagram: a data structure partitioned into primitive tasks.]
Steps in designing parallel programs (cont.) – Agglomeration
• Combine sender and receiver
  – Eliminates communication
  – Increases locality
• Combine senders and receivers
  – Reduces the number of message transmissions
Steps in designing parallel programs (cont.) – Mapping
[Diagram: agglomerated tasks mapped to NODE 1, NODE 2, NODE 3.]
Agenda
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
  – OpenFOAM – CFD application on Xeon Phi (if time permits)
Multi-core Performance Enhancement: Case Study
Background
• Risk management in a commodities exchange
• Risk computed post-trade
  – Clearing and settlement – T+2
• Risk details updated on screen
  – Alerting is controlled by human operators
Commodities Exchange: Online Risk Management

[Diagram: online trades flow from the Trading System to the Risk Management System, which raises alerts when collateral falls short and can prevent a client or clearing member from trading. Inputs: initial deposit of collateral, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day. A clearing member serves Client1 … ClientK.]
Will a standard architecture on commodity servers suffice?
[Diagram: an application server (2 CPUs) and a database server (2 CPUs) as the candidate Risk Management System.]
Commodities Exchange: Online Risk Management
Computations:
• Position monitoring, mark to market, P&L, open interest, exposure margins
• SPAN: initial margin (scanning risk), inter-commodity spread charge, inter-month spread charge, short option margin, net option value
• Collateral management

The functionality is complex. Let’s look at a simpler problem that reflects the same computational challenge, and come back to it later.
Workload Requirements
• Trades/day: 10 million
• Peak trades/sec: 300
• Traders: 1 million
P&L Computation

Trader A:
Time | Txn  | Stock | Quantity | Price | Total Amount
t1   | BUY  | Cisco | 100      | 950   | 95,000
t2   | BUY  | IBM   | 200      | 30    | 6,000
t3   | SELL | Cisco | 40       | 975   | 39,000
t4   | SELL | IBM   | 200      | 31    | 6,200

With the current Cisco price at 970:
Profit(Cisco, t4) = -95,000 + 39,000 + (100-40)*970 = -56,000 + 58,200 = 2,200

In general, the profit on a given stock S at time t
  = sum of txn values up to time t + (net position on the stock at time t) * price of the stock at time t
Buy txns take a negative value, sell txns a positive value.

Biggest culprit
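The worked example above can be checked mechanically; the values come straight from the table, and nothing new is assumed.

```c
#include <assert.h>

/* Trader A's Cisco P&L at t4, with the current Cisco price at 970.
   Buys contribute negative transaction value, sells positive; the
   remaining net position is valued at the current price. */
int profit_cisco_example(void)
{
    int sum_txn_value = -100*950 + 40*975;    /* BUY 100 @ 950, SELL 40 @ 975 */
    int net_position  = 100 - 40;             /* shares still held            */
    return sum_txn_value + net_position*970;  /* -95000 + 39000 + 58200       */
}
```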
P&L Computation

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];     // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];      // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop
P&L Computational Analysis
• Profit has to be kept updated for every price change
  – For all traders
• Inner loop: 8 computations
  – 4 computations (+ + * +)
  – Loop counter
  – 3 assignments
• Actual computational complexity
  – 20 times as complex as the displayed algorithm
• Number of traders: 1 million
P&L Computational Analysis
• SLA expectation: 300 trades/sec
• Computations/trade
  – 8 computations × 1 million traders × 20 = 160 million
• Computations/sec = 160 million × 300 trades/sec
  – 48 billion computations/sec!
• Out of reach of contemporary servers at that time!

Can we deliver within an IT budget?
Test Environment
• Server: 8 Xeon 5560 cores, 2.8 GHz, 8 GB RAM
• OS: CentOS 5.3, Linux kernel 2.6.18
• Programming language: C
• Compilers: gcc and icc
Test Inputs
Number of trades:  1 million
Number of traders: 100,000
Number of stocks:  100
Trade file size:   20 MB

Trade distribution:
Trades % | Stock %
20%      | 30%
20%      | 60%
60%      | 10%
P&L Computation: Baselining

Optimization    | Trades/sec | Overall Gain
Baseline (gcc)  | 190        | –
gcc –O3         | 323        | 70%
P&L Computation: Transpose

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];     // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];      // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] + netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop

[Diagram: trader × stock matrix (rows r1, r2, r3; columns s1 … si). A trade t touches one stock column across all trader rows – very poor caching.]
Matrix Layout

[Diagram: trader × stock matrix (rows r1, r2, r3; columns s1 … si). The memory layout is trader-major: all of trader r1’s stocks (S1, S2 … Si), then trader r2’s, and so on.]
Matrix Layout – Optimized

[Diagram: transposed, stock-major memory layout: all traders’ entries for stock S1 (r1, r2 … rn), then stock S2, and so on.]
P&L Computation: Transpose

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];      // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

[Diagram: stock × trader matrix (rows s1 … si; columns r1 … ri). A trade t now touches one contiguous stock row – very good caching.]
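The effect of the transpose can be illustrated with a toy pair of layouts (the sizes are made up; the point is the access stride, not the numbers):

```c
#include <assert.h>

/* In the original trader-major layout, visiting one stock across all
   traders jumps NSTOCKS ints per step; in the stock-major layout the
   same visit walks consecutive addresses, which caches and prefetchers
   handle far better. Both layouts hold identical data. */
#define NTRADERS 4
#define NSTOCKS  3

int trader_major[NTRADERS][NSTOCKS];   /* original layout   */
int stock_major[NSTOCKS][NTRADERS];    /* transposed layout */

long sum_stock_trader_major(int stock)   /* strided walk */
{
    long s = 0;
    for (int r = 0; r < NTRADERS; r++)
        s += trader_major[r][stock];
    return s;
}

long sum_stock_stock_major(int stock)    /* contiguous walk */
{
    long s = 0;
    for (int r = 0; r < NTRADERS; r++)
        s += stock_major[stock][r];
    return s;
}
```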
P&L Computation: Transpose

Optimization               | Trades/sec | Overall Gain | Immediate Gain
Baseline (gcc)             | 190        | –            | –
gcc –O3                    | 323        | 1.7X         | 1.7X
Transpose of trader/stock  | 4750       | 25X          | 14.7X

Intel compiler:
Optimization               | Trades/sec | Overall Gain | Immediate Gain
icc –fast (not –O3)        | 6850       | 36X          | 37%
P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];      // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];   // can be maintained cumulatively per trader – need not be per stock

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] + netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop
P&L Computation: Use of Partial Sums

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXTRADERS];                 // net transaction values
int sumposvalue[MAXTRADERS];                 // sum of netpositions * stock price (monetary value of all stock positions at the time of trade)
int ltp[MAXSTOCKS];                          // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

Optimization        | Trades/sec | Overall Gain | Immediate Gain
Use of partial sums | 9650       | 50X          | 41%
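The identity behind this optimization, profit = sumtxnvalue + sumposvalue with sumposvalue kept equal to netpositions × ltp, can be verified on the single-trader Cisco example from earlier. A zero-quantity "trade" stands in for the final price update, and the update ordering is an equivalent rearrangement of the slide's, not a copy of it.

```c
#include <assert.h>

/* Single-trader, single-stock version of the partial-sum scheme.
   qty > 0 is a buy, qty < 0 a sell; qty == 0 just moves the price. */
typedef struct { int qty; int price; } Trade;

int profit_partial_sums(const Trade *trades, int n)
{
    int sumtxn = 0, sumpos = 0, netpos = 0, ltp = 0;
    for (int i = 0; i < n; i++) {
        int q = trades[i].qty, p = trades[i].price;
        sumtxn -= q * p;               /* buys negative, sells positive   */
        sumpos += netpos * (p - ltp);  /* revalue old position at price p */
        netpos += q;
        sumpos += q * p;               /* new shares valued at p          */
        ltp = p;                       /* invariant: sumpos == netpos*ltp */
    }
    return sumtxn + sumpos;
}
```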
P&L Computation: Skip Zero Values

int netpositions[MAXSTOCKS][MAXTRADERS];   // the majority of these values are 0, thanks to hot stocks

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop

Optimization     | Trades/sec | Overall Gain | Immediate Gain
Skip zero values | 10800      | 56X          | 12%
P&L Computation: Cold Stocks
• There is a large percentage of cold stocks
  – Those which are held by very few traders
• In the last optimization, an “if” check was added to avoid computation
  – If the trader does not hold the traded stock
• Is there any benefit if the trader record is not accessed at all?
  – We are computing for 100,000 traders
P&L Computation: Sparse Matrix Representation

Flags table – which traders own this stock? (updated in the outer loop)
Stock | A | B | C | D | E
s1    | 1 | 1 | 0 | 0 | 0
s2    | 1 | 1 | 1 | 0 | 0
s3    | 1 | 0 | 0 | 1 | 1

Trader indexes per stock (traversed in the outer loop)
Stock | Count | T0 | T1 | T2 | …
s1    | 2     | A  | B  | 0  | 0
s2    | 3     | A  | C  | B  | 0
s3    | 3     | A  | E  | D  | 0
P&L Computation: Sparse Matrix Representation

int profit[MAXTRADERS];                      // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];     // net positions per stock
int sumtxnvalue[MAXTRADERS];                 // net transaction values
int sumposvalue[MAXTRADERS];                 // sum of netpositions * stock price
int ltp[MAXSTOCKS];                          // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r    // traverse the per-stock trader list when the trader count for the stock is below a threshold
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

Optimization  | Trades/sec | Overall Gain | Immediate Gain
Sparse matrix | 36000      | 189X         | 3.24X
P&L Computation: Clustering

Separate arrays – poor caching for sparse-matrix lists:
int profit[MAXTRADERS];
int sumtxnvalue[MAXTRADERS];
int sumposvalue[MAXTRADERS];

Clustered into one record – better caching performance!
struct TraderRecord {
    int profit;
    int sumtxnvalue;
    int sumposvalue;
};

Optimization | Trades/sec | Overall Gain | Immediate Gain
Clustering   | 70000      | 368X         | 94%
P&L Computation: Precompute Price Difference

int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);   // (t.price - ltp[t.stock]) is loop-invariant: move it outside the loop
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization          | Trades/sec | Overall Gain | Immediate Gain
Precompute price diff | 75000      | 394X         | 7%
P&L Computation: Loop Unrolling

#pragma unroll
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization   | Trades/sec | Overall Gain | Immediate Gain
Loop unrolling | 80000      | 421X         | 7%
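What the unroll pragma asks the compiler to do can be written out by hand on a simpler loop; this is illustrative only (the production loop also carries the zero check):

```c
#include <assert.h>

/* 4-way manual unroll of a reduction: one loop-bound test per 4 elements
   instead of per element, and 4 independent accumulators the CPU can
   execute in parallel. Assumes n is a multiple of 4, for brevity. */
long sum_unrolled(const int *v, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i+1];
        s2 += v[i+2];
        s3 += v[i+3];
    }
    return s0 + s1 + s2 + s3;
}
```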
Commodities Exchange: Online Risk Management

[Diagram, revisited: online trades flow from the Trading System to the Risk Management System, which raises alerts when collateral falls short and can prevent a client or clearing member from trading. Inputs: initial deposit of collateral, long/short positions on contracts, contract/commodity price changes, and risk parameters that change during the day.]
P&L Computation: Batching of Trades

Batch n trades and use the ltp of the last trade.   // increases risk by a small delay

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization            | Trades/sec | Overall Gain | Immediate Gain
Batching of 100 trades  | 150000     | 789X         | 1.88X
Batching of 1000 trades | 400000     | 2105X        | 2.67X

So far, all of this is with only one thread!
P&L Computation: Use of Parallel Processing

#pragma omp parallel for (chunked schedule; 32 threads on an 8-core Intel server)
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

Optimization | Trades/sec  | Overall Gain | Immediate Gain
OpenMP       | 1.2 million | 5368X        | 2.55X
P&L Computation: Summary of Optimizations

Optimization               | Trades/sec | Immediate Gain | Overall Gain
Baseline (gcc)             | 190        | –              | –
gcc –O3                    | 320        | 1.70X          | 1.7X
Transpose of trader/stock  | 4750       | 14.70X         | 25X
Intel compiler (icc –fast) | 6850       | 1.37X          | 36X
Use of partial sums        | 9650       | 1.41X          | 50X
Skip zero values           | 10,800     | 1.12X          | 56X
Sparse matrix              | 36,000     | 3.24X          | 189X
Clustering of arrays       | 70,000     | 1.94X          | 368X
Precompute price diff      | 75,000     | 1.07X          | 394X
Loop unrolling             | 80,000     | 1.07X          | 421X
Batching of 100 trades     | 150,000    | 1.88X          | 789X
Batching of 1000 trades    | 400,000    | 2.67X          | 2105X
OpenMP                     | 1,020,000  | 2.55X          | 5368X

(All rows up to OpenMP are single-threaded; the OpenMP row uses 8 CPUs, 32 threads.)
Background: Lattice Boltzmann on GPU
2-D Square Lid-Driven Cavity Problem

[Diagram: square cavity of side L with x–y axes; fluid inside; the top lid moves with velocity U.]

Flow is generated by continuously moving the top lid at a constant velocity.
Level 1

Time (ms): 520727.1   MGUPS: 5.034192
Remarks: Simply ported the CPU code to the GPU; the Node & Lattice structures are in GPU global memory.

/* CPU code */
for(y=0; y<(ny-2); y++){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

/* GPU code */
/* for(int y=0; y<(ny-2); y++){ */
if(tid < (ny-2)){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

Replace the outer-loop iterations with threads. Total threads = (ny-2); each thread works on (nx-2) grid points.
MGUPS = (GridSize × TimeIterations) / (Time × 1,000,000)
Level 2

Time (ms): 115742   MGUPS: 22.64899   Remarks: Loop collapsing

/* GPU code, Level 1 */
if(tid < (ny-2)){
    for(x=0; x<(nx-2); x++){
        ...
    }
}

/* GPU code with loop collapsing */
if(tid < ((ny-2)*(nx-2))){
    y = (tid/(nx-2))+1;
    x = (tid%(nx-2))+1;
    ...
}

Collapsing the 2 nested loops into one exposes massive parallelism. Total threads = (ny-2)*(nx-2); now each thread works on 1 grid point.
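The tid → (x, y) mapping is pure integer arithmetic and can be checked on the host; the grid sizes below are made up for the check.

```c
#include <assert.h>

/* The Level-2 mapping from a flat thread id to an interior grid point:
   interior points are x in [1, nx-2] and y in [1, ny-2], so tid runs
   over (nx-2)*(ny-2) values. */
void tid_to_xy(int tid, int nx, int *x, int *y)
{
    *y = (tid / (nx - 2)) + 1;
    *x = (tid % (nx - 2)) + 1;
}
```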
About GPU Constant Memory
• Can be used for data that will not change over the course of kernel execution
• Declared with the __constant__ qualifier; cudaMemcpyToSymbol copies data to constant memory
• Constant memory is cached
• Constant memory is read-only
• Just 64 KB

[Diagram: Tesla C2075 – SM 1 … SM 14 above global memory and constant memory.]
Level 3

Time (ms): 113061.8   MGUPS: 23.186   Remarks: Copied the Lattice structure into GPU constant memory

__constant__ Lattice lattice_dev_const[1];
cudaMemcpyToSymbol(lattice_dev_const, lattice, sizeof(Lattice));

typedef struct Lattice {
    int Cs[9];
    int Lattice_velocities[9][2];
    real_dt Lattice_constants[9][4];
    real_dt ek_i[9][9];
    real_dt w_k[9];
    real_dt ac_i[9];
    real_dt gamma9[9];
} Lattice;
Level 4

Time (ms): 40044.5   MGUPS: 65.5   Remarks: Coalesced memory-access pattern for the Node structure

typedef struct Node {   /* AoS, (ny*nx) elements */
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;

[Diagram: AoS memory layout – Type, Vel[2], Density, F[9], Ftmp[9] for grid point 0, then the same fields for grid point 1, and so on.]
Level 4 (Cont.)

[Diagram: in the AoS layout, threads T0, T1, … simultaneously accessing Density hit addresses separated by a large stride.]
Level 4 (Cont.)

[Diagram: strided AoS access – inefficient access of global memory – versus the coalesced access pattern, where consecutive threads read consecutive Density values – efficient access of global memory.]
Level 4 (Cont.)

typedef struct Type    { int *val;     } Type;
typedef struct Vel     { real_dt *val; } Vel;
typedef struct Density { real_dt *val; } Density;
typedef struct F       { real_dt *val; } F;
typedef struct Ftmp    { real_dt *val; } Ftmp;

typedef struct Node_map {
    Type type;
    Vel vel[2];
    Density density;
    F f[9];
    Ftmp ftmp[9];
} Node_dev;

/* versus the original AoS layout: */
typedef struct Node {   /* AoS, (ny*nx) elements */
    int Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
} Node;
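A host-side sketch of the AoS → SoA reshuffle; the struct is trimmed to two of the slide's five fields to keep the example short, and the helper name is ours, not from the original code.

```c
#include <assert.h>

typedef double real_dt;

/* Trimmed AoS node: two of the slide's fields. */
typedef struct NodeAoS { int Type; real_dt Density; } NodeAoS;

/* After this transform, density[i] and density[i+1] are adjacent in
   memory, so consecutive GPU threads reading Density would generate
   one coalesced transaction instead of a strided gather. */
void aos_to_soa(const NodeAoS *aos, int *type, real_dt *density, int n)
{
    for (int i = 0; i < n; i++) {
        type[i]    = aos[i].Type;      /* all Type values contiguous    */
        density[i] = aos[i].Density;   /* all Density values contiguous */
    }
}
```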
Level 5

Time (ms): 14492.6   MGUPS: 180.9   Remarks: Arithmetic optimizations

for(int k=3; k<SPEEDS; k++){
    //mk[k] = lattice_dev_const->gamma9[k]*mk[k];
    //mk[k] = lattice_dev_const->gamma9[k]*mk[k] / lattice_dev_const->w_k[k];
    mk[k] = lattice_dev_const->gamma9_div_wk[k]*mk[k];
}

for(int i=0; i<SPEEDS; i++){
    f_neq = 0.0;
    for(int k=0; k<SPEEDS; k++){
        //f_neq += (lattice_dev_const->ek_i[k][i]*mk[k]) / lattice_dev_const->w_k[k];
        f_neq += lattice_dev_const->ek_i[k][i]*mk[k];
    }
}
Level 6

Time (ms): 8309.66   MGUPS: 315.47   Remarks: Algorithmic optimization

[Diagram: the Collision kernel reads the node state and stores Ftmp to GPU global memory; after a global barrier, the Streaming kernel loads Ftmp and produces the new F.]

Collision stores Ftmp to GPU global memory. Streaming loads Ftmp from GPU global memory. Global-memory load/store operations are expensive.
Level 6 (Cont.)
[Diagram: Collision and Streaming across neighboring cells.]
Pulling Ftmp from neighbors needs synchronization.
Level 6 (Cont.)
[Diagram: Collision and Streaming across neighboring cells.]
Instead, push Ftmp to the neighbors – no need for synchronization.
Level 6 (Cont.)
Collision & streaming can be one kernel. This saves one load/store from/to global memory.
[Diagram: the fused kernel goes from the node state directly to the new F without a round trip of Ftmp through global memory.]
Optimizations Achieved on GPU using CUDA

Level | Time (ms) | MGUPS (Million Grid Updates Per Second) | Remarks
1     | 520727.1  | 5.03   | Simply ported CPU code to GPU; Node & Lattice structures in GPU global memory
2     | 115742.0  | 22.65  | Loop collapsing
3     | 113061.8  | 23.19  | Copied the Lattice structure into GPU constant memory
4     | 40044.5   | 65.5   | Coalesced memory-access pattern for the Node structure
5     | 14492.6   | 180.9  | Arithmetic optimizations
6     | 8309.7    | 315.5  | Algorithmic optimization

CUDA card: Tesla C2075 (448 cores, 14 SMs, Fermi, compute capability 2.0)
Recap
• Part I
  – A sample domain problem
  – Hardware & Software
• Part II – Performance Optimization Case Studies
  – Online Risk Management
  – Lattice Boltzmann implementation
Closing Comments
• OLTP applications seldom require HPC technologies
  – Unless it is an application that needs to respond in microseconds
    • Algo trading etc.
• Can HPC technologies be used to speed up my data-transformation (ETL/ELT) and reporting workloads?
  – Sure, but you have to let go of the ease of using 3rd-party products & databases
    • If you don’t want to, customizing a specific bottleneck process could help
  – Stay tuned to companies innovating in this space
    • e.g. SQream – implements database operations on GPUs
• Investing in an HPC cluster and technologies is not enough
  – Also invest in people who understand
    • The underlying technologies
    • The applications
www.cmgindia.org
Q&A