Introduction to Parallel and High-Performance Computing (with Machine-Learning applications)

Anshul Gupta, IBM Research

Prabhanjan (Anju) Kambadur, Bloomberg L.P.

TRANSCRIPT

Page 1: (source: auai.org/uai2016/tutorials_pres/parallel.pdf)

Introduction to Parallel and High-Performance Computing

(with Machine-Learning applications)

Anshul Gupta, IBM Research

Prabhanjan (Anju) Kambadur, Bloomberg L.P.

Page 2:

Introduction to Parallel and High-Performance Computing (with Machine-Learning applications)

Part 1

Parallel computing basics and parallel algorithm analysis

Part 2

Parallel algorithms for building a classifier

Pages 3–8: Part 1: Parallel and High-Performance Computing

• Why parallel computing?

• Parallel computing platforms

• Parallel algorithm basics

• Decomposition for parallelism

• Parallel programming models

• Parallel algorithm analysis

Page 9:

Why parallel computing?

[Figure: microprocessor trends, 1972–2015]

Source: Kirk M. Bresniker, Sharad Singhal, R. Stanley Williams, "Adapting to Thrive in a New Economy of Memory Abundance", Computer, vol. 48, no. 12, pp. 44-53, Dec. 2015, doi:10.1109/MC.2015.368

Page 10:

Parallel computing platforms

Highest level of parallelism:

• Compute nodes on an interconnection network

• Possibly, thousands of nodes

• Distributed memory

• Distributed or shared address space

• Scalability analysis crucial

Page 11:

Parallel computing platforms (cont.)

A generalized compute node

• (Possibly) multiple CPUs

• (Possibly) multiple GPUs

Page 12: Parallel computing platforms (cont.) [figure only]

Page 13:

Parallel computing platforms (cont.)

Shared memory on a node, but Non-Uniform Memory Access (NUMA).

Page 14:

Parallel computing platforms (cont.)

Nvidia GeForce GTX 680 (Kepler)

• 4 GPCs (graphics processing clusters)

• 8 SMXs (streaming multiprocessors)

• 192 × 8 = 1536 CUDA cores

Page 15:

Parallel hardware hierarchy

Node ensemble
  Nodes
    CPUs
      Cores
    GPUs
      GPCs
        SMXs
          CUDA cores

Page 16:

Parallel program hierarchy

Process groups
  Processes
    Thread pools
      Threads
    GPU thread grids
      GPU thread blocks
        GPU threads

Pages 17–20: Parallel programming paradigms

Distributed Memory

• Multiple processes

• Distributed address space

• Explicit data movement

• Locality!

Shared Memory

• Multiple threads

• Shared address space

• Implicit data movement

• Locality!

Accelerator

• Host memory ↔ device memory over PCIe

• Explicit data movement

• Locality!
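The three paradigms can be contrasted on a toy reduction. This is an illustrative sketch only: the function names are ours, and Python threads and processes stand in for what real HPC codes would do with OpenMP and MPI.

```python
# Shared memory via threads (implicit data movement) vs. distributed
# memory via processes (explicit message passing through a queue).
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Process, Queue

def shared_sum(data, workers=4):
    # Shared memory: every thread reads `data` in place; nothing is copied.
    step = max(1, len(data) // workers)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

def _worker(chunk, q):
    # Distributed memory: this process owns only its chunk...
    q.put(sum(chunk))          # ...and sends its result explicitly.

def distributed_sum(data, workers=4):
    step = max(1, len(data) // workers)
    q = Queue()
    procs = [Process(target=_worker, args=(data[i:i + step], q))
             for i in range(0, len(data), step)]
    for p in procs:
        p.start()
    total = sum(q.get() for _ in procs)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    xs = list(range(1000))
    assert shared_sum(xs) == distributed_sum(xs) == sum(xs)
```

Note how locality matters in both cases: each chunk is touched by exactly one agent, and only the scalar partial sums cross agent boundaries.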

Pages 21–25: Algorithm design

• Algorithm design is critical to devising a computational solution

• A serial algorithm is a recipe: a sequence of basic steps or operations

• A parallel algorithm is a recipe for solving the given problem using an ensemble of hardware resources

• Specifying a parallel algorithm involves much more than specifying a sequence of basic steps

Pages 26–31: Parallel algorithm design steps

• Identifying portions of work that can be performed concurrently – Decomposition

• Mapping concurrent pieces of work onto computing agents running in parallel

• Making the input, output, and intermediate data available to the right computing agent at the right time – Data Dependencies

• Managing simultaneous requests for shared data

• Synchronizing computing agents for correct program execution – Task Dependencies
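As a minimal illustration, the steps above can be annotated on a trivial parallel reduction. The task (summing squares) and all names are ours, not the tutorial's.

```python
# Each numbered comment marks one of the design steps listed above.
from concurrent.futures import ThreadPoolExecutor

def partial_sum_sq(chunk):
    return sum(x * x for x in chunk)

def sum_of_squares(xs, workers=4):
    # 1. Decomposition: identify independent portions of work.
    step = max(1, len(xs) // workers)
    chunks = [xs[i:i + step] for i in range(0, len(xs), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # 2. Mapping: the pool assigns chunks to worker threads.
        # 3. Data availability: each task receives exactly the chunk it needs.
        futures = [pool.submit(partial_sum_sq, c) for c in chunks]
        # 4./5. Shared data and synchronization: result() blocks until each
        # task finishes, so the final reduction sees complete inputs.
        return sum(f.result() for f in futures)
```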

Pages 32–38: Decomposition for concurrency

Task Decomposition

• Concurrent tasks are identified and mapped onto threads or processes

• Tasks share or exchange data as needed

• May be static or dynamic

Data Decomposition

• Data is partitioned (input, output, or intermediate)

• Partitions are assigned to computing agents

• "Owner computes" rule

• Usually static
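A minimal sketch of static data decomposition with the "owner computes" rule. The contiguous-block partitioning scheme is one common choice, assumed here for illustration; the function names are ours.

```python
def partition(n, p):
    """Statically assign indices 0..n-1 to p owners in contiguous blocks."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for r in range(p):
        size = base + (1 if r < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def owner_computes(x, p):
    # Each agent r updates only the output entries it owns.
    y = [0] * len(x)
    for r, (lo, hi) in enumerate(partition(len(x), p)):
        for i in range(lo, hi):
            y[i] = 2 * x[i]   # owner r computes its portion of y
    return y
```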

Page 39:

Decomposition for concurrency

Data Decomposition + Task Decomposition = Hybrid Decomposition

(Example: sparse matrix factorization)

Page 40:

Task decomposition example

Chess Program

• Each task evaluates all moves of a single piece (branch-and-bound)

• Small data (board position) can be replicated

• Dynamic load balancing required

[Figure: task tree with one task per piece: R1, B1, P1, K1, Q, P2, …]

Page 41:

Data decomposition example

Dense Matrix-Vector Multiplication

[Figure: y = A·x with the rows of A partitioned among agents P1 through P9]
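The row-wise decomposition in this example can be sketched in plain Python. Threads stand in for the agents P1…P9, and the function name is ours; a real implementation would use MPI ranks or BLAS kernels.

```python
# 1-D (row-block) decomposition of y = A·x: each task computes the
# output rows its block owns, while x is replicated across tasks.
from concurrent.futures import ThreadPoolExecutor

def matvec_1d(A, x, p=3):
    n = len(A)
    step = max(1, -(-n // p))   # ceil(n / p) rows per block
    def block(lo):
        # Owner computes: this task produces rows lo .. lo+step-1 of y.
        return [(lo + i, sum(a * b for a, b in zip(row, x)))
                for i, row in enumerate(A[lo:lo + step])]
    y = [0] * n
    with ThreadPoolExecutor(max_workers=p) as pool:
        for part in pool.map(block, range(0, n, step)):
            for i, v in part:
                y[i] = v
    return y
```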

Pages 42–47: Parallel application design guidelines

• Focus on one level of the hierarchy at a time – from top to bottom

• Devise the best decomposition strategy at the given level

• Computing agents are likely to be parallel themselves

• Minimize interactions, synchronization, and data movement among computing agents

• Minimize load imbalance and idling among computing agents

Pages 48–54: Parallel algorithm analysis

• Serial run time, T_S: time required by the best known method on a single computing agent

• Problem size, W: total amount of work, with T_S = k·W

• Parallel run time, T_P: time elapsed between the start of the computation and the moment the last of the p computing agents finishes

• Overhead, T_O: sum of all wasted compute resources, T_O = p·T_P − T_S

• Speedup, S: ratio of serial to parallel time, S = T_S/T_P = p·T_S/(T_S + T_O)

• Efficiency, E: fraction of overall time spent doing useful work, E = S/p = T_S/(p·T_P) = T_S/(T_S + T_O)
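These definitions translate directly into code. A sketch, with our own function names:

```python
# Metrics for a run on p agents, given serial time T_S and parallel time T_P.
def overhead(p, T_P, T_S):
    return p * T_P - T_S          # T_O = p·T_P − T_S

def speedup(T_S, T_P):
    return T_S / T_P              # S = T_S / T_P

def efficiency(p, T_P, T_S):
    return T_S / (p * T_P)        # E = S/p, equals T_S/(T_S + T_O)
```

For example, a run that takes 100 s serially and 30 s on 4 agents has speedup 100/30 ≈ 3.33, efficiency ≈ 0.83, and overhead 4·30 − 100 = 20 s of wasted agent-time.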

Pages 55–56: Parallel algorithm analysis [figures only]

Pages 57–58: Isoefficiency function

• The function f_E(p) of the number of computing agents p by which the problem size W must grow with p in order to maintain a given efficiency E

• Captures the effect of communication, load imbalance, contention, serial bottlenecks, etc.

Page 59: Isoefficiency function [figure only]

Pages 60–64: Scalability analysis

E = S/p = T_S/(p·T_P)

Since T_O = p·T_P − T_S, we have p·T_P = T_S + T_O.

Therefore E = T_S/(T_S + T_O) = k·W/(k·W + T_O), because T_S = k·W.

Solving for W: W = (E/(k·(1 − E))) · T_O

For a fixed target efficiency, W ∝ T_O.
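The rearranged formula can be checked numerically. `required_work` is our name for it, not the tutorial's:

```python
# Rearranging E = kW/(kW + T_O) for W gives W = E·T_O / (k·(1 − E)).
def required_work(E, T_O, k=1.0):
    """Problem size W needed to sustain efficiency E given overhead T_O."""
    return (E / (k * (1.0 - E))) * T_O

# Consistency check: plugging W back into the definition recovers E.
def efficiency_from(W, T_O, k=1.0):
    return k * W / (k * W + T_O)
```

With k = 1 and T_O = 10, holding E = 0.5 needs W = 10, while E = 0.8 already needs W = 40: higher target efficiency demands disproportionately more work.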

Pages 65–68: Scalability analysis: W = O(n^3)

Algorithm A

T_P = O(n^3/p) + O(n^2/√p)

W = O(n^3) ⇒ n^3 ~ n^2·√p, i.e. n ~ √p

W = O(n^3) = O(p^1.5)

Algorithm B

T_P = O(n^3/p) + O(n·√n)

W = O(n^3) ⇒ n^3 ~ n^1.5·p, i.e. n^1.5 ~ p

W = O(n^3) = O(p^2)
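A quick numeric check of these growth rates (the sample values of p are illustrative):

```python
# Isoefficiency functions derived above: Algorithm A needs W = Θ(p^1.5),
# Algorithm B needs W = Θ(p^2), so A stays efficient on smaller problems.
def iso_A(p):
    return p ** 1.5   # from n ~ √p and W = n^3

def iso_B(p):
    return p ** 2     # from n^1.5 ~ p and W = n^3

# B's required problem size outgrows A's by a factor of √p.
for p in (16, 256, 4096):
    assert iso_B(p) / iso_A(p) == p ** 0.5
```

At p = 4096 agents, Algorithm B needs a problem 64× larger than Algorithm A to hold the same efficiency.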

Page 69:

Parallel algorithm design and analysis

Dense Matrix-Vector Multiplication (1-D decomposition)

[Figure: y = A·x with the rows of A assigned in blocks to agents P1 through P9 (1-D decomposition)]

Page 70:

Parallel algorithm design and analysis

Dense Matrix-Vector Multiplication (2-D decomposition)

[Figure: y = A·x with A partitioned into a 3×3 grid of blocks among agents P1 through P9 (2-D decomposition)]

Pages 71–72: Parallel matrix-vector multiplication: 1-D and 2-D decompositions [figures only]

Pages 73–78: Scalability analysis of matrix-vector multiplication (1-D decomposition)

T_P = n^2/p + t_s·log p + t_w·n

T_O = t_s·p·log p + t_w·p·n

W = O(n^2)

Balancing W against each overhead term:

1: n^2 ~ p·log p

2: n^2 ~ p·n, or n ~ p, so W = O(n^2) = O(p^2)

Pages 79–84: Scalability analysis of matrix-vector multiplication (2-D decomposition)

T_P = n^2/p + t_s·log p + t_w·(n/√p)·log p

T_O = t_s·p·log p + t_w·n·√p·log p

W = O(n^2)

Balancing W against each overhead term:

1: n^2 ~ p·log p

2: n^2 ~ n·√p·log p, or n ~ √p·log p, so W = O(n^2) = O(p·log^2 p)
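Evaluating the two T_P models side by side makes the difference concrete. The values of t_s and t_w below are arbitrary illustrative constants, not measurements:

```python
# Parallel-time models for 1-D and 2-D matrix-vector multiplication,
# transcribed from the formulas above (t_s = message startup cost,
# t_w = per-word transfer cost).
from math import log2, sqrt

def tp_1d(n, p, ts=10.0, tw=1.0):
    return n * n / p + ts * log2(p) + tw * n

def tp_2d(n, p, ts=10.0, tw=1.0):
    return n * n / p + ts * log2(p) + tw * (n / sqrt(p)) * log2(p)

# For large p, the 1-D communication term t_w·n does not shrink with p,
# while the 2-D term t_w·(n/√p)·log p does.
n, p = 4096, 1024
assert tp_2d(n, p) < tp_1d(n, p)
```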

Page 85:

Isoefficiency function of dense matrix-vector multiplication

1-D decomposition: W ~ p^2

2-D decomposition: W ~ p·log^2 p

The 2-D decomposition is likely to yield higher speedups, require smaller problems to deliver those speedups, and scale more readily to larger numbers of computing agents.

Pages 86–90: Concluding remarks

• Parallelism is necessary for continued performance improvement

• Complex hierarchy of parallel computing hardware and programming paradigms

• Systematic top-down parallel application design

• Decomposition strategy is critical

• Analysis is important for understanding scalability

Page 91:

Thank you!