
CX: A Scalable, Robust Network for Parallel Computing

Peter Cappello & Dimitrios Mourloukos

Computer Science

UCSB

2

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

3–8

Introduction

• “Listen to the technology!” Carver Mead

• What is the technology telling us?

– Internet’s idle cycles/sec growing rapidly

– Bandwidth is increasing & getting cheaper

– Communication latency is not decreasing

– Human technology is getting neither cheaper nor faster.

9–11

Introduction

Project Goals

1. Minimize job completion time despite large communication latency

2. Jobs complete with high probability despite faulty components

3. Application program is oblivious to:

• Number of processors

• Inter-process communication

• Fault tolerance

12–13

Introduction

Fundamental Issue: Heterogeneity

[Figure: five machines M1/OS1 … M5/OS5, each with a different OS, all running a functionally homogeneous JVM]

14

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

15

Related work

• Cilk / Cilk-NOW / Atlas

– DAG computational model

– Work-stealing

16

Related work

• Linda / Piranha / JavaSpaces

– Space-based coordination

– Decoupled communication

17

Related work

• Charlotte (Milan project / Calypso prototype)

– High performance; fault tolerance not achieved via transactions

– Fault tolerance via eager scheduling

18

Related work

• SuperWeb, Javelin, Javelin++

– Architecture: client, broker, host

19

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

20

API

DAG Computational model

int f( int n ) {
  if ( n < 2 )
    return n;
  else
    return f( n-1 ) + f( n-2 );
}
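The slide’s recursive method is directly runnable; a minimal Java version (the class name `Fib` is mine):

```java
// Doubly recursive Fibonacci, as on the slide. Each call f(n)
// makes two subcalls, forming the method invocation tree that
// the DAG computational model parallelizes.
public class Fib {
    static int f(int n) {
        if (n < 2) return n;
        else return f(n - 1) + f(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(f(4));  // 3 -- the slides trace f(4)
    }
}
```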

21–24

DAG Computational Model

int f( int n ) {
  if ( n < 2 ) return n;
  else return f( n-1 ) + f( n-2 );
}

Method invocation tree for f(4), built level by level:

f(4)
f(3) f(2)
f(2) f(1) f(1) f(0)
f(1) f(0)

25–28

DAG Computational Model / API

Decomposition task for f(n):

execute( ) {
  if ( n < 2 )
    setArg( ArgAddr, n );
  else {
    spawn ( + );
    spawn ( f(n-1) );
    spawn ( f(n-2) );
  }
}

Composition task (+):

execute( ) {
  setArg( ArgAddr, in[0] + in[1] );
}

[Figure: the f(4) invocation tree with a + composition node collecting each pair of child results]
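The two execute() bodies can be sketched as Java task objects driven by a ready stack. Everything below (Task, FibTask, SumTask, Dag, the single-threaded loop standing in for CX’s distributed task server) is my own illustration, not the CX API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the DAG model: a decomposition task either produces a
// value (base case) or spawns two subtasks plus a "+" composition
// task that sums their results. A local stack stands in for the
// task server's READY queue; a "+" task waits until both inputs fill.
abstract class Task {
    Task successor;          // where this task's result is delivered
    int slot;                // which input slot of the successor
    int[] in = new int[2];   // inputs of a composition task
    int filled = 0;

    abstract void execute(Deque<Task> ready);

    void setArg(Deque<Task> ready, int value) {
        if (successor == null) { Dag.result = value; return; }
        successor.in[slot] = value;
        if (++successor.filled == 2) ready.push(successor); // + is now ready
    }
}

class FibTask extends Task {
    final int n;
    FibTask(int n) { this.n = n; }

    void execute(Deque<Task> ready) {
        if (n < 2) { setArg(ready, n); return; }   // base case
        Task sum = new SumTask();                  // spawn ( + )
        sum.successor = successor;
        sum.slot = slot;
        Task left = new FibTask(n - 1);            // spawn ( f(n-1) )
        left.successor = sum;  left.slot = 0;
        Task right = new FibTask(n - 2);           // spawn ( f(n-2) )
        right.successor = sum; right.slot = 1;
        ready.push(left);
        ready.push(right);
    }
}

class SumTask extends Task {
    void execute(Deque<Task> ready) { setArg(ready, in[0] + in[1]); }
}

public class Dag {
    static int result;

    public static void main(String[] args) {
        Deque<Task> ready = new ArrayDeque<>();
        ready.push(new FibTask(4));
        while (!ready.isEmpty()) ready.pop().execute(ready);
        System.out.println(result);   // 3, i.e. f(4)
    }
}
```

Note that correctness does not depend on the order in which ready tasks run: a composition task fires only once both of its inputs have arrived, which is exactly what lets CX distribute the ready tasks across producers.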

29

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

30

Architecture: Basic Entities

[Figure: CONSUMER connected to a PRODUCTION NETWORK, a network of clusters]

Consumer protocol: register ( spawn | getResult )* unregister

Architecture: Cluster

[Figure: one TASK SERVER serving multiple PRODUCERs]

32–35

A Cluster at Work

[Animation: the consumer spawns f(4); the task enters the task server’s READY queue and is assigned to a producer]

36

Decompose

execute( ) {
  if ( n < 2 )
    setArg( ArgAddr, n );
  else {
    spawn ( + );
    spawn ( f(n-1) );
    spawn ( f(n-2) );
  }
}

37–44

A Cluster at Work

[Animation: producers repeatedly fetch READY tasks and decompose them; spawned + composition tasks sit in WAITING until their inputs arrive, while f(3), f(2), … re-enter READY]

45

Compute Base Case

execute( ) {
  if ( n < 2 )
    setArg( ArgAddr, n );
  else {
    spawn ( + );
    spawn ( f(n-1) );
    spawn ( f(n-2) );
  }
}

46–54

A Cluster at Work

[Animation: base-case tasks f(1) and f(0) compute their values; setArg feeds each value to its waiting + task, which moves to READY once both inputs are filled]

55

Compose

execute( ) {
  setArg( ArgAddr, in[0] + in[1] );
}

56–76

A Cluster at Work

[Animation: + tasks combine partial sums up the tree until a single result R remains at the task server]

1. Result object is sent to Production Network

2. Production Network returns it to Consumer

77

Task Server Proxy: Overlap Communication with Computation

[Figure: the PRODUCER runs COMP and COMM threads sharing INBOX/OUTBOX queues; the TASK SERVER holds READY/WAITING priority queues]
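The proxy’s idea, a computation thread that never blocks on the network, can be sketched with two queues and a dedicated worker. All names below, and the squaring “task”, are hypothetical stand-ins:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the task-server proxy: the COMP thread only touches the
// INBOX and OUTBOX queues, so it never waits on the network; the
// COMM side ships tasks in and drains results out concurrently,
// overlapping communication latency with computation.
public class ProxySketch {

    // Run a batch of "tasks"; squaring stands in for execute().
    static int[] runBatch(int[] tasks) throws InterruptedException {
        BlockingQueue<Integer> inbox = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> outbox = new LinkedBlockingQueue<>();

        Thread comp = new Thread(() -> {          // COMP: compute only
            try {
                for (int i = 0; i < tasks.length; i++) {
                    int t = inbox.take();
                    outbox.put(t * t);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        comp.start();

        // COMM side: prefetch the whole batch, then collect results.
        for (int t : tasks) inbox.put(t);
        int[] results = new int[tasks.length];
        for (int i = 0; i < results.length; i++) results[i] = outbox.take();
        comp.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = runBatch(new int[]{1, 2, 3});
        System.out.println(r[0] + " " + r[1] + " " + r[2]);  // 1 4 9
    }
}
```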

78

Architecture: Work stealing & eager scheduling

• A task is removed from the server only after a complete signal is received.

• A task may be assigned to multiple producers

– Balances task load among producers of varying processor speeds

– Tasks on failed/retreating producers are re-assigned.
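Eager scheduling can be sketched as a table of outstanding tasks that only a complete signal removes, so the server is free to hand the same task to a second producer in the meantime. Names here are illustrative, not CX’s API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of eager scheduling: assign() may hand out a task that is
// already running elsewhere; a task leaves the table only when
// complete() is called, so work on failed or retreating producers
// is simply re-assigned to the next idle producer.
public class EagerServer {
    private final Map<String, Runnable> outstanding = new LinkedHashMap<>();

    void submit(String taskId, Runnable task) { outstanding.put(taskId, task); }

    // Hand out the oldest outstanding task; duplicates are allowed.
    Map.Entry<String, Runnable> assign() {
        return outstanding.isEmpty() ? null
             : outstanding.entrySet().iterator().next();
    }

    // Only a complete signal removes the task.
    void complete(String taskId) { outstanding.remove(taskId); }

    int pending() { return outstanding.size(); }

    public static void main(String[] args) {
        EagerServer s = new EagerServer();
        s.submit("0.1", () -> {});
        // Two producers may receive the same task:
        var a = s.assign();
        var b = s.assign();
        System.out.println(a.getKey().equals(b.getKey()));  // true
        s.complete("0.1");
        System.out.println(s.pending());                    // 0
    }
}
```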

79

Architecture: Scalability

• A cluster tolerates producer:

– Retreat

– Failure

• A single task server, however, is a:

– Bottleneck

– Single point of failure.

• We therefore introduce a network of task servers.

80

Scalability: Class loading

1. The CX class loader loads classes (from the Consumer JAR) into each server’s class cache

2. Each producer loads classes from its server

81–83

Scalability: Fault-tolerance

Replicate a server’s tasks on its sibling.

When a server fails, its sibling restores the state to a replacement server.
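A sketch of the replication idea, assuming every stored task is mirrored on the sibling and failover copies the sibling’s saved table onto a replacement. This is a simplification (a real server would keep its own tasks separate from replicas), and all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of sibling replication: each store() is mirrored on the
// sibling, so when a server fails, its sibling can restore the
// saved state to a fresh replacement server.
public class SiblingSketch {
    final Map<String, String> tasks = new HashMap<>();
    SiblingSketch sibling;

    void store(String id, String task) {
        tasks.put(id, task);
        sibling.tasks.put(id, task);       // replicate on the sibling
    }

    // The sibling restores state to a replacement after a failure.
    static SiblingSketch replace(SiblingSketch survivingSibling) {
        SiblingSketch replacement = new SiblingSketch();
        replacement.tasks.putAll(survivingSibling.tasks);
        replacement.sibling = survivingSibling;
        survivingSibling.sibling = replacement;
        return replacement;
    }

    public static void main(String[] args) {
        SiblingSketch a = new SiblingSketch(), b = new SiblingSketch();
        a.sibling = b;
        b.sibling = a;
        a.store("0.1", "f(3)");
        // a fails; b restores a's tasks to a replacement server:
        SiblingSketch a2 = SiblingSketch.replace(b);
        System.out.println(a2.tasks.get("0.1"));   // f(3)
    }
}
```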

84

Architecture

Production network of clusters

• The network tolerates a single server failure.

• Restoring the ability to tolerate a single failure yields the ability to tolerate a sequence of failures.

85

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

86

Preliminary experiments

• Experiments run on a Linux cluster

– 100-port Lucent P550 Cajun Gigabit Switch

• Each machine

– 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards

– Red Hat Linux 6.0

– JDK 1.2.2_RC3

– Heterogeneous

• processor speeds

• processors/machine

87

Fibonacci Tasks with Synthetic Load

Decomposition task:

execute( ) {
  if ( n < 2 ) {
    synthetic workload();
    setArg( ArgAddr, n );
  } else {
    synthetic workload();
    spawn ( + );
    spawn ( f(n-1) );
    spawn ( f(n-2) );
  }
}

Composition task:

execute( ) {
  synthetic workload();
  setArg( ArgAddr, in[0] + in[1] );
}

88

TSEQ vs. T1 (seconds), computing F(8)

Workload (sec)   TSEQ      T1        Efficiency
4.522            497.420   518.816   0.96
3.740            415.140   436.897   0.95
2.504            280.448   297.474   0.94
1.576            179.664   199.423   0.90
0.914            106.024   120.807   0.88
0.468             56.160    65.767   0.85
0.198             24.750    29.553   0.84
0.058              8.120    11.386   0.71

89

Parallel Efficiency over 60 nodes

[Chart: parallel efficiency vs. problem size, F(13) through F(18), for Workload 1 and Workload 2]

Parallel efficiency for F(13) = 0.87
Parallel efficiency for F(18) = 0.99

Average task time: Workload 1 = 1.8 sec; Workload 2 = 3.7 sec.

90

Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work

91

Current work

• Implement CX market maker (broker)

– Solves the discovery problem between Consumers & Production networks

• Enhance Producer with Lea’s Fork/Join Framework

– See gee.cs.oswego.edu

[Figure: multiple CONSUMERs and PRODUCTION NETWORKs matched through a MARKET MAKER, a JINI Service]

92–93

Current work

• Enhance computational model: branch & bound.

– Propagate new bounds through the production network: 3 steps

[Figure: PRODUCTION NETWORK propagating a new bound over the SEARCH TREE; bounded branches are terminated]

94

Current work

• Investigate computations that appear ill-suited to adaptive parallelism

– SOR

– N-body.

96

Introduction

Fundamental Issues

• Communication latency

Long latency → overlap computation with communication.

• Robustness

Massive parallelism → faults

• Scalability

Massive parallelism → login privileges cannot be required.

• Ease of use

Jini → easy upgrade of system components

97

Related work

• Market mechanisms

– Huberman, Waldspurger, Malone, Miller & Drexler, Newhouse & Darlington

98

Related work

• CX integrates

– DAG computational model

– Work-stealing scheduler

– Space-based, decoupled communication

– Fault-tolerance via eager scheduling

– Market mechanisms (incentive to participate)

99

Architecture: Task identifier

• The DAG has a spawn tree

• TaskID = path id

• Root.TaskID = 0

• TaskID is used to detect duplicate:

– Tasks

– Results.

[Figure: spawn tree of F(4) with child edges labeled 0, 1, 2; a task’s id is the label path from the root]
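The path-id scheme can be sketched directly, assuming a string encoding (my own; the real representation is not specified here): the root’s id is "0", each spawned child appends its child index, and a set of seen ids filters duplicates:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of path-based task identifiers: Root.TaskID = "0", and the
// i-th spawned child appends ".i". Because duplicate assignments of
// a task carry the same id, a server can discard duplicate tasks
// and duplicate results by id. The string encoding is illustrative.
public class TaskIds {
    static String child(String parentId, int index) {
        return parentId + "." + index;
    }

    public static void main(String[] args) {
        String root = "0";
        String plus  = child(root, 0);   // e.g. the + composition task
        String left  = child(root, 1);   // e.g. f(3)
        String right = child(root, 2);   // e.g. f(2)

        Set<String> seen = new HashSet<>();
        System.out.println(seen.add(left));   // true: first copy kept
        System.out.println(seen.add(left));   // false: duplicate dropped
        System.out.println(right);            // 0.2
    }
}
```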

100

Architecture: Basic Entities

• Consumer

Seeks computing resources.

• Producer

Offers computing resources.

• Task Server

Coordinates task distribution among its producers.

• Production Network

A network of task servers & their associated producers.

101

Defining Parallel Efficiency

• Scalar: homogeneous set of P machines:

Parallel efficiency = ( T1 / P ) / TP

• Vector: heterogeneous set of P machines:

P = [ P1, P2, …, Pd ], where there are P1 machines of type 1, P2 machines of type 2, …, Pd machines of type d:

Parallel efficiency = ( P1/T1 + P2/T2 + … + Pd/Td )^(–1) / TP
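Both definitions are easy to evaluate; the machine counts and times below are invented for illustration, not measurements from the experiments:

```java
// Scalar (homogeneous) and vector (heterogeneous) parallel
// efficiency, as defined above. All numbers are made up.
public class Efficiency {
    // Homogeneous: (T1 / P) / TP
    static double scalar(double t1, int p, double tp) {
        return (t1 / p) / tp;
    }

    // Heterogeneous: ( P1/T1 + ... + Pd/Td )^(-1) / TP,
    // where Ti is the solo time of one type-i machine. The sum is
    // the aggregate work rate, so its inverse is the ideal time.
    static double vector(int[] p, double[] t, double tp) {
        double rate = 0;
        for (int i = 0; i < p.length; i++) rate += p[i] / t[i];
        return (1.0 / rate) / tp;
    }

    public static void main(String[] args) {
        // 10 identical machines; 500 s alone, 52 s in parallel:
        System.out.println(scalar(500.0, 10, 52.0));            // ~0.96
        // 4 machines taking 100 s solo plus 2 taking 50 s solo:
        System.out.println(vector(new int[]{4, 2},
                                  new double[]{100.0, 50.0}, 14.0)); // ~0.89
    }
}
```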

102

Future work

• Support special hardware / data: inter-server task movement.

– Diffusion model: tasks are homogeneous gas atoms diffusing through the network.

– N-body model: each kind of atom (task) has its own:

• Mass (resistance to movement: code size, input size, …)

• Attraction/repulsion to different servers, or to other “massive” entities, such as:

» special processors

» large databases.

103

Future Work

• CX preprocessor to simplify API.
