poster: fox


Upload: eric-van-hensbergen

Post on 14-Dec-2014


DESCRIPTION

FOX overview poster presented at the DOE Office of Science ASCR Exascale Research Kickoff Meeting. More details: http://fox.xstack.org

TRANSCRIPT

Page 1: Poster: FOX

SIMULATION

FOX: A Fault-Oblivious Extreme Scale Execution Environment

http://fox.xstack.org

Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable energy-constrained hardware environment with billions of threads of execution. Further, these systems must be designed with failure in mind.

FOX is a new system for the exascale that will support distributed data objects as first-class objects in the operating system itself. This memory-based data store will be named and accessed as part of the file system name space of the application. We can build many types of objects with this data store, including data-driven work queues, which will in turn support applications with inherent resilience.

Work queues are a familiar concept in many areas of computing; we will apply the work-queue concept to applications (a minimal sketch of the idea follows). Shown below is a graphical data-flow description of a quantum chemistry application, which can be executed using a work-queue approach.
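As a hedged illustration of the data-driven work-queue concept (the types and names below are hypothetical, not the FOX implementation), the sketch shows the essential mechanism: a task becomes ready only once every data block it depends on has been produced, mirroring the structure of the dependency graph in the figure.

#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// A unit of work plus a count of data blocks it is still waiting for.
struct Task {
  std::string name;
  int unmet_deps = 0;
  std::function<void()> run;
};

class WorkQueue {
 public:
  // Register a task together with the names of the data blocks it consumes.
  void add_task(Task t, const std::vector<std::string>& inputs) {
    t.unmet_deps = static_cast<int>(inputs.size());
    tasks_.push_back(std::move(t));
    const std::size_t id = tasks_.size() - 1;
    if (inputs.empty()) ready_.push(id);
    for (const std::string& block : inputs) consumers_[block].push_back(id);
  }

  // Called when a data block (e.g. "D(1,0)") has been produced.
  void block_produced(const std::string& block) {
    for (std::size_t id : consumers_[block])
      if (--tasks_[id].unmet_deps == 0) ready_.push(id);
  }

  // An idle worker pulls and executes the next ready task, if any.
  bool run_one() {
    if (ready_.empty()) return false;
    const std::size_t id = ready_.front();
    ready_.pop();
    tasks_[id].run();
    return true;
  }

 private:
  std::vector<Task> tasks_;
  std::map<std::string, std::vector<std::size_t>> consumers_;
  std::queue<std::size_t> ready_;
};

In this form, idle compute resources never wait at a global synchronization point; they simply drain whatever tasks the available data has made ready.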

We have simulated an application modified to use this model and, in simulation, parallel performance improves. Shown below is the comparative performance, with the traditional SPMD-style program on the top and the data-driven program on the bottom. White space indicates idle computing nodes. The second cycle begins at t=16.7 in the data-driven program, whereas it begins later, at t=16.9, in the SPMD program. In the data-driven program, utilization is higher, cycle times for an iteration are shorter, and work on the next cycle can begin well before the current cycle completes.

• Dan Schatzberg, Jonathan Appavoo, Orran Krieger, and Eric Van Hensbergen, “Scalable Elastic Systems Architecture”, to appear at the RESOLVE Workshop co-located with ASPLOS 2011, March 5, 2011.

• Dan Schatzberg, Jonathan Appavoo, Orran Krieger, and Eric Van Hensbergen, “Why Elasticity Matters”, under submission to HotOS 2011.

• N. Ali, S. Krishnamoorthy, M. Halappanavar, and J. Daily, “Tolerating Correlated Failures for Generalized Cartesian Distributions via Bipartite Matching”, ACM International Conference on Computing Frontiers (CF'11), May 2011 (accepted).

• N. Ali, S. Krishnamoorthy, N. Govind, and B. Palmer, “A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models”, 19th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP'11), February 2011.

• Roger Pearce, Maya Gokhale, and Nancy M. Amato, “Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory”, SC10 Technical Program.

OVERVIEW

OPERATING SYSTEM

Dependency graph for a portion of the Hartree-Fock procedure.

[Figure: two panels labeled “Elementary Operations” (the G, J, Fi, and F block operations with the matrix elements they consume and produce) and “Assembly Dependency Graph” (the resulting task graph over the D, S, F, G, J, Fi, Si, and R blocks).]

A comparison of processor utilization for traditional (top) and data-centric (bottom) approaches using simulated timings for a portion of the Hartree-Fock procedure.

SST/macro

The Structural Simulation Toolkit (SST) macroscale components provide the ability to explore the interaction of software and hardware for full-scale machines using a coarse-grained simulation approach. The parallel machine is represented by models which are used to estimate the performance of processing and network components. An application can be represented by a “skeleton” code which replicates the control flow and message-passing behavior of the real application without the cost of doing actual message passing or heavyweight computation.

A skeleton application models the control flow, communication pattern, and computation pattern of the application in a coarse-grained manner. This method provides a powerful approach to evaluate efficiency and scalability at extreme scales and to experiment with code reorganization or high-level refactoring without having to rewrite the numerical part of an application.

// Set up the instructions object to tell the processor model
// how many fused multiply-add instructions each compute call executes.
boost::shared_ptr<sstmac::eventdata> instructions =
    sstmac::eventdata::construct();
instructions->set_event("FMA", blockrowsize * blockcolsize * blocklnksize);

// Iterate over the remote row and column blocks.
for (int i = 0; i < nblock - 1; i++) {
  std::vector<sstmac::mpiapi::mpirequest_t> reqs;
  sstmac::mpiapi::mpirequest_t req;

  // Begin non-blocking left shift of A blocks.
  mpi()->isend(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myleft), sstmac::mpitag(0), world, req);
  reqs.push_back(req);
  mpi()->irecv(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myright), sstmac::mpitag(0), world, req);
  reqs.push_back(req);

  // Begin non-blocking down shift of B blocks.
  mpi()->isend(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(myup), sstmac::mpitag(0), world, req);
  reqs.push_back(req);
  mpi()->irecv(blocksize, sstmac::mpitype::mpi_double,
               sstmac::mpiid(mydown), sstmac::mpitag(0), world, req);
  reqs.push_back(req);

  // Simulate computation with the current blocks.
  compute_api()->compute(instructions);

  // Wait for the data needed for the next iteration.
  std::vector<sstmac::mpiapi::const_mpistatus_t> statuses;
  mpi()->waitall(reqs, statuses);
}

// Simulate computation with the blocks received during the last loop iteration.
compute_api()->compute(instructions);

The comparison between the traditional systolic matrix-matrix multiply algorithm and a fully asynchronous data-driven approach shows a key pitfall of many block-synchronous algorithms -- even though their theoretical peak scaling may be very good, significant performance degradation is seen for alternate network topologies or a non-ideal layout of tasks onto distributed nodes. The synchronized communications and static load balancing also lead to immediate performance degradation if any node is running at reduced performance. It is also readily apparent from these results that even very simple dynamic load balancing quickly leads to significant gains when the performance of one or more nodes is degraded.
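The "simple dynamic load balancing" referred to here can be sketched as workers of unequal speed draining a shared pool of block tasks, so a degraded node simply completes fewer blocks rather than stalling its neighbors. This is an illustrative sketch only; block_multiply, the block count, and the worker count are placeholders, not values from the simulation.

#include <atomic>
#include <thread>
#include <vector>

// Placeholder for the real kernel: multiply one pair of submatrices.
void block_multiply(int /*block_id*/) {}

int main() {
  const int num_blocks = 4096;   // total block tasks in the multiplication
  const int num_workers = 8;     // compute resources of possibly unequal speed
  std::atomic<int> next_block{0};

  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&] {
      // Each worker claims the next unassigned block until the pool is empty,
      // so a slow worker naturally ends up with fewer blocks.
      for (int b = next_block++; b < num_blocks; b = next_block++)
        block_multiply(b);
    });
  }
  for (std::thread& t : workers) t.join();
}

In the static systolic decomposition, by contrast, the slowest node sets the pace of every iteration.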

Using simulation to explore fault-oblivious programming models

Rewriting an application in a new programming model requires major effort, so it is highly desirable to understand how easily algorithms can be expressed in the new model, and what performance advantages to expect on future machines, before committing to the rewrite. Skeleton applications provide the basis for programming-model exploration in SST/macro. The application control flow is represented by lightweight threads, each of which represents one or more threads of the real application (depending on the level of detail desired in the model). This approach allows programmers to specify control flow in a straightforward way.

A code fragment implementing a skeleton program for a dense matrix multiplication (restricted to square blocks).

PUBLICATIONS

Checksum-based schemes ensure fault tolerance of read-only data with much lower space overheads. We have extended the approaches used by fault-tolerant linear algebra, an algorithmic fault-tolerance technique for linear algebraic operations, to compute parities and map them to processors. The scalable algorithms we developed support generalized Cartesian distributions involving arbitrary processor counts, arbitrary specification of failure units, and colocation of parities with data blocks. In the 2-D processor grid in the figure, note the irregular distribution of processors (denoted P_(i,j)) among the failure units (denoted N_k).

For modified data, we create shadow copies of important data structures. All operations that modify the original data structures are duplicated on the shadow data. By mirroring the application data on distinct failure units, we ensure that its state can be recovered in the event of failures. We ensured fault tolerance for the key matrices in the most expensive Hartree-Fock procedure (shown above) with almost unnoticeable overhead.
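A minimal sketch of the shadow-copy approach, assuming a hypothetical put_block interface rather than the actual NWChem/Global Arrays calls: every modification is applied to the primary block and to a shadow block placed on a different failure unit.

#include <vector>

// Placeholder for a one-sided put into a distributed matrix; the real code
// would target processors belonging to the given failure unit.
struct DistMatrix {
  void put_block(int /*row*/, int /*col*/, const std::vector<double>& /*data*/,
                 int /*failure_unit*/) {}
};

struct ShadowedMatrix {
  DistMatrix primary;
  DistMatrix shadow;

  // Every modification is duplicated: once on the primary copy and once on a
  // shadow copy that lives on a distinct failure unit, so either copy can be
  // used to restore the other after a failure.
  void update_block(int row, int col, const std::vector<double>& data,
                    int primary_unit, int shadow_unit) {
    primary.put_block(row, col, data, primary_unit);
    shadow.put_block(row, col, data, shadow_unit);
  }
};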

Fault-tolerant data stores (shown to the right) are based on the premise that applications have a few critical data structures and that each data structure is accessed in a specific form in a given phase. By limiting our attention to these significant data structures, we limit the state to be saved. The access mode associated with a data structure in a given phase can be used to tailor the form of the data store; a minimal sketch of this choice follows. Rather than treating an SMP node as the unit of failure, our approaches can adapt to arbitrary failure units (e.g., processors sharing a power supply unit).
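The selection sketched below is illustrative only (the enums and function are hypothetical): data that is read-only in a phase can be protected cheaply with checksum parities, while data modified in that phase falls back to duplicated updates on a shadow copy.

// Hypothetical enums/function illustrating how the declared access mode of a
// data structure in a given phase selects the cheapest adequate protection.
enum class AccessMode { ReadOnly, ReadWrite };
enum class Protection { ChecksumParity, ShadowCopy };

Protection choose_protection(AccessMode mode) {
  // Read-only data is protected with low-overhead checksum parities;
  // data modified in this phase needs duplicated updates on a shadow copy.
  return mode == AccessMode::ReadOnly ? Protection::ChecksumParity
                                      : Protection::ShadowCopy;
}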

The graph to the left shows the cost of the two-electron contribution (twoel) in the NWChem Hartree-Fock calculation with and without maintenance of shadow copies (Fock matrix size 3600x3600). The two lines overlap because the overhead is small, and the error bars are not visible because the standard deviation is small.

The graph to the right shows the time to determine the location and distribution of parities for a given data distribution (and different FT distribution schemes) as the process count increases. This time remains small even for large processor counts, demonstrating the feasibility of this approach on future extreme-scale systems. This approach was used to compute the checksums and to recover lost blocks on failure. Both actions are shown to be scalable (details in the CF'11 publication).
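A simplified sketch of the checksum idea, not the CF'11 algorithms themselves: a sum parity is kept for a group of equally sized read-only blocks, and a single lost block is reconstructed by subtracting the surviving blocks from the parity.

#include <cassert>
#include <cstddef>
#include <vector>

using Block = std::vector<double>;

// Parity of a group of equally sized blocks: element-wise sum.
Block compute_parity(const std::vector<Block>& group) {
  assert(!group.empty());
  Block parity(group[0].size(), 0.0);
  for (const Block& b : group)
    for (std::size_t i = 0; i < b.size(); ++i) parity[i] += b[i];
  return parity;
}

// Recover the block at index 'lost' by subtracting the surviving blocks
// from the stored parity.
Block recover_block(const std::vector<Block>& group, std::size_t lost,
                    const Block& parity) {
  Block recovered = parity;
  for (std::size_t k = 0; k < group.size(); ++k) {
    if (k == lost) continue;
    for (std::size_t i = 0; i < recovered.size(); ++i)
      recovered[i] -= group[k][i];
  }
  return recovered;
}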

In addition to performance, we are studying the energy consumption of fault-tolerance techniques. The graph to the left is the power signature of the HF implementation with 4Be on a rack of 224 cores on NW-ICE (with liquid cooling, i.e., local cooling and PSU included). We use time dilation to show the 11 power spikes representing the 11 iterations until convergence. The initial spike is attributed to initialization and the generation of the Schwarz matrix.

Exploration of different infrastructures that enable increasing levels of system specialization for applications at decreasing levels of granularity, with the goal of increasing scalability, reliability, performance, and utilization.

Emerging exascale applications will feature dynamic mapping of resources to computational nodes. The goal of the Scalable Elastic Systems Architecture (SESA) is to provide system support for these new environments.

SESA is a multi-layer organization that exploits prior results on scalable SMP software. It introduces an infrastructure for developing application-tuned libraries of elastic components that encapsulate and account for hardware change.

Our LibraryOS approach to SESA allows us to collapse deep software stacks into a single layer that is tightly coupled to the underlying hardware, increasing performance and efficiency and lowering the latency of key operations.

The proposed framework provides a simple, uniform file system abstraction as the mechanism for parallel computation and distribution of data to computation blocks.

Locating and accessing tasks from a process's work queue is accomplished through normal file system operations such as listing directory contents and opening files.

Data distribution and access are performed by writing and reading files within directory hierarchies.
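As a hedged sketch of this interface (the paths and directory layout are hypothetical, modeled on the /data/dht_A example in the figure), a worker could claim tasks and read distributed data with nothing more than directory listings and ordinary file I/O:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main() {
  const fs::path queue_dir = "/queue/app1";   // hypothetical per-process work queue
  if (!fs::exists(queue_dir)) return 0;

  // Listing the directory is how a worker discovers its pending tasks.
  for (const fs::directory_entry& task : fs::directory_iterator(queue_dir)) {
    std::ifstream desc(task.path());          // open the task description file
    std::string data_path;
    std::getline(desc, data_path);            // e.g. "/data/dht_A"

    // The named distributed data object is read through the same namespace;
    // the OS resolves which node actually stores each part.
    std::ifstream data(data_path, std::ios::binary);
    std::cout << "running " << task.path().filename()
              << " on " << data_path << "\n";

    // ... compute, then write results back as files under /data ...
  }
}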

Shown above is the user view (at the top) of the file /data/dht_A and the way the view of one file maps down to the resources provided by one node (at the bottom). The node contains a part of the file /data/dht_A, which is distributed across several nodes with redundant block storage. The OS presents a fault-oblivious abstraction to the application and user, with the fault-handling and fault-hiding infrastructure residing in the work distribution, data distribution, and systems software support layers.

Weak-scaling parallel efficiency of the systolic matrix-matrix multiply algorithm under a variety of conditions: 1) full crossbar network (DS), 2) fat-tree formed from radix 36 switches (DS), 3) 2D torus network, 4) 2D torus network (DS), 5) fat-tree formed from radix 36 switches with a single degraded node (DS), 6) 2D torus with a single degraded node. DS designates use of the direct send algorithm.

In the “degraded node” runs, the computational performance of one node was reduced to half that of the other nodes in the system. This type of performance reduction is found in real systems due to thermal ramp-down or resource contention.

[Plot: parallel efficiency (0.00 to 1.00) versus number of cores (10^2 to 10^6) for Crossbar (DS), Fat-tree (DS), Torus, Torus (DS), Degraded Fat-tree (DS), and Degraded Torus.]

A comparison of the effect of a single degraded node on the parallel efficiency of an actor-based algorithm with simple dynamic load-balancing and the traditional systolic algorithm. A fat-tree formed from radix 36 switches was simulated.

In both the systolic and direct send algorithms, the full matrix is evenly decomposed onto all nodes in the parallel system. The systolic algorithm diffuses data between nodes as the calculation progresses, while the direct send (DS) algorithm fetches data directly from the neighbor node that owns the given submatrix. The actor-model implementation uses the direct send algorithm exclusively.

[Plot: parallel efficiency (0.00 to 1.00) versus number of nodes (1 to 4096) for theoretically optimal scaling, expected scaling without load balancing, simple dynamic load balancing, and the systolic algorithm.]

[Diagram: a coarse-grained operation queue feeding fine-grained, preload, and ready queues on compute resources node 0 through node n.]

[Diagram: the user's desktop view (% ls /data/dht_A ...) passes through an abstract file interface to the fault-oblivious layers (application workflow, App 1 ... App i) and the fault-hiding layers (programming model support, task management, distributed data store, systems software support); one node holds file parts such as dht_A.part1 ... partn, matrix_C.part3, per-application task files, and task-management data.]

FAULT TOLERANCE

[Plot: time per step (s, 10^0 to 10^4) versus number of processors (32 to 2048) for NWChem twoel and NWChem twoel + FT.]

[Plot: compute time (µs, 10^2 to 10^6) versus number of processor cores (2^10 to 2^18) for the ALL-ROW, PARTIAL-ROW, ALL-SYM, and PARTIAL-SYM schemes.]