falkon, a fast and lightfalkon, a fast and light--weight weight task...

36
Falkon, a Fast and Light Falkon, a Fast and Light-weight weight tasK executiON framework for tasK executiON framework for Clusters, Grids, and Clusters, Grids, and Supercomputers Supercomputers Supercomputers Supercomputers Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago In Collaboration with: Ian Foster, Mike Wilde, Zhao Zhang, Catalin Dumitrescu, Yong Zhao, Pete Beckman, Kamil Iskra, Alex Szalay, Philip Little, Christopher Moretti, Amitabh Chaudhary, Douglas Thain, Ben Clifford, plus others… IEEE/ACM Supercomputing 2008 Argonne National Laboratory Booth November 19 th , 2008

Upload: others

Post on 09-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK executiON framework for tasK executiON framework for

Clusters, Grids, and Clusters, Grids, and SupercomputersSupercomputersSupercomputersSupercomputers

Ioan RaicuDistributed Systems LaboratoryComputer Science Department

University of Chicago

In Collaboration with: Ian Foster, Mike Wilde, Zhao Zhang, Catalin Dumitrescu, Yong Zhao, Pete Beckman, Kamil Iskra, Alex Szalay, Philip Little, Christopher Moretti, Amitabh Chaudhary, Douglas Thain, Ben Clifford, plus others…

IEEE/ACM Supercomputing 2008Argonne National Laboratory Booth

November 19th, 2008

Page 2: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Problem Types

Input Data Size

Hi

Data Analysis, Mining

Big Data and Many Tasks

Input Data Size

Hi

Data Analysis, Mining

Big Data and Many Tasks

2Number of Tasks

Med

Low1 1K 1M

Heroic MPI

Tasks Many Loosely Coupled Apps

Number of Tasks

Med

Low1 1K 1M

Heroic MPI

Tasks Many Loosely Coupled Apps

Page 3: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

MTC: Many Task Computing

• Bridge the gap between HPC and HTC• Loosely coupled applications with HPC orientations• HPC comprising of multiple distinct activities, coupled

via file system operations or message passingvia file system operations or message passing• Emphasis on many resources over short time periods• Tasks can be:

– small or large, independent and dependent, uniprocessor or multiprocessor, compute-intensive or data-intensive, static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled, large number of tasks, large quantity of computing, and large volumes of data… 3

Page 4: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Obstacles running MTC apps in Clusters/Grids

10000

100000

1000000

ough

put (

Mb/

s)

GPFS RLOCAL RGPFS R+WLOCAL R+W

10000

100000

1000000

ough

put (

Mb/

s)

GPFS RLOCAL RGPFS R+WLOCAL R+W

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

4

System Comments Throughput (tasks/sec)

Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45

Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2Condor (v6.8.2) - Production 0.42

Condor (v6.9.3) - Development 11Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22

System Comments Throughput (tasks/sec)

Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45

Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2Condor (v6.8.2) - Production 0.42

Condor (v6.9.3) - Development 11Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22

100

1000

1 10 100 1000Number of Nodes

Thro

100

1000

1 10 100 1000Number of Nodes

Thro

Page 5: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Solutions

• Falkon: A Fast and Light-weight tasK executiON framework– Goal: enable the rapid and efficient execution of many independent

jobs on large compute clusters– Combines three components:

• A streamlined task dispatcher• Resource provisioning through multi-level scheduling techniques

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

5

Resource provisioning through multi level scheduling techniques• Data diffusion and data-aware scheduling to leverage the co-located

computational and storage resources

• Swift: A parallel programming system for loosely coupled applications– Applications cover many domains: Astronomy, astro-physics, medicine,

chemistry, economics, climate modeling, data analytics

Page 6: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon Overview

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

6

Page 7: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Distributed Falkon Architecture

Dispatcher1

Executor1

Client

Login Nodes(x10)

I/O Nodes(x640)

Compute Nodes (x40K)

7

Provisioner

Cobalt

ClientExecutor

256

DispatcherN

Executor1

Executor256

Page 8: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Dispatch Throughput

30003500400045005000

2534

3186 3071

(task

s/se

c)

30003500400045005000

2534

3186 3071

(task

s/se

c)

8

0500

10001500200025003000

ANL/UC, Java 200 CPUs 1 service

ANL/UC, C 200 CPUs1 service

SiCortex, C 5760 CPUs

1 service

BlueGene/P, C 4096 CPUs

1 service

BlueGene/P, C163840 CPUs 640 services

604

1758

Thro

ughp

ut (

Executor Implementation and Various Systems

0500

10001500200025003000

ANL/UC, Java 200 CPUs 1 service

ANL/UC, C 200 CPUs1 service

SiCortex, C 5760 CPUs

1 service

BlueGene/P, C 4096 CPUs

1 service

BlueGene/P, C163840 CPUs 640 services

604

1758

Thro

ughp

ut (

Executor Implementation and Various SystemsRunning 1 Million Jobs in 10 Minutes via the Falkon Fast and Light-weight tasK executiON framework

System Comments Throughput (tasks/sec)

Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45

Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2Condor (v6.8.2) - Production 0.42

Condor (v6.9.3) - Development 11Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22

System Comments Throughput (tasks/sec)

Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45

Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2Condor (v6.8.2) - Production 0.42

Condor (v6.9.3) - Development 11Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22

Page 9: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Efficiency

90%

100%30%

40%

50%

60%

70%

80%

90%

100%

Effic

ienc

y

32 seconds16 seconds8 seconds4 seconds2 seconds1 second

90%

10%

20%

30%

40%

50%

60%

70%

80%

90%

256 1024 4096 16384 65536 163840

Effic

ienc

y

Number of Processors

256 seconds128 seconds64 seconds32 seconds16 seconds8 seconds4 seconds2 seconds1 second

0%

10%

20%

Number of Processors

Page 10: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon Endurance Test

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

10

Page 11: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon Monitoring

• Workload• 160K CPUs

1M tasks

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

11

• 1M tasks• 60 sec per task

• 17.5K CPU hours in 7.5 min• Throughput: 2312 tasks/sec• 85% efficiency

Page 12: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon Activity History (10 months)

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

12

Page 13: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Virtual Node(s)Abstractcomputation

S iftS i t

Specification Execution

Virtual Node(s)

file1

Scheduling

Execution Engine(Karajan w/

Swift Runtime)

Swift Architecture

Provisioning

FalkonResource

Provisioner

SwiftScript

Virtual DataCatalog

SwiftScriptCompiler

Provenancedata

ProvenancedataProvenance

collector

launcher

launcher

file2

file3

AppF1

AppF2

Swift runtimecallouts

CC CC

Status reportingAmazon

EC2

13

Page 14: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Functional MRI (fMRI)

Scalable Resource Management in Clouds and Grids 14

• Wide range of analyses– Testing, interactive analysis,

production runs– Data mining– Parameter studies

Page 15: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Completed Milestones: fMRI Application

48085000

6000GRAMGRAM/Clustering

• GRAM vs. Falkon: 85%~90% lower run time• GRAM/Clustering vs. Falkon: 40%~74% lower run time

Scalable Resource Management in Clouds and Grids 15Falkon: a Fast and Light-weight tasK executiON framework

1239

2510

3683

456866 992 1123

120 327 546 678

0

1000

2000

3000

4000

120 240 360 480Input Data Size (Volumes)

Tim

e (s

)

Falkon

Page 16: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

B. Berriman, J. Good (Caltech)J. Jacob, D. Katz (JPL)

Scalable Resource Management in Clouds and Grids 16

Page 17: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Completed Milestones: Montage Application

3000

3500

GRAM/ClusteringMPI

• GRAM/Clustering vs. Falkon: 57% lower application run time• MPI* vs. Falkon: 4% higher application run time• * MPI should be lower bound

Scalable Resource Management in Clouds and Grids 17Falkon: a Fast and Light-weight tasK executiON framework

0

500

1000

1500

2000

2500

mProjec

t

mDiff/Fit

mBackg

round

mAdd(su

b)

mAdd total

Components

Tim

e (s

)

MPIFalkon

Page 18: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Hadoop vs. Swift

• Classic benchmarks for MapReduce– Word Count– Sort

• Swift performs similar or better than Hadoop

Scalable Resource Management in Clouds and Grids 18

p p(on 32 processors)

Sort

4285

733

25

83

512

1

10

100

1000

10000

10MB 100MB 1000MBData Size

Tim

e (s

ec)

Swift+Falkon

Hadoop

Word Count

221

11431795

863

46887860

1

10

100

1000

10000

75MB 350MB 703MBData Size

Tim

e (s

ec)

Swift+PBSHadoop

Page 19: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Molecular Dynamics

• Determination of free energies in aqueous solution– Antechamber – coordinates

Charmm solution– Charmm – solution– Charmm - free energy

19Scalable Resource Management in Clouds and Grids

Page 20: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

0 1800 3600 5400 7200 9000 10800 12600 144001

Time (sec)

• 244 molecules 20497 jobs• 15091 seconds on 216 CPUs 867.1 CPU hours• Efficiency: 99.8%• Speedup: 206.9x 8.2x faster than GRAM/PBS• 50 molecules w/ GRAM (4201 jobs) 25.3 speedup

MolDyn Application

Scalable Resource Management in Clouds and Grids 2020

1100120013001400150016001700180019001

1000111001120011300114001150011600117001180011900120001

Task

ID

waitQueueTime execTime resultsQueueTime

Page 21: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

MARS Economic Modeling on IBM BG/P

• CPU Cores: 2048• Tasks: 49152• Micro-tasks: 7077888• Elapsed time: 1601 secs• CPU Hours: 894

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

21

• CPU Hours: 894• Speedup: 1993X (ideal 2048)• Efficiency: 97.3%

Page 22: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

MARS Economic Modeling on IBM BG/P (128K CPUs)

3500

4000

4500

5000

800000

1000000

ec)

s

ProcessorsActive TasksTasks CompletedThroughput (tasks/sec)

• CPU Cores: 130816• Tasks: 1048576• Elapsed time: 2483 secs• CPU Years: 9.3

Speedup: 115168X (ideal 130816)

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

22

0

500

1000

1500

2000

2500

3000

3500

0

200000

400000

600000

Thro

ughp

ut (t

asks

/se

Task

s C

ompl

eted

Num

ber o

f Pro

cess

ors

Time (sec)

Speedup: 115168X (ideal 130816)Efficiency: 88%

Page 23: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

23

Page 24: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Many Many Tasks:Identifying Potential Drug Targets

2M+ ligandsProtein xtarget(s)

(Mike Kubal, Benoit Roux, and others)24

Page 25: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

start

DOCK6Receptor

(1 per protein:defines pocket

to bind to)

ZINC3-D

structures

NAB scriptparameters

(defines flexibleresidues, #MDsteps)

BuildNABScript

NABScript

NABScript

Template

Amber prep:2. AmberizeReceptor4. perl: gen nabscript

FREDReceptor

(1 per protein:defines pocket

to bind to)

Manually prepDOCK6 rec file

Manually prepFRED rec file

1 protein(1MB)

6 GB2M

structures(6 GB)

DOCK6FRED ~4M x 60s x 1 cpu~60K cpu-hrs

PDBprotein

descriptions

Many Many Tasks:Identifying Potential Drug Targets

report ligands complexes

Amber Score:1. AmberizeLigand3. AmberizeComplex5. RunNABScript

end

Amber ~10K x 20m x 1 cpu~3K cpu-hrs

Select best ~500

~500 x 10hr x 100 cpu~500K cpu-hrsGCMC

Select best ~5KSelect best ~5K

For 1 target:4 million tasks

500,000 cpu-hrs(50 cpu-years)25

Page 26: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

DOCK on SiCortex

• CPU cores: 5760• Tasks: 92160• Elapsed time: 12821 sec• Compute time: 1.94 CPU years

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

26

• Average task time: 660.3 sec• Speedup: 5650X (ideal 5760)• Efficiency: 98.2%

Page 27: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

DOCK on the BG/P

CPU cores: 118784Tasks: 934803Elapsed time: 2.01 hoursCompute time: 21.43 CPU yearsAverage task time: 667 secRelative Efficiency: 99 7%

27

Relative Efficiency: 99.7%(from 16 to 32 racks)Utilization: • Sustained: 99.6%• Overall: 78.3%

Time (secs)

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

Page 28: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Data Diffusion

• Resource acquired in response to demand

• Data and applications diffuse from archival storage to newly acquired resources

text

Task DispatcherData-Aware Scheduler Persistent Storage

Shared File System

Idle Resources

text

Task DispatcherData-Aware Scheduler Persistent Storage

Shared File System

Idle Resources

28

• Resource “caching” allows faster responses to subsequent requests – Cache Eviction Strategies:

RANDOM, FIFO, LRU, LFU• Resources are released

when demand drops

Shared File System

Provisioned Resources

Shared File System

Provisioned Resources

Page 29: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Data Diffusion

• Considers both data and computations to optimize performance– Supports data-aware scheduling– Can optimize compute utilization, cache hit performance, or

29

a mixture of the two• Decrease dependency of a shared file system

– Theoretical linear scalability with compute resources– Significantly increases meta-data creation and/or

modification performance• Central for “data-centric task farm” realization

Page 30: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Data Diffusion:Good-cache-compute

1GB

1.5GB 0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0 001

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0 001

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

30

2GB

4GB

0.001

0

300

600

900

1200

1500

1800

2100

2400

2700

3000

3300

3600

Time (sec)

0

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0

300

600

900

1200

1500

1800

2100

2400

2700

3000

3300

3600

Time (sec)

0

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

012

024

036

048

060

072

084

096

010

8012

0013

2014

4015

60

Time (sec)

0

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

012

024

036

048

060

072

084

096

010

8012

0013

2014

4015

60

Time (sec)

0

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

Page 31: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Data Diffusion:Throughput and Response Time

Throughput:– Average: 14Gb/s vs 4Gb/s– Peak: 100Gb/s vs. 6Gb/s

1

10

100

Thro

ughp

ut (G

b/s)

Local Worker Caches (Gb/s)Remote Worker Caches (Gb/s)GPFS Throughput (Gb/s)

1

10

100

Thro

ughp

ut (G

b/s)

Local Worker Caches (Gb/s)Remote Worker Caches (Gb/s)GPFS Throughput (Gb/s)

31

0.1Ideal first-

availablegood-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit,

4GB

max-compute-util, 4GB

T

0.1Ideal first-

availablegood-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit,

4GB

max-compute-util, 4GB

T

1084

230 287

3.1

114

1569

3.4

1

10

100

1000

10000

first-available

good-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit, 4GB

max-compute-util, 4GB

Ave

rage

Res

pons

e Ti

me

(sec

) 1084

230 287

3.1

114

1569

3.4

1

10

100

1000

10000

first-available

good-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit, 4GB

max-compute-util, 4GB

Ave

rage

Res

pons

e Ti

me

(sec

)

Response Time – 3 sec vs 1569 sec 506X

Page 32: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Sin-Wave Workload

• GPFS 5.7 hrs, ~8Gb/s, 1138 CPU hrs• DF+SRP 1.8 hrs, ~25Gb/s, 361 CPU hrs• DF+DRP 1.86 hrs, ~24Gb/s, 253 CPU hrs

32

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)

Time (sec)Cache Miss % Cache Hit Global % Cache Hit Local %Demand (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)

Time (sec)Cache Miss % Cache Hit Global % Cache Hit Local %Demand (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

%

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)

Time (sec)Cache Miss % Cache Hit Global % Cache Hit Local %Demand (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

%

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)

Time (sec)Cache Miss % Cache Hit Global % Cache Hit Local %Demand (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

Page 33: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

All-Pairs Workload500x500 on 200 CPUs

60%70%80%90%100%

4856647280

Mis

s

(Gb/

s)

60%70%80%90%100%

4856647280

Mis

s

(Gb/

s)

Efficiency: 75%

33

0%10%20%30%40%50%

08

16243240

Cach

e Hi

t/

Thro

ughp

ut (

Time (sec)

Cache Miss %Cache Hit Global %Cache Hit Local %Throughput (Data Diffusion)Maximum Throughput (GPFS)Maximum Throughput (Local Disk)

0%10%20%30%40%50%

08

16243240

Cach

e Hi

t/

Thro

ughp

ut (

Time (sec)

Cache Miss %Cache Hit Global %Cache Hit Local %Throughput (Data Diffusion)Maximum Throughput (GPFS)Maximum Throughput (Local Disk)

Page 34: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Limitations of Data Diffusion

• Needs Java 1.4+• Needs IP connectivity between hosts• Needs local storage (disk, memory, etc)• Per task workings set must fit in local storage• Task definition must include input/output files

metadata• Data access patterns: write once, read many

34

Page 35: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

Mythbusting

• Embarrassingly Happily parallel apps are trivial to run– Logistical problems can be tremendous

• Loosely coupled apps do not require “supercomputers”– Total computational requirements can be enormous– Individual tasks may be tightly coupled

W kl d f tl i l l t f I/O

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

35

– Workloads frequently involve large amounts of I/O– Make use of idle resources from “supercomputers” via backfilling – Costs to run “supercomputers” per FLOP is among the best

• BG/P: 0.35 gigaflops/watt (higher is better)• SiCortex: 0.32 gigaflops/watt• BG/L: 0.23 gigaflops/watt• x86-based HPC systems: an order of magnitude lower

• Loosely coupled apps do not require specialized system software• Shared file systems are good for all applications

– They don’t scale proportionally with the compute resources– Data intensive applications don’t perform and scale well

Page 36: Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK …iraicu/research/presentations/2008_SC08... · 2008. 11. 19. · in Clusters/Grids 10000 100000 1000000 o ughput

More Information

• More information: http://people.cs.uchicago.edu/~iraicu/• Related Projects:

– Falkon: http://dev.globus.org/wiki/Incubator/Falkon– Swift: http://www.ci.uchicago.edu/swift/index.php

• Funding:NASA: Ames Research Center Graduate Student Research Program

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

36

– NASA: Ames Research Center, Graduate Student Research Program• Jerry C. Yan, NASA GSRP Research Advisor

– DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy

– NSF: TeraGrid