falkon, a fast and lightfalkon, a fast and light--weight weight task...

Falkon, a Fast and LightFalkon, a Fast and Light--weight weight tasK executiON framework for tasK executiON framework for

Clusters, Grids, and Clusters, Grids, and SupercomputersSupercomputersSupercomputersSupercomputers

Ioan RaicuDistributed Systems LaboratoryComputer Science Department

University of Chicago

In Collaboration with: Ian Foster, Mike Wilde, Zhao Zhang, Catalin Dumitrescu, Yong Zhao, Pete Beckman, Kamil Iskra, Alex Szalay, Philip Little, Christopher Moretti, Amitabh Chaudhary, Douglas Thain, Ben Clifford, plus others…

IEEE/ACM Supercomputing 2008Argonne National Laboratory Booth

November 19th, 2008

Problem Types

Input Data Size

Hi

Data Analysis, Mining

Big Data and Many Tasks

Input Data Size

Hi

Data Analysis, Mining

Big Data and Many Tasks

2Number of Tasks

Med

Low1 1K 1M

Heroic MPI

Tasks Many Loosely Coupled Apps

Number of Tasks

Med

Low1 1K 1M

Heroic MPI

Tasks Many Loosely Coupled Apps

MTC: Many Task Computing

• Bridge the gap between HPC and HTC• Loosely coupled applications with HPC orientations• HPC comprising of multiple distinct activities, coupled

via file system operations or message passingvia file system operations or message passing• Emphasis on many resources over short time periods• Tasks can be:

– small or large, independent and dependent, uniprocessor or multiprocessor, compute-intensive or data-intensive, static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled, large number of tasks, large quantity of computing, and large volumes of data… 3

Obstacles running MTC apps in Clusters/Grids

10000

100000

1000000

ough

put (

Mb/

s)

GPFS RLOCAL RGPFS R+WLOCAL R+W

10000

100000

1000000

ough

put (

Mb/

s)

GPFS RLOCAL RGPFS R+WLOCAL R+W

Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers

4

System Comments Throughput (tasks/sec)

Condor (v6.7.2) - Production Dual Xeon 2.4GHz, 4GB 0.49PBS (v2.1.8) - Production Dual Xeon 2.4GHz, 4GB 0.45

Condor (v6.7.2) - Production Quad Xeon 3 GHz, 4GB 2Condor (v6.8.2) - Production 0.42

Condor (v6.9.3) - Development 11Condor-J2 - Experimental Quad Xeon 3 GHz, 4GB 22





100

1000

1 10 100 1000Number of Nodes

Thro

100

1000

1 10 100 1000Number of Nodes

Thro

Solutions

• Falkon: A Fast and Light-weight tasK executiON framework– Goal: enable the rapid and efficient execution of many independent

jobs on large compute clusters– Combines three components:

• A streamlined task dispatcher• Resource provisioning through multi-level scheduling techniques


5

Resource provisioning through multi level scheduling techniques• Data diffusion and data-aware scheduling to leverage the co-located

computational and storage resources

• Swift: A parallel programming system for loosely coupled applications– Applications cover many domains: Astronomy, astro-physics, medicine,

chemistry, economics, climate modeling, data analytics

Falkon Overview


6

Distributed Falkon Architecture

Dispatcher1

Executor1

Client

Login Nodes(x10)

I/O Nodes(x640)

Compute Nodes (x40K)

7

Provisioner

Cobalt

ClientExecutor

256

DispatcherN

Executor1

Executor256

Dispatch Throughput

30003500400045005000

2534

3186 3071

(task

s/se

c)

30003500400045005000

2534

3186 3071

(task

s/se

c)

8

0500

10001500200025003000

ANL/UC, Java 200 CPUs 1 service

ANL/UC, C 200 CPUs1 service

SiCortex, C 5760 CPUs

1 service

BlueGene/P, C 4096 CPUs

1 service

BlueGene/P, C163840 CPUs 640 services

604

1758

Thro

ughp

ut (

Executor Implementation and Various Systems

0500

10001500200025003000

ANL/UC, Java 200 CPUs 1 service

ANL/UC, C 200 CPUs1 service

SiCortex, C 5760 CPUs

1 service

BlueGene/P, C 4096 CPUs

1 service

BlueGene/P, C163840 CPUs 640 services

604

1758

Thro

ughp

ut (

Executor Implementation and Various SystemsRunning 1 Million Jobs in 10 Minutes via the Falkon Fast and Light-weight tasK executiON framework









Efficiency

90%

100%30%

40%

50%

60%

70%

80%

90%

100%

Effic

ienc

y

32 seconds16 seconds8 seconds4 seconds2 seconds1 second

90%

10%

20%

30%

40%

50%

60%

70%

80%

90%

256 1024 4096 16384 65536 163840

Effic

ienc

y

Number of Processors

256 seconds128 seconds64 seconds32 seconds16 seconds8 seconds4 seconds2 seconds1 second

0%

10%

20%

Number of Processors

Falkon Endurance Test


10

Falkon Monitoring

• Workload• 160K CPUs

1M tasks


11

• 1M tasks• 60 sec per task

• 17.5K CPU hours in 7.5 min• Throughput: 2312 tasks/sec• 85% efficiency

Falkon Activity History (10 months)


12

Virtual Node(s)Abstractcomputation

S iftS i t

Specification Execution

Virtual Node(s)

file1

Scheduling

Execution Engine(Karajan w/

Swift Runtime)

Swift Architecture

Provisioning

FalkonResource

Provisioner

SwiftScript

Virtual DataCatalog

SwiftScriptCompiler

Provenancedata

ProvenancedataProvenance

collector

launcher

launcher

file2

file3

AppF1

AppF2

Swift runtimecallouts

CC CC

Status reportingAmazon

EC2

13

Functional MRI (fMRI)

Scalable Resource Management in Clouds and Grids 14

• Wide range of analyses– Testing, interactive analysis,

production runs– Data mining– Parameter studies

Completed Milestones: fMRI Application

48085000

6000GRAMGRAM/Clustering

• GRAM vs. Falkon: 85%~90% lower run time• GRAM/Clustering vs. Falkon: 40%~74% lower run time

Scalable Resource Management in Clouds and Grids 15Falkon: a Fast and Light-weight tasK executiON framework

1239

2510

3683

456866 992 1123

120 327 546 678

0

1000

2000

3000

4000

120 240 360 480Input Data Size (Volumes)

Tim

e (s

)

Falkon

B. Berriman, J. Good (Caltech)J. Jacob, D. Katz (JPL)


Completed Milestones: Montage Application

3000

3500

GRAM/ClusteringMPI

• GRAM/Clustering vs. Falkon: 57% lower application run time• MPI* vs. Falkon: 4% higher application run time• * MPI should be lower bound

Scalable Resource Management in Clouds and Grids 17Falkon: a Fast and Light-weight tasK executiON framework

0

500

1000

1500

2000

2500

mProjec

t

mDiff/Fit

mBackg

round

mAdd(su

b)

mAdd total

Components

Tim

e (s

)

MPIFalkon

Hadoop vs. Swift

• Classic benchmarks for MapReduce– Word Count– Sort

• Swift performs similar or better than Hadoop


p p(on 32 processors)

Sort

4285

733

25

83

512

1

10

100

1000

10000

10MB 100MB 1000MBData Size

Tim

e (s

ec)

Swift+Falkon

Hadoop

Word Count

221

11431795

863

46887860

1

10

100

1000

10000

75MB 350MB 703MBData Size

Tim

e (s

ec)

Swift+PBSHadoop

Molecular Dynamics

• Determination of free energies in aqueous solution– Antechamber – coordinates

Charmm solution– Charmm – solution– Charmm - free energy

19Scalable Resource Management in Clouds and Grids

0 1800 3600 5400 7200 9000 10800 12600 144001

Time (sec)

• 244 molecules 20497 jobs• 15091 seconds on 216 CPUs 867.1 CPU hours• Efficiency: 99.8%• Speedup: 206.9x 8.2x faster than GRAM/PBS• 50 molecules w/ GRAM (4201 jobs) 25.3 speedup

MolDyn Application


1100120013001400150016001700180019001

1000111001120011300114001150011600117001180011900120001

Task

ID

waitQueueTime execTime resultsQueueTime

MARS Economic Modeling on IBM BG/P

• CPU Cores: 2048• Tasks: 49152• Micro-tasks: 7077888• Elapsed time: 1601 secs• CPU Hours: 894


21

• CPU Hours: 894• Speedup: 1993X (ideal 2048)• Efficiency: 97.3%

MARS Economic Modeling on IBM BG/P (128K CPUs)

3500

4000

4500

5000

800000

1000000

ec)

s

ProcessorsActive TasksTasks CompletedThroughput (tasks/sec)

• CPU Cores: 130816• Tasks: 1048576• Elapsed time: 2483 secs• CPU Years: 9.3

Speedup: 115168X (ideal 130816)


22

0

500

1000

1500

2000

2500

3000

3500

0

200000

400000

600000

Thro

ughp

ut (t

asks

/se

Task

s C

ompl

eted

Num

ber o

f Pro

cess

ors

Time (sec)

Speedup: 115168X (ideal 130816)Efficiency: 88%


23

Many Many Tasks:Identifying Potential Drug Targets

2M+ ligandsProtein xtarget(s)

(Mike Kubal, Benoit Roux, and others)24

start

DOCK6Receptor

(1 per protein:defines pocket

to bind to)

ZINC3-D

structures

NAB scriptparameters

(defines flexibleresidues, #MDsteps)

BuildNABScript

NABScript

NABScript

Template

Amber prep:2. AmberizeReceptor4. perl: gen nabscript

FREDReceptor

(1 per protein:defines pocket

to bind to)

Manually prepDOCK6 rec file

Manually prepFRED rec file

1 protein(1MB)

6 GB2M

structures(6 GB)

DOCK6FRED ~4M x 60s x 1 cpu~60K cpu-hrs

PDBprotein

descriptions

Many Many Tasks:Identifying Potential Drug Targets

report ligands complexes

Amber Score:1. AmberizeLigand3. AmberizeComplex5. RunNABScript

end

Amber ~10K x 20m x 1 cpu~3K cpu-hrs

Select best ~500

~500 x 10hr x 100 cpu~500K cpu-hrsGCMC

Select best ~5KSelect best ~5K

For 1 target:4 million tasks

500,000 cpu-hrs(50 cpu-years)25

DOCK on SiCortex

• CPU cores: 5760• Tasks: 92160• Elapsed time: 12821 sec• Compute time: 1.94 CPU years


26

• Average task time: 660.3 sec• Speedup: 5650X (ideal 5760)• Efficiency: 98.2%

DOCK on the BG/P

CPU cores: 118784Tasks: 934803Elapsed time: 2.01 hoursCompute time: 21.43 CPU yearsAverage task time: 667 secRelative Efficiency: 99 7%

27

Relative Efficiency: 99.7%(from 16 to 32 racks)Utilization: • Sustained: 99.6%• Overall: 78.3%

Time (secs)


Data Diffusion

• Resource acquired in response to demand

• Data and applications diffuse from archival storage to newly acquired resources

text

Task DispatcherData-Aware Scheduler Persistent Storage

Shared File System

Idle Resources

text

Task DispatcherData-Aware Scheduler Persistent Storage

Shared File System

Idle Resources

28

• Resource “caching” allows faster responses to subsequent requests – Cache Eviction Strategies:

RANDOM, FIFO, LRU, LFU• Resources are released

when demand drops

Shared File System

Provisioned Resources

Shared File System

Provisioned Resources

Data Diffusion

• Considers both data and computations to optimize performance– Supports data-aware scheduling– Can optimize compute utilization, cache hit performance, or

29

a mixture of the two• Decrease dependency of a shared file system

– Theoretical linear scalability with compute resources– Significantly increases meta-data creation and/or

modification performance• Central for “data-centric task farm” realization

Data Diffusion:Good-cache-compute

1GB

1.5GB 0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0 001

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

0 001

0.01

0.1

1

10

100

1000

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%

30

2GB

4GB

0.001

0

300

600

900

1200

1500

1800

2100

2400

2700

3000

3300

3600

Time (sec)

0

Cache Hit Local % Cache Hit Global % Cache Miss %Ideal Throughput (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0.001

0

300

600

900

1200

1500

1800

2100

2400

2700

3000

3300

3600

Time (sec)

0


0.001

012

024

036

048

060

072

084

096

010

8012

0013

2014

4015

60

Time (sec)

0


0.001

012

024

036

048

060

072

084

096

010

8012

0013

2014

4015

60

Time (sec)

0


0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%


0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%


0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%


0.001

0.01

0.1

1

10

100

1000

0

120

240

360

480

600

720

840

960

1080

1200

1320

1440

Time (sec)

Num

ber o

f Nod

esA

ggre

gate

Thr

ough

put (

Gb/

s)Th

roug

hput

per

Nod

e (M

B/s

)Q

ueue

Len

gth

(x1K

)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cac

he H

it/M

iss

%


Data Diffusion:Throughput and Response Time

Throughput:– Average: 14Gb/s vs 4Gb/s– Peak: 100Gb/s vs. 6Gb/s

1

10

100

Thro

ughp

ut (G

b/s)

Local Worker Caches (Gb/s)Remote Worker Caches (Gb/s)GPFS Throughput (Gb/s)

1

10

100

Thro

ughp

ut (G

b/s)

Local Worker Caches (Gb/s)Remote Worker Caches (Gb/s)GPFS Throughput (Gb/s)

31

0.1Ideal first-

availablegood-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit,

4GB

max-compute-util, 4GB

T

0.1Ideal first-

availablegood-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit,

4GB


T

1084

230 287

3.1

114

1569

3.4

1

10

100

1000

10000

first-available

good-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit, 4GB


Ave

rage

Res

pons

e Ti

me

(sec

) 1084

230 287

3.1

114

1569

3.4

1

10

100

1000

10000

first-available

good-cache-

compute,1GB

good-cache-

compute,1.5GB

good-cache-

compute,2GB

good-cache-

compute,4GB

max-cache-hit, 4GB


Ave

rage

Res

pons

e Ti

me

(sec

)

Response Time – 3 sec vs 1569 sec 506X

Sin-Wave Workload

• GPFS 5.7 hrs, ~8Gb/s, 1138 CPU hrs• DF+SRP 1.8 hrs, ~25Gb/s, 361 CPU hrs• DF+DRP 1.86 hrs, ~24Gb/s, 253 CPU hrs

32

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)

Time (sec)Cache Miss % Cache Hit Global % Cache Hit Local %Demand (Gb/s) Throughput (Gb/s) Wait Queue LengthNumber of Nodes

0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)


0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

%

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)


0%10%20%30%40%50%60%70%80%90%100%

0

20

40

60

80

100

120

Cac

he H

it/M

iss

%

Num

ber o

f Nod

esTh

roug

hput

(Gb/

s)Q

ueue

Len

gth

(x1K

)


All-Pairs Workload500x500 on 200 CPUs

60%70%80%90%100%

4856647280

Mis

s

(Gb/

s)

60%70%80%90%100%

4856647280

Mis

s

(Gb/

s)

Efficiency: 75%

33

0%10%20%30%40%50%

08

16243240

Cach

e Hi

t/

Thro

ughp

ut (

Time (sec)

Cache Miss %Cache Hit Global %Cache Hit Local %Throughput (Data Diffusion)Maximum Throughput (GPFS)Maximum Throughput (Local Disk)

0%10%20%30%40%50%

08

16243240

Cach

e Hi

t/

Thro

ughp

ut (

Time (sec)

Cache Miss %Cache Hit Global %Cache Hit Local %Throughput (Data Diffusion)Maximum Throughput (GPFS)Maximum Throughput (Local Disk)

Limitations of Data Diffusion

• Needs Java 1.4+• Needs IP connectivity between hosts• Needs local storage (disk, memory, etc)• Per task workings set must fit in local storage• Task definition must include input/output files

metadata• Data access patterns: write once, read many

34

Mythbusting

• Embarrassingly Happily parallel apps are trivial to run– Logistical problems can be tremendous

• Loosely coupled apps do not require “supercomputers”– Total computational requirements can be enormous– Individual tasks may be tightly coupled

W kl d f tl i l l t f I/O


35

– Workloads frequently involve large amounts of I/O– Make use of idle resources from “supercomputers” via backfilling – Costs to run “supercomputers” per FLOP is among the best

• BG/P: 0.35 gigaflops/watt (higher is better)• SiCortex: 0.32 gigaflops/watt• BG/L: 0.23 gigaflops/watt• x86-based HPC systems: an order of magnitude lower

• Loosely coupled apps do not require specialized system software• Shared file systems are good for all applications

– They don’t scale proportionally with the compute resources– Data intensive applications don’t perform and scale well

More Information

• More information: http://people.cs.uchicago.edu/~iraicu/• Related Projects:

– Falkon: http://dev.globus.org/wiki/Incubator/Falkon– Swift: http://www.ci.uchicago.edu/swift/index.php

• Funding:NASA: Ames Research Center Graduate Student Research Program


36

– NASA: Ames Research Center, Graduate Student Research Program• Jerry C. Yan, NASA GSRP Research Advisor

– DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy

– NSF: TeraGrid

falkon, a fast and lightfalkon, a fast and light--weight weight task...

Documents