TRANSCRIPT
Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers
Ioan Raicu, Distributed Systems Laboratory, Computer Science Department
University of Chicago
In Collaboration with: Ian Foster, Mike Wilde, Zhao Zhang, Catalin Dumitrescu, Yong Zhao, Pete Beckman, Kamil Iskra, Alex Szalay, Philip Little, Christopher Moretti, Amitabh Chaudhary, Douglas Thain, Ben Clifford, plus others…
IEEE/ACM Supercomputing 2008, Argonne National Laboratory Booth
November 19th, 2008
Problem Types
[Figure: problem-type quadrants. Axes: input data size (Low, Med, Hi) vs. number of tasks (1, 1K, 1M). Regions: Heroic MPI Tasks; Many Loosely Coupled Apps; Data Analysis, Mining; Big Data and Many Tasks.]
MTC: Many Task Computing
• Bridge the gap between HPC and HTC
• Loosely coupled applications with HPC orientations
• HPC comprising multiple distinct activities, coupled via file system operations or message passing
• Emphasis on many resources over short time periods
• Tasks can be:
  – small or large, independent or dependent, uniprocessor or multiprocessor, compute-intensive or data-intensive, static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled; large numbers of tasks, large quantities of computing, and large volumes of data (a minimal sketch of the pattern follows below)
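To make the MTC pattern concrete, here is a minimal sketch, assuming nothing about Falkon's actual API: a large bag of short, independent tasks that communicate only through files, driven by a generic process pool. The `./analyze` program and the in/out paths are hypothetical placeholders.

```python
# Hypothetical sketch of a many-task workload: many short, independent
# tasks coupled only through input/output files.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_task(i: int) -> str:
    out = Path(f"out/task_{i}.txt")            # per-task output file
    out.parent.mkdir(parents=True, exist_ok=True)
    # each task is a short external program that reads and writes files
    subprocess.run(["./analyze", f"in/chunk_{i}.dat", str(out)], check=True)
    return str(out)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:        # stand-in for a real task dispatcher
        results = list(pool.map(run_task, range(100_000)))
    print(f"completed {len(results)} tasks")
```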
Obstacles running MTC apps in Clusters/Grids
[Figure: aggregate throughput (Mb/s, log scale up to 1,000,000) vs. number of nodes (1 to 1000) for GPFS read, local read, GPFS read+write, and local read+write.]
System | Comments | Throughput (tasks/sec)
Condor (v6.7.2) - Production | Dual Xeon 2.4GHz, 4GB | 0.49
PBS (v2.1.8) - Production | Dual Xeon 2.4GHz, 4GB | 0.45
Condor (v6.7.2) - Production | Quad Xeon 3 GHz, 4GB | 2
Condor (v6.8.2) - Production | | 0.42
Condor (v6.9.3) - Development | | 11
Condor-J2 - Experimental | Quad Xeon 3 GHz, 4GB | 22
Solutions
• Falkon: a Fast and Light-weight tasK executiON framework
  – Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters
  – Combines three components (a pilot-job provisioning sketch follows this list):
    • a streamlined task dispatcher
    • resource provisioning through multi-level scheduling techniques
    • data diffusion and data-aware scheduling to leverage the co-located computational and storage resources
• Swift: a parallel programming system for loosely coupled applications
  – Applications cover many domains: astronomy, astrophysics, medicine, chemistry, economics, climate modeling, data analytics
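A hedged sketch of the multi-level scheduling idea referenced above (not Falkon's actual provisioner code): instead of pushing every task through the batch scheduler, the provisioner submits a handful of pilot jobs that each start an executor, and the executors then pull many small tasks directly from the dispatcher. The `start_executor.sh` script and the dispatcher URL are illustrative assumptions.

```python
# Illustrative pilot-job provisioning: acquire N nodes once via the batch
# scheduler, then reuse them for many short tasks (multi-level scheduling).
import subprocess

def provision_executors(num_nodes: int, walltime: str, dispatcher_url: str) -> None:
    """Submit pilot jobs; each pilot starts an executor that pulls tasks."""
    for _ in range(num_nodes):
        # one batch submission per node, instead of one per task
        subprocess.run(
            ["qsub", "-l", f"walltime={walltime}",
             "-v", f"DISPATCHER_URL={dispatcher_url}",
             "start_executor.sh"],              # hypothetical pilot script
            check=True,
        )

# Example: acquire 256 nodes for an hour; the dispatcher then streams
# hundreds of thousands of tasks to them without further batch-queue waits.
# provision_executors(256, "01:00:00", "tcp://login1:50001")
```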
Falkon Overview
Distributed Falkon Architecture
[Diagram: distributed Falkon architecture on the BG/P. The client and the provisioner (which acquires resources through Cobalt) run on the login nodes (x10); Dispatcher 1 … Dispatcher N run on the I/O nodes (x640); each dispatcher drives Executor 1 … Executor 256 on the compute nodes (x40K).]
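A minimal sketch of the dispatcher/executor pattern in the diagram above, assuming a simple JSON-over-TCP exchange rather than Falkon's real wire protocol: each executor connects to its dispatcher and repeatedly pulls a task, runs it, and reports the result.

```python
# Illustrative executor loop: pull a task, run it, send the result back.
# The JSON-over-TCP protocol here is an assumption, not Falkon's protocol.
import json
import socket
import subprocess

def executor_loop(dispatcher_host: str, port: int) -> None:
    with socket.create_connection((dispatcher_host, port)) as sock:
        f = sock.makefile("rw")
        while True:
            f.write(json.dumps({"msg": "get_task"}) + "\n")
            f.flush()
            task = json.loads(f.readline())
            if task.get("msg") == "shutdown":
                break
            # run the task's command line and capture its exit code
            rc = subprocess.call(task["cmd"], shell=True)
            f.write(json.dumps({"msg": "result", "id": task["id"], "rc": rc}) + "\n")
            f.flush()
```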
Dispatch Throughput
Falkon dispatch throughput, by executor implementation and system:

Executor implementation and system | Throughput (tasks/sec)
ANL/UC, Java, 200 CPUs, 1 service | 604
ANL/UC, C, 200 CPUs, 1 service | 2534
SiCortex, C, 5760 CPUs, 1 service | 3186
BlueGene/P, C, 4096 CPUs, 1 service | 1758
BlueGene/P, C, 163,840 CPUs, 640 services | 3071
System | Comments | Throughput (tasks/sec)
Condor (v6.7.2) - Production | Dual Xeon 2.4GHz, 4GB | 0.49
PBS (v2.1.8) - Production | Dual Xeon 2.4GHz, 4GB | 0.45
Condor (v6.7.2) - Production | Quad Xeon 3 GHz, 4GB | 2
Condor (v6.8.2) - Production | | 0.42
Condor (v6.9.3) - Development | | 11
Condor-J2 - Experimental | Quad Xeon 3 GHz, 4GB | 22
Efficiency
[Figure: efficiency (0–100%) vs. number of processors (256, 1024, 4096, 16384, 65536, 163840) for task lengths of 1, 2, 4, 8, 16, 32, 64, 128, and 256 seconds.]
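A rough way to read these curves (my own back-of-the-envelope model, not from the slides): if the dispatcher can start at most R tasks per second, then P processors running tasks of length t seconds can be kept at most min(1, R·t/P) busy, which is why short tasks lose efficiency at the largest scales.

```python
# Back-of-the-envelope model of dispatch-limited efficiency:
# P processors, tasks of t seconds, dispatcher rate R tasks/sec.
def efficiency(num_procs: int, task_len_s: float, dispatch_rate: float) -> float:
    return min(1.0, dispatch_rate * task_len_s / num_procs)

# Hypothetical example with R = 3000 tasks/sec (roughly the peak from the
# previous slide): 4-second tasks can saturate ~12K CPUs, while 256-second
# tasks can keep all 163,840 CPUs busy under this model.
for t in (1, 4, 32, 256):
    print(t, round(efficiency(163_840, t, dispatch_rate=3000), 3))
```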
Falkon Endurance Test
Falkon Monitoring
• Workload
  – 160K CPUs
  – 1M tasks
  – 60 sec per task
• 17.5K CPU hours in 7.5 min
• Throughput: 2312 tasks/sec
• 85% efficiency
Falkon Activity History (10 months)
Swift Architecture
[Diagram: a SwiftScript program is translated by the SwiftScript compiler into an abstract computation (specification); the execution engine (Karajan with the Swift runtime) handles scheduling, using a virtual data catalog and emitting provenance data to a provenance collector via Swift runtime callouts; provisioning is performed by the Falkon resource provisioner (including Amazon EC2 resources); launchers run application tasks (AppF1, AppF2) over files (file1, file2, file3) on virtual nodes, with status reporting back to the engine.]
Functional MRI (fMRI)
• Wide range of analyses
  – Testing, interactive analysis, production runs
  – Data mining
  – Parameter studies
Completed Milestones: fMRI Application
• GRAM vs. Falkon: 85%-90% lower run time
• GRAM/Clustering vs. Falkon: 40%-74% lower run time

Input data size (volumes) | GRAM (s) | GRAM/Clustering (s) | Falkon (s)
120 | 1239 | 456 | 120
240 | 2510 | 866 | 327
360 | 3683 | 992 | 546
480 | 4808 | 1123 | 678
B. Berriman, J. Good (Caltech); J. Jacob, D. Katz (JPL)
Completed Milestones: Montage Application
• GRAM/Clustering vs. Falkon: 57% lower application run time
• MPI* vs. Falkon: 4% higher application run time
• *MPI should be the lower bound
[Figure: per-component run time (s) for the Montage workflow (mProject, mDiff/Fit, mBackground, mAdd(sub), mAdd, total), comparing MPI and Falkon.]
Hadoop vs. Swift
• Classic MapReduce benchmarks
  – Word Count
  – Sort
• Swift performs similarly to or better than Hadoop (on 32 processors)
[Figure: Sort time (sec, log scale) vs. data size (10MB, 100MB, 1000MB) for Swift+Falkon (25, 83, 512 sec) and Hadoop (733 and 4285 sec at the larger sizes). Word Count time (sec, log scale) vs. data size (75MB, 350MB, 703MB) for Swift+PBS (221, 1143, 1795 sec) and Hadoop (863, 4688, 7860 sec).]
Molecular Dynamics
• Determination of free energies in aqueous solution
  – Antechamber – coordinates
  – Charmm – solution
  – Charmm – free energy
• 244 molecules → 20,497 jobs
• 15,091 seconds on 216 CPUs → 867.1 CPU hours
• Efficiency: 99.8%
• Speedup: 206.9x; 8.2x faster than GRAM/PBS
• 50 molecules w/ GRAM (4201 jobs): 25.3x speedup
MolDyn Application
[Figure: per-task timeline (task ID vs. time in sec), broken into waitQueueTime, execTime, and resultsQueueTime.]
MARS Economic Modeling on IBM BG/P
• CPU cores: 2048
• Tasks: 49,152
• Micro-tasks: 7,077,888
• Elapsed time: 1601 secs
• CPU hours: 894
• Speedup: 1993X (ideal 2048)
• Efficiency: 97.3%
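A quick sanity check of how these figures relate (my arithmetic, using only the numbers on this slide): efficiency is speedup divided by the ideal speedup, and cores times elapsed time bounds the CPU hours.

```python
# Cross-check the MARS BG/P numbers reported above.
cores, elapsed_s, speedup = 2048, 1601, 1993

cpu_hours = cores * elapsed_s / 3600   # ~911 if every core were busy the whole time; the slide reports 894
efficiency = speedup / cores           # 1993 / 2048 = 0.973 -> 97.3%, matching the slide

print(round(cpu_hours), round(efficiency, 3))
```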
MARS Economic Modeling on IBM BG/P (128K CPUs)
[Figure: number of processors, active tasks, tasks completed, and throughput (tasks/sec) over time (sec).]
• CPU cores: 130,816
• Tasks: 1,048,576
• Elapsed time: 2483 secs
• CPU years: 9.3
• Speedup: 115,168X (ideal 130,816)
• Efficiency: 88%
Many Many Tasks:Identifying Potential Drug Targets
• 2M+ ligands x protein target(s)
(Mike Kubal, Benoit Roux, and others)
[Workflow diagram: start; PDB protein descriptions (1 protein, ~1MB) are manually prepped into DOCK6 and FRED receptor files (1 per protein; defines the pocket to bind to); ZINC 3-D structures (2M structures, ~6 GB) are docked with DOCK6/FRED (~4M tasks x 60 s x 1 CPU, ~60K CPU-hours); NAB script parameters (flexible residues, number of MD steps) and a NAB script template feed BuildNABScript; Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript.]
[Workflow diagram, continued: select best ~5K; Amber scoring (1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript; ~10K tasks x 20 min x 1 CPU, ~3K CPU-hours); select best ~500; GCMC (~500 tasks x 10 hr x 100 CPUs, ~500K CPU-hours); report ligands and complexes; end.]
For 1 target: 4 million tasks, 500,000 CPU-hours (50 CPU-years)
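A quick check of the CPU-hour estimates in the pipeline above (my own arithmetic from the stage figures on the slide):

```python
# Rough CPU-hour budget for one target, from the stage estimates above.
dock  = 4_000_000 * 60 / 3600   # docking: ~4M tasks x 60 s      -> ~67K CPU-hours (~60K on the slide)
amber = 10_000 * 20 / 60        # Amber:   ~10K tasks x 20 min   -> ~3.3K CPU-hours (~3K on the slide)
gcmc  = 500 * 10 * 100          # GCMC:    ~500 tasks x 10 h x 100 CPUs -> 500K CPU-hours
print(round(dock), round(amber), gcmc, round(dock + amber + gcmc))
```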
DOCK on SiCortex
• CPU cores: 5760
• Tasks: 92,160
• Elapsed time: 12,821 sec
• Compute time: 1.94 CPU years
• Average task time: 660.3 sec
• Speedup: 5650X (ideal 5760)
• Efficiency: 98.2%
DOCK on the BG/P
• CPU cores: 118,784
• Tasks: 934,803
• Elapsed time: 2.01 hours
• Compute time: 21.43 CPU years
• Average task time: 667 sec
• Relative efficiency: 99.7% (from 16 to 32 racks)
• Utilization: sustained 99.6%, overall 78.3%
Data Diffusion
• Resources are acquired in response to demand
• Data and applications diffuse from archival storage to newly acquired resources
[Diagram: task dispatcher with a data-aware scheduler, persistent storage behind a shared file system, and idle vs. provisioned resources.]
• Resource “caching” allows faster responses to subsequent requests
  – Cache eviction strategies: RANDOM, FIFO, LRU, LFU (an LRU sketch follows below)
• Resources are released when demand drops
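A minimal sketch of one of the eviction strategies listed above (LRU), assuming a simple per-node file cache keyed by file name with a byte budget; this is illustrative, not Falkon's actual cache code.

```python
# Illustrative LRU file cache for a data-diffusion worker node.
from collections import OrderedDict

class LRUFileCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.files: "OrderedDict[str, int]" = OrderedDict()  # file name -> size in bytes
        self.used = 0

    def access(self, name: str, size: int) -> bool:
        """Return True on a cache hit; otherwise admit the file, evicting LRU entries."""
        if name in self.files:
            self.files.move_to_end(name)                 # mark as most recently used
            return True
        while self.used + size > self.capacity and self.files:
            _, evicted = self.files.popitem(last=False)  # evict least recently used
            self.used -= evicted
        self.files[name] = size
        self.used += size
        return False
```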
Data Diffusion
• Considers both data and computation to optimize performance
  – Supports data-aware scheduling (a scheduler sketch follows below)
  – Can optimize compute utilization, cache hit performance, or a mixture of the two
• Decreases dependency on a shared file system
  – Theoretical linear scalability with compute resources
  – Significantly increases metadata creation and/or modification performance
• Central to the “data-centric task farm” realization
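A hedged sketch of data-aware scheduling: send each task to the idle worker that already caches the largest share of its input files, falling back to any idle worker (which then pulls and caches the data). The data structures and scoring here are illustrative assumptions, not Falkon's scheduler.

```python
# Illustrative data-aware scheduler: prefer workers that already hold the task's inputs.
from typing import Dict, List, Set

def pick_worker(task_inputs: Set[str],
                worker_caches: Dict[str, Set[str]],
                idle_workers: List[str]) -> str:
    """Return the idle worker caching the most input files (ties/empty cache -> first idle)."""
    def cached_count(worker: str) -> int:
        return len(task_inputs & worker_caches.get(worker, set()))
    return max(idle_workers, key=cached_count)

# Example: worker "n2" already caches both inputs, so the task goes there
# and reads locally instead of hitting the shared file system.
caches = {"n1": {"a.dat"}, "n2": {"a.dat", "b.dat"}, "n3": set()}
print(pick_worker({"a.dat", "b.dat"}, caches, ["n1", "n2", "n3"]))   # -> n2
```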
Data Diffusion: Good-cache-compute
[Figure: four panels for the good-cache-compute policy with per-node cache sizes of 1GB, 1.5GB, 2GB, and 4GB. Each panel plots, over time (sec) and on a log scale, the number of nodes, aggregate throughput (Gb/s, ideal and achieved), throughput per node (MB/s), and wait queue length (x1K), against cache hit local %, cache hit global %, and cache miss %.]
Data Diffusion: Throughput and Response Time
• Throughput:
  – Average: 14 Gb/s vs. 4 Gb/s
  – Peak: 100 Gb/s vs. 6 Gb/s
[Figure: throughput (Gb/s, log scale) from local worker caches, remote worker caches, and GPFS, for the ideal case and for the first-available, good-cache-compute (1GB, 1.5GB, 2GB, 4GB), max-cache-hit (4GB), and max-compute-util (4GB) policies.]
[Figure: average response time (sec, log scale) per policy; labeled values: 1084, 230, 287, 3.1, 114, 1569, 3.4.]
• Response time: 3 sec vs. 1569 sec (506X)
Sin-Wave Workload
• GPFS: 5.7 hrs, ~8 Gb/s, 1138 CPU hrs
• DF+SRP: 1.8 hrs, ~25 Gb/s, 361 CPU hrs
• DF+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU hrs
[Figure: sin-wave workload over time (sec): cache miss %, cache hit global %, cache hit local %, demand (Gb/s), achieved throughput (Gb/s), wait queue length (x1K), and number of nodes.]
All-Pairs Workload: 500x500 on 200 CPUs
Efficiency: 75%
[Figure: over time (sec): cache miss %, cache hit global %, cache hit local %, throughput with data diffusion, maximum GPFS throughput, and maximum local-disk throughput.]
Limitations of Data Diffusion
• Needs Java 1.4+
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Per-task working set must fit in local storage
• Task definition must include input/output file metadata
• Data access patterns: write once, read many
Mythbusting
• Embarrassingly (“happily”) parallel apps are trivial to run
  – Logistical problems can be tremendous
• Loosely coupled apps do not require “supercomputers”
  – Total computational requirements can be enormous
  – Individual tasks may be tightly coupled
  – Workloads frequently involve large amounts of I/O
  – Make use of idle resources from “supercomputers” via backfilling
  – Cost per FLOP to run “supercomputers” is among the best
    • BG/P: 0.35 gigaflops/watt (higher is better)
    • SiCortex: 0.32 gigaflops/watt
    • BG/L: 0.23 gigaflops/watt
    • x86-based HPC systems: an order of magnitude lower
• Loosely coupled apps do not require specialized system software
• Shared file systems are good for all applications
  – They don’t scale proportionally with the compute resources
  – Data-intensive applications don’t perform and scale well
More Information
• More information: http://people.cs.uchicago.edu/~iraicu/• Related Projects:
– Falkon: http://dev.globus.org/wiki/Incubator/Falkon– Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:NASA: Ames Research Center Graduate Student Research Program
  – NASA: Ames Research Center, Graduate Student Research Program
    • Jerry C. Yan, NASA GSRP Research Advisor
  – DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid