Celebrating Diversity in Volunteer Computing
David P. Anderson, Space Sciences Lab, U.C. Berkeley
Sept. 1, 2008


Page 1:

Celebrating Diversity in Volunteer Computing

David P. Anderson
Space Sciences Lab
U.C. Berkeley

Sept. 1, 2008

Page 2:

Background

- Volunteer computing: distributed scientific computing using volunteered resources (desktops, laptops, game consoles, cell phones, etc.)
- BOINC: middleware for volunteer (and desktop grid) computing

Page 3:

Diversity of resources

- CPU type, number, speed
- RAM, disk
- Coprocessors
- OS type and version
- Network: performance, availability, proxies
- System availability
- Reliability: crashes, invalid results, cheating

Page 4:

Diversity of applications

- Resource requirements: CPU, coprocessors, RAM, storage, network
- Completion time constraints
- Numerical properties: same result on all CPUs; a little different; unboundedly different

Page 5:

IBM World Community Grid

- "Umbrella" project sponsored by IBM
- Rice genome study: Univ. of Washington
- Protein X-ray crystallography: Ontario Cancer Inst.
- African climate study: Univ. of Cape Town
- Dengue fever drug discovery: Univ. of Texas
- Human protein folding: NYU, Univ. of Washington
- HIV drug discovery: Scripps Research Institute
- Started Nov. 2004
- 390,000 volunteers
- 167,000 years of CPU time in total
- Currently ~170 TeraFLOPS

Page 6:

CPU type

[Bar chart: Number of Computers (log scale, 1 to 100,000) by processor type. Types include AMD Athlon, Athlon 64, Athlon FX, Athlon MP, Athlon X2, Athlon XP, Duron, Geode, K6, K7, Opteron, Phenom, and other AMD; Intel Celeron, Core 2, Core 2 Duo, Core 2 Quad, Core Duo, Pentium, Pentium 4, Pentium D, Pentium II, Pentium III, Pentium M, Xeon, and other Intel; Transmeta; CentaurHauls; IBM PowerPC.]

Page 7:

# cores

[Bar chart: Number of Computers (log scale, 1 to 100,000) by number of cores: 1, 2, 3, 4, 6, 8, 14, 16, 24, 32, 64.]

Page 8:

OS type

[Bar chart: Number of Computers (log scale, 1 to 1,000,000) by Type of Operating System.]

Page 9:

RAM

[Bar chart: Number of Computers (log scale) by RAM in MB, from 512 MB up to 16,384 MB.]

Page 10:

Free disk space

[Bar chart: Number of Computers (log scale) by Available Disk Space in GB, from 8 GB to 248 GB.]

Page 11:

Availability

[Histogram: Number of Computers (0 to 18,000) by Percent Available, 0 to 100%.]

Page 12:

Job error rate

[Histogram: Number of Computers (log scale) by Percent Error, 0 to 100%.]

Page 13:

Average turnaround time

[Histogram: Number of Computers (0 to 5,000) by average turnaround time in hours, roughly 0 to 520.]

Page 14:

Current WCG applications

Page 15:

Job dispatching

[Diagram: 1M jobs -> scheduler -> client.]

Goals:
- maximize system throughput
- minimize time to batch completion
- minimize time to grant credit
- scale to >100 requests/sec

Page 16:

BOINC scheduler architecture

[Diagram: Job queue (DB) -> Feeder -> Job cache (shared memory) -> Scheduler -> client.]

Issues (see the sketch below):
- What if the cache fills up with unsendable jobs?
- What if the client needs a job that is not in the cache?
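To make the cache-in-the-middle design concrete, here is a minimal C++ sketch (not the actual BOINC feeder) of a fixed-size job cache that a feeder keeps topped up from the job queue while scheduler processes take jobs from it; Job, JobCache, and feeder_pass are invented names, and an in-memory vector stands in for the database.

// Minimal sketch of the feeder / job-cache idea (illustrative, not BOINC's code).
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Job { int id; double est_flops; };

class JobCache {                 // stands in for the shared-memory segment
public:
    explicit JobCache(std::size_t capacity) : capacity_(capacity) {}
    bool full() const {
        std::lock_guard<std::mutex> l(m_);
        return jobs_.size() >= capacity_;
    }
    void add(const Job& j) {     // feeder side
        std::lock_guard<std::mutex> l(m_);
        jobs_.push_back(j);
    }
    std::optional<Job> take() {  // scheduler side: grab a cached job, if any
        std::lock_guard<std::mutex> l(m_);
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.front();
        jobs_.pop_front();
        return j;
    }
private:
    std::size_t capacity_;
    mutable std::mutex m_;
    std::deque<Job> jobs_;
};

// One feeder pass: move unsent jobs from the "DB" queue into the cache until it is full.
void feeder_pass(std::vector<Job>& db_queue, JobCache& cache) {
    while (!db_queue.empty() && !cache.full()) {
        cache.add(db_queue.back());
        db_queue.pop_back();
    }
}

With this split, scheduler requests are served from memory and only the feeder touches the database, which helps toward the >100 requests/sec goal.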

Page 17:

Homogeneous replication

- Different platforms do FP math differently, which makes result validation difficult
- Divide platforms into equivalence classes (Win/Intel, Win/AMD, etc.); send all instances of a job to a single class
- A "Census" program computes the class distribution
- Scheduler: send committed jobs if possible; otherwise the job remains uncommitted (sketch below)
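A minimal sketch of how hosts might be bucketed into equivalence classes and how a job stays within one class, assuming a simple OS/vendor classification; the names Host, Workunit, and hr_class are invented for illustration and are not BOINC's actual classes.

// Illustrative homogeneous-redundancy (HR) classes.
#include <string>

struct Host { std::string os; std::string cpu_vendor; };   // e.g. "Windows", "AMD"

// 0 means "any host"; nonzero values are equivalence classes.
int hr_class(const Host& h) {
    if (h.os == "Windows" && h.cpu_vendor == "Intel") return 1;
    if (h.os == "Windows" && h.cpu_vendor == "AMD")   return 2;
    if (h.os == "Linux"   && h.cpu_vendor == "Intel") return 3;
    if (h.os == "Linux"   && h.cpu_vendor == "AMD")   return 4;
    return 5;   // everything else lumped together
}

struct Workunit { int committed_class = 0; };   // 0 = not yet committed

// May an instance of this workunit be sent to this host?
bool hr_compatible(const Workunit& wu, const Host& h) {
    return wu.committed_class == 0 || wu.committed_class == hr_class(h);
}

// Commit the workunit to the host's class when its first instance is sent.
void hr_commit(Workunit& wu, const Host& h) {
    if (wu.committed_class == 0) wu.committed_class = hr_class(h);
}

Once a workunit is committed, all later replicas go to hosts in the same class, so their FP results should agree and validation stays simple.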

Page 18:

Retry acceleration

Retries are needed when:
- a job times out
- an error (crash) is returned
- results fail to validate

Send retries to hosts that are:
- fast (low turnaround)
- reliable

Shorten the latency bound of retries (sketch below).
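The following is a hedged C++ sketch of the retry-acceleration idea: when a replica must be re-sent, prefer fast and reliable hosts and give the retry a tighter deadline. The thresholds and the halving factor are assumptions for illustration, not BOINC's actual values.

// Illustrative retry targeting and deadline shortening.
#include <ctime>

struct HostStats {
    double avg_turnaround_secs;   // measured average turnaround
    double error_rate;            // fraction of recent jobs that errored or failed validation
};

// Is this host a good target for a retry?
bool fast_and_reliable(const HostStats& h) {
    const double MAX_TURNAROUND = 2 * 86400.0;   // assumed: under two days average turnaround
    const double MAX_ERROR_RATE = 0.05;          // assumed: under 5% errors
    return h.avg_turnaround_secs < MAX_TURNAROUND && h.error_rate < MAX_ERROR_RATE;
}

struct Result {
    std::time_t sent_time;
    std::time_t report_deadline;
    bool is_retry = false;
};

// When issuing a retry, shorten its latency bound so the batch finishes sooner.
void issue_retry(Result& r, double normal_delay_bound_secs) {
    r.is_retry = true;
    r.sent_time = std::time(nullptr);
    // assumed factor: give retries half the normal delay bound
    r.report_deadline = r.sent_time + static_cast<std::time_t>(0.5 * normal_delay_bound_secs);
}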

Page 19:

Volunteer app selection

- Volunteers can select apps
- They can opt to accept jobs from non-selected apps

Page 20:

Fast feasibility checks (no DB)

Client sends:
- hardware spec
- availability info
- list of jobs queued and in progress

Checks:
- resource checks
- completion time check: run an EDF simulation to see whether any deadlines would be missed (sketch below)
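Here is a rough C++ sketch of the EDF completion-time check: simulate the host running its queued jobs plus the candidate in deadline order and see whether anything would finish late. The model (a single pool of ncpus * availability CPU-seconds per wall second) is deliberately crude and only illustrates the idea.

// Illustrative EDF (earliest-deadline-first) feasibility simulation.
#include <algorithm>
#include <vector>

struct QueuedJob {
    double remaining_cpu_secs;   // estimated CPU time still needed
    double deadline_secs;        // seconds from now until its report deadline
};

// True if the existing queue plus the candidate job can all meet their deadlines.
bool edf_feasible(std::vector<QueuedJob> jobs, const QueuedJob& candidate,
                  int ncpus, double availability_frac) {
    double rate = ncpus * availability_frac;     // CPU-seconds delivered per wall second
    if (rate <= 0) return false;

    jobs.push_back(candidate);
    std::sort(jobs.begin(), jobs.end(),
              [](const QueuedJob& a, const QueuedJob& b) { return a.deadline_secs < b.deadline_secs; });

    double wall_clock = 0;
    for (const QueuedJob& j : jobs) {
        wall_clock += j.remaining_cpu_secs / rate;       // work through jobs in deadline order
        if (wall_clock > j.deadline_secs) return false;  // this job would miss its deadline
    }
    return true;
}

If the simulation predicts a missed deadline, the candidate job is infeasible for this host and the scheduler moves on, all without touching the database.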

Page 21:

Slow feasibility checks (DB)

- Is the job still needed?
- Has another replica already been sent to this volunteer?

Page 22:

Platform mechanism

- Jobs are associated with apps, not app versions
- [Diagram: an Application with app versions for Win/x86, Win/x64, and Linux/x86; jobs attach to the Application]
- The request message lists the client's platforms, most preferred first, e.g. platform 0: Win64, platform 1: Win32 (sketch below)
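A small sketch of how the scheduler might match an app version to the platform list in the request message; the structures and platform strings here are examples, not the scheduler's actual data layout.

// Illustrative platform matching: the client lists its platforms, best first,
// and we pick the first app version whose platform appears in that list.
#include <string>
#include <vector>

struct AppVersion {
    std::string platform;    // e.g. "Win64", "Win32", "Linux/x86"
    int version_num;
};

// Returns the chosen version, or nullptr if the client can't run any version of this app.
const AppVersion* select_version(const std::vector<std::string>& client_platforms,
                                 const std::vector<AppVersion>& versions) {
    for (const std::string& plat : client_platforms) {      // most preferred platform first
        for (const AppVersion& av : versions) {
            if (av.platform == plat) return &av;
        }
    }
    return nullptr;
}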

Page 23:

Host punishment

- The problem: hosts that error out all jobs
- Maintain M(h): max jobs per day for host h
- On each error, decrement M(h)
- On each valid job, double M(h) (sketch below)
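A minimal sketch of the M(h) bookkeeping; the starting quota and the recovery ceiling are assumptions added for illustration.

// Illustrative "max jobs per day" host punishment.
#include <algorithm>

struct HostQuota {
    int max_jobs_per_day = 100;   // M(h); assumed starting value
};

// Called when a job from this host errors out or fails validation.
void on_error(HostQuota& h) {
    if (h.max_jobs_per_day > 1) h.max_jobs_per_day--;   // decrement, but never below 1
}

// Called when a job from this host validates successfully.
void on_valid(HostQuota& h) {
    const int QUOTA_CEILING = 100;                      // assumed cap on recovery
    h.max_jobs_per_day = std::min(2 * h.max_jobs_per_day, QUOTA_CEILING);
}

A host that errors out everything quickly ratchets down to one job per day, while a single valid result lets a recovering host double its quota back up.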

Page 24:

Anonymous platform mechanism

- Rather than downloading apps from the server, the client has preexisting local apps
- Scheduler: if the client has its own apps, only send it jobs for those apps

Usage scenarios:
- computers with unsupported platforms
- people who optimize apps
- security-conscious people who want to inspect the source code

Page 25:

Old scheduling policy

Job cache scan:
- start from a random point
- do fast feasibility checks
- lock the job, do slow feasibility checks

Multiple scans:
- send jobs committed to an HR class
- if the host is fast, send retries
- send work for selected apps
- if allowed, send work for non-selected apps

Problems:
- rigid policy
- assumes one app instance == 1 CPU

Page 26:

Coprocessor and multi-thread apps

- How to select the best version for a given host?
- How to estimate performance on the host?

[Diagram: a Win/x86 app with single-threaded, multi-threaded, and CUDA versions.]

Page 27:

Multithread/coprocessor (cont.)

How to decide which app version to use?
- App versions have a "plan class" string
- The scheduler has a project-supplied function:
  bool app_plan(SCHEDULER_REQUEST& sreq, char* plan_class, HOST_USAGE&);
- It returns: whether the host can run the app, coprocessor usage, CPU usage (possibly fractional), expected FLOPS, and the cmdline to pass to the app
- It embodies knowledge about sublinear speedup, etc.
- Scheduler: call app_plan() for each version and use the one with the highest expected FLOPS (sketch below)
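The sketch below shows what a project-supplied app_plan() might look like for an assumed "cuda" and "mt" plan class; the fields of SCHEDULER_REQUEST and HOST_USAGE are simplified stand-ins for the real structures, and the efficiency and speedup numbers are illustrative assumptions.

// Illustrative app_plan() for two plan classes.
#include <algorithm>
#include <cstdio>
#include <cstring>

struct SCHEDULER_REQUEST {
    int ncpus;                 // cores on the host
    double cpu_flops;          // peak FLOPS of one core
    int ncuda_gpus;            // CUDA-capable GPUs reported by the client
    double cuda_flops;         // peak FLOPS of the GPU
};

struct HOST_USAGE {
    double ncudas = 0;         // coprocessor usage
    double avg_ncpus = 0;      // CPU usage (may be fractional)
    double projected_flops = 0;
    char cmdline[256] = "";
};

bool app_plan(SCHEDULER_REQUEST& sreq, char* plan_class, HOST_USAGE& hu) {
    if (!std::strcmp(plan_class, "cuda")) {
        if (sreq.ncuda_gpus < 1) return false;         // host can't run this version
        hu.ncudas = 1;
        hu.avg_ncpus = 0.1;                            // a little CPU to feed the GPU
        hu.projected_flops = 0.2 * sreq.cuda_flops;    // assumed GPU efficiency
        std::snprintf(hu.cmdline, sizeof(hu.cmdline), "--device 0");
        return true;
    }
    if (!std::strcmp(plan_class, "mt")) {
        int nthreads = std::min(sreq.ncpus, 8);        // assumed scaling limit
        hu.avg_ncpus = nthreads;
        // crude sublinear-speedup model: each extra thread adds 0.8 of a core (assumed)
        hu.projected_flops = sreq.cpu_flops * (1 + 0.8 * (nthreads - 1));
        std::snprintf(hu.cmdline, sizeof(hu.cmdline), "--nthreads %d", nthreads);
        return true;
    }
    // empty/unknown plan class: plain single-threaded version
    hu.avg_ncpus = 1;
    hu.projected_flops = sreq.cpu_flops;
    return true;
}

The scheduler would call this once per candidate version and keep the version with the highest projected_flops.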

Page 28:

Multithread/coprocessor (cont.)

Client coprocessor handling (currently just CUDA):
- hardware check/report
- scheduling (coprocessors are not timesliced)

CPU scheduling: run enough apps to use at least N cores (sketch below)
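Below is a speculative C++ sketch of the client-side rule "run enough apps to use at least N cores", with coprocessors treated as exclusive rather than timesliced; the Task fields are invented for the example.

// Illustrative client task selection.
#include <vector>

struct Task {
    double avg_ncpus;     // CPUs this app version uses (may be fractional)
    int ncudas;           // GPUs it needs (0 for CPU-only apps)
    bool running = false;
};

// Start tasks until at least ncpus worth of CPU is busy; a GPU goes to at most
// one running task at a time (coprocessors are not timesliced).
void schedule_tasks(std::vector<Task>& tasks, int ncpus, int ngpus) {
    double cpus_used = 0;
    int gpus_used = 0;
    for (Task& t : tasks) {
        if (cpus_used >= ncpus) break;                               // enough cores are busy
        if (t.ncudas > 0 && gpus_used + t.ncudas > ngpus) continue;  // GPU already taken: skip
        t.running = true;
        cpus_used += t.avg_ncpus;
        gpus_used += t.ncudas;
    }
}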

Page 29:

Score-based scheduling

[Diagram: from the feasible jobs, take a random set of N, rank them by score, and send the M highest-scoring jobs.] (Sketch below.)
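A minimal sketch of the selection step pictured above: sample N feasible jobs, rank them with a score function, and send the top M. The names are invented for illustration; a concrete score function is sketched after the job-size-matching slide.

// Illustrative score-based job selection.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

struct CachedJob { int id; bool is_retry; double size_stdevs; };

std::vector<CachedJob> pick_jobs(std::vector<CachedJob> feasible,
                                 const std::function<double(const CachedJob&)>& score,
                                 std::size_t N, std::size_t M) {
    // 1. random sample of up to N feasible jobs
    std::shuffle(feasible.begin(), feasible.end(), std::mt19937{std::random_device{}()});
    if (feasible.size() > N) feasible.resize(N);

    // 2. rank by score, highest first
    std::sort(feasible.begin(), feasible.end(),
              [&](const CachedJob& a, const CachedJob& b) { return score(a) > score(b); });

    // 3. send the M highest-scoring jobs
    if (feasible.size() > M) feasible.resize(M);
    return feasible;
}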

Page 30:

Terms in the score function

Bonus if:
- the host is fast and the job is a retry
- the job is committed to an HR class
- the app was selected by the volunteer

Page 31:

Job size matching

Goal: send large jobs to fast hosts, small jobs to slow hosts
- reduce credit-granting delay
- reduce server occupancy time

- The census program maintains host statistics
- The feeder maintains job size statistics
- Score penalty: |job - host|^2 (sketch below)
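Putting the previous two slides together, here is an illustrative score function with the bonus terms and the size-matching penalty; the weights are assumptions, not the scheduler's actual values, and job size and host speed are assumed to be expressed in standard deviations from the mean (as the feeder and census statistics could provide).

// Illustrative score function: bonuses plus the |job - host|^2 size penalty.
struct JobTraits {
    bool is_retry;
    bool committed_to_host_hr_class;
    bool app_selected_by_volunteer;
    double size_stdevs;     // job size, in std deviations from the mean (feeder statistics)
};

struct HostTraits {
    bool fast_and_reliable;
    double speed_stdevs;    // host speed, in std deviations from the mean (census statistics)
};

double job_score(const JobTraits& job, const HostTraits& host) {
    double score = 0;
    if (host.fast_and_reliable && job.is_retry) score += 1.0;   // get retries back quickly
    if (job.committed_to_host_hr_class)         score += 1.0;   // finish committed workunits
    if (job.app_selected_by_volunteer)          score += 1.0;   // respect volunteer preferences
    double diff = job.size_stdevs - host.speed_stdevs;          // size matching
    score -= diff * diff;                                       // the |job - host|^2 penalty
    return score;
}

Large jobs on slow hosts (and small jobs on fast hosts) take a large squared penalty, so they tend to lose to better-matched jobs in the ranking.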

Page 32:

Adaptive replication

Goal: achieve a target level of reliability while reducing replication to 1+ε

Idea: replicate less (but always some) as a host becomes more trusted

Policy: maintain an "invalid rate" E(h) per host
- if E(h) > X, replicate (e.g., 2-fold)
- else replicate with probability E(h)/X (sketch below)

Is there a counterstrategy?
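A minimal sketch of the adaptive replication decision, with an assumed exponential-smoothing update for E(h); the smoothing constant and the pessimistic starting value are illustration choices.

// Illustrative adaptive replication.
#include <random>

struct HostTrust {
    double invalid_rate = 1.0;   // E(h); start pessimistic so new hosts are always replicated
};

// Update E(h) as this host's results are validated (assumed exponential smoothing).
void update_invalid_rate(HostTrust& h, bool result_was_invalid) {
    const double ALPHA = 0.05;
    h.invalid_rate = (1 - ALPHA) * h.invalid_rate + ALPHA * (result_was_invalid ? 1.0 : 0.0);
}

// Decide whether to create a second replica of a job sent to this host.
bool should_replicate(const HostTrust& h, double X, std::mt19937& rng) {
    if (h.invalid_rate > X) return true;                       // untrusted host: always replicate
    std::bernoulli_distribution coin(h.invalid_rate / X);      // trusted host: replicate sometimes
    return coin(rng);
}

Because even trusted hosts are spot-checked with probability E(h)/X, a host that starts returning bad results sees its E(h) climb and is soon fully replicated again; whether a determined cheater can exploit the window is the counterstrategy question above.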

Page 33:

Server simulation

- How do we know these policies are any good? How can we study alternatives?
- In situ study is difficult
- SIMBA emulator (U. of Delaware):

[Diagram: SIMBA (emulates N clients) <-> BOINC server (not emulated).]

Page 34:

Upcoming scheduler changes

Problems:
- only one app version is used
- the completion-time simulation is antiquated (doesn't reflect multithread, coprocessor, or RAM limitations)

New concept: resource signature
- #CPUs, #coprocessors, RAM
- do the simulation based on "greedy EDF scheduling" using the resource signature
- select the app version that can use the available resources (sketch below)
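To illustrate the resource-signature idea, here is a hedged sketch of selecting, among an app's versions, the fastest one whose requirements fit the host's available #CPUs, #coprocessors, and RAM; the field names are assumptions for this example.

// Illustrative resource signature and version selection.
#include <vector>

struct ResourceSignature {
    int ncpus;
    int ncoprocs;           // e.g. CUDA GPUs
    double ram_bytes;
};

struct VersionUsage {
    double avg_ncpus;       // CPUs used (may be fractional)
    int ncoprocs;           // coprocessors needed
    double ram_bytes;       // working-set size
    double projected_flops; // expected speed on this host
};

// Does this version fit within the host's available resources?
bool fits(const VersionUsage& v, const ResourceSignature& avail) {
    return v.avg_ncpus <= avail.ncpus
        && v.ncoprocs  <= avail.ncoprocs
        && v.ram_bytes <= avail.ram_bytes;
}

// Among the versions that fit, pick the fastest one.
const VersionUsage* best_fitting_version(const std::vector<VersionUsage>& versions,
                                         const ResourceSignature& avail) {
    const VersionUsage* best = nullptr;
    for (const VersionUsage& v : versions) {
        if (fits(v, avail) && (!best || v.projected_flops > best->projected_flops)) best = &v;
    }
    return best;
}

The same signature could feed a greedy EDF simulation: deduct each scheduled job's CPUs, coprocessors, and RAM from the signature and check deadlines against what remains.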

Page 35:

Conclusion

Volunteer computing has diverse resources and workloads

BOINC has mechanisms that deal effectively and efficiently with this diversity

Lots of fun research problems here!

[email protected]