Grids : Interaction and Scaling
(Analysis on the Grid)
Jeff Templon
NIKHEF
ACAT’07 Conference
Amsterdam, 25 April 2007
Jeff Templon – Grids : Interaction and Scaling, ACAT’07 Amsterdam, 2007.04.25 - 2
Roadmap
Introduction to problem
Some history
Current Hot Topics
Some recommendations
(BTW, for those of you who don't know what I'm doing these days -- I'm managing the Hotmail development team. The Hotmail system serves hundreds of millions of users and runs on thousands of machines. While the Hotmail application is not particularly complex (hey, it's just email), the system itself is very complex. ALL of the hard bugs are "weird interaction" bugs, and we've seen many, many bugs that occur only at scale.)
The thing that makes building software difficult is the complex, unpredictable interactions of software systems. (This is also the thing that makes software estimates so darn hard.) People try to compare writing software with building a bridge or a skyscraper. But, generally speaking, buildings and skyscrapers don't interact with one another, or at least, not in very complex ways. Software does. A bug in one piece of code can cause all kinds of weird behavior in other parts of the system.
Interviewing at Google
Lots of back-of-envelope computation and the like, too. A friend of mine thought he was doing well in his second Google phone interview when asked to sketch a way to compute bigram statistics for a corpus of a hundred million documents—he had started discussing std::map<std::string> and the like, and didn’t get why the interviewer seemed distinctly unimpressed, until I pointed out even if documents are only a couple thousand words each, where are you going to STORE those two hundred billion words—in memory?! That’s a job for an enterprise-scale database engine!
So, at least as far as the interviewing process goes, it seems designed for people with a vast array of interests related to programming, computation, modeling, data processing, networking, and good problem-rough-sizing abilities—I guess Google routinely faces problems that may not be hugely complex but are made so by the sheer scale involved.
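The back-of-envelope step in that interview story is worth making explicit. A minimal sketch of the arithmetic, using the numbers from the anecdote (one hundred million documents, a couple thousand words each; the 6-bytes-per-word figure is my own rough assumption):

```python
# Rough sizing of the interview problem above. The document counts come
# from the anecdote; avg_bytes_per_word is an assumed ballpark figure.
docs = 100_000_000
words_per_doc = 2_000
total_words = docs * words_per_doc            # 2e11: "two hundred billion words"

avg_bytes_per_word = 6                        # assumption: short word + separator
corpus_bytes = total_words * avg_bytes_per_word

# ~1.2 TB of raw text: nowhere near fitting in one machine's RAM,
# which is why the in-memory std::map answer fell flat.
print(f"{corpus_bytes / 1e12:.1f} TB of raw text")
```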
Old-style analysis & interactions
% paw
PAW[1] > exec ana.kumac
New style analysis & interactions
[Workflow diagram, ALICE central services <-> Site: the User submits a job via the ALICE Job Catalogue. The Optimizer splits jobs by input file and updates the TQ, e.g. Job 1 (lfn1, lfn2, lfn3, lfn4) -> Job 1.1 (lfn1), Job 1.2 (lfn2), Job 1.3 (lfn3, lfn4). The ComputingAgent matchmakes on close SE's and installed software, and submits a job agent through the RB (LCG) to the site's CE. On the WN the job agent checks "Env OK?": if yes, it asks for, retrieves, and executes the workload; if no, it dies with grace. The job result is sent back, and output is registered as (lfn, guid, {se's}) entries in the ALICE File Catalogue; packman handles software on the VO-Box. Thanks Fed.]
Scaling in Grid Era
Sites : 5 -> 200 (x 40)
Users : 5 -> 350 (x 70)
Cores : 25 -> 50 000 (x 2000)
Concurrent Jobs : 5 -> 50 000 (x 10 000)
Bytes : 100 GB -> 5 PB (x 50 000)
Factor 10^3 - 10^4 in six years
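The parenthesised numbers on the slide are the growth factors; a quick check of the arithmetic (using decimal units for the 100 GB -> 5 PB step):

```python
# Verify the growth factors quoted on the slide.
before = {"sites": 5, "users": 5, "cores": 25, "jobs": 5, "bytes": 100e9}
after = {"sites": 200, "users": 350, "cores": 50_000, "jobs": 50_000, "bytes": 5e15}

factors = {k: after[k] / before[k] for k in before}
# sites x40, users x70, cores x2000, jobs x10000, bytes x50000:
# three to four orders of magnitude in six years.
print(factors)
```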
Metacomputing Directory Service
Observations
Globus MDS : if more than a few (5-10) sites subscribe to a higher-level MDS server (GIIS), populating the higher-level GIIS starts to take a very long time
MDS servers have a tendency to fail if the amount of data they handle exceeds a limit (not a hard limit, but we could see the failure rate increase rapidly with the amount of data)
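A toy model (not the real MDS code; the latency and timeout values are illustrative assumptions) shows why population time blows up past a handful of sites: if the GIIS polls children sequentially with a per-query timeout, every hung site adds the full timeout to each refresh cycle.

```python
# Toy model of hierarchical info-system population time.
# Assumptions: sequential child queries, a fixed per-query timeout,
# and some fraction of sites hung at any moment.
def population_time(n_sites, normal_latency=0.5, timeout=30.0, hung_fraction=0.1):
    """Seconds to refresh a top-level GIIS polling n_sites children."""
    hung = int(n_sites * hung_fraction)
    return hung * timeout + (n_sites - hung) * normal_latency

for n in (5, 10, 50):
    print(n, population_time(n))   # 5 -> 2.5 s, 10 -> 34.5 s, 50 -> 172.5 s
```

With 5 sites no one happens to be hung and refresh is quick; past 10 sites the timeouts dominate, matching the observed behaviour.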
New architecture, same interface
Concept : one short lunch
Implementation : one long day
Why? We had seen it ...
C = Grid software service (like an http server)
[Diagram: the Information System at the centre, connected to every C]
Information System is Central Nervous System of Grid
Info system defines grid
C = Grid software service
Computing Task Submission
[Diagram: the user sends proxy + command (+ data) to the W.M.S.; the W.M.S. passes coarse requirements to the I.S. and gets back candidate clusters among the C's]
WMS Scaling

Job Submission
==============
The basic job submission system works, and has done so in previous releases, in that a single job can be submitted and run successfully. However, if large numbers of jobs are submitted the performance is found to degrade rapidly. Potentially the facility for resubmission of failed jobs, and the plan to check for dying daemons and restart them, should help performance with respect to the 1.1 testbed, but we don't think that they should be considered a substitute for finding and fixing the underlying problems. (WP Managers, June 2002)
Problems:
o Underdimensioning (max open files)
o Memory Leaks
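The "max open files" flavour of underdimensioning is cheap to guard against: a service can check the limit at startup instead of failing mysteriously under load. A minimal sketch using Python's standard `resource` module (the 10 000 figure is a hypothetical expected load, not from the source):

```python
# Check the per-process open-file limit at service startup.
# expected_load is a hypothetical sizing number for illustration.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
expected_load = 10_000   # assumed: sockets + log files at peak concurrency

if soft < expected_load:
    print(f"warning: soft nofile limit {soft} < expected load {expected_load}")
else:
    print(f"nofile limit {soft} looks sufficient")
```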
In addition, Stephen Burke submitted 9 jobs, all of which went to NIKHEF and immediately failed with a Globus error. Further tests were impossible because the resource broker became unreachable.
CE Scaling
[Diagram: a Job from the WMS arrives at the CE; a Jobmanager on the CE "babysits" the Job on a WN and reports status back to mother]
CE Scaling
[Diagram: Jobs from the WMS now arrive in bulk, and the CE runs one Jobmanager per Job on a WN: dozens of Jobmanager / Job-on-WN pairs crowd the slide]
C = Grid software service (like an http server)
[Diagram, repeated from earlier: the Information System at the centre, connected to every C]
Information System is Central Nervous System of Grid
Info system defines grid
Scaling : LRMS operations (counting)
OpenPBS worked fine for 10 WN
Worked almost fine for 70 WN
Definitely doesn’t work fine for 100 WN
Almost no site uses OpenPBS now
Torque 1 : could not handle load from queries by job manager!
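The Torque 1 failure mode is a simple counting argument: one jobmanager per job, each polling the batch server for status, means the query rate grows linearly with the number of running jobs. A toy model (the poll interval is an assumed figure, not from the source):

```python
# Toy model: status-query load on the LRMS from per-job polling.
# poll_interval is an assumption for illustration.
def lrms_query_rate(n_jobs, poll_interval=30.0):
    """Status queries per second hitting the batch server."""
    return n_jobs / poll_interval

print(lrms_query_rate(10))     # 10-WN era: well under 1 query/s
print(lrms_query_rate(3000))   # grid era: 100 queries/s of pure overhead
```

At small scale the load is invisible; at grid scale the bookkeeping traffic alone can swamp the batch server, which is the effect described above.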
Current Hot Topics
Castor (LSF “jobs”)
Data Cleanup : can take a very long time (5 sec per file x 20 000 files = ~ 1 day)
Sometimes delete doesn't really delete ("advisory" delete)
Production team : staff spending ~ 5% of their time doing similar stuff
ATLAS : disk space is full (event inflation, "private" replicas, "dark storage")
LHCb Denial of Service Scenario
Analysis Job Needs to process ~ 20 input files
Retrieves these from local storage at start of job
Production Manager drinks coffee, starts to submit
~ 200 jobs arrive at site in burst
~ 4000 “get” requests arrive at local storage system in burst
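The burst size is just multiplication: 200 jobs times ~20 input files each. One common mitigation, shown here purely as an illustration (not LHCb's actual fix), is to add random jitter before each job's first storage request, spreading the burst over time:

```python
import random

# The burst arithmetic from the scenario above.
jobs, files_per_job = 200, 20
burst = jobs * files_per_job          # 4000 simultaneous "get" requests

# Illustrative mitigation: random start delay per job, so 4000 gets
# arrive over ~10 minutes instead of all at once. The delay window
# is an assumed parameter.
def staggered_start(max_delay_s=600):
    """Random jitter (seconds) before a job's first storage request."""
    return random.uniform(0, max_delay_s)

print(burst, "gets; peak rate drops to ~", burst / 600, "gets/s with jitter")
```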
D0 Mirror Image
Stager is off-site (IN2P3 or FNAL)
Jobs wait for files : burn farm allocation waiting ...
Day / Night effect (best between 4 and 7 our time ...)
Scaling Challenges for Analysis
# of users (? -> ??)
Amount of data (factor 10 in next year?)
Number of files (?)
Lack of organization : maybe statistical mechanics will save us
Number of assumptions : VM techniques may save us
Some Things to Consider When Developing Your Systems
Employ Passive Stability
Black Hole Site : prints default values to the info system if the info-production program fails
Default values are "zero"
"I can run your job in zero seconds from now" -> 2 x 10^9
Job Submission Loop : unless stop signal seen, keep submitting (bad dog, no bone)
If sys responds with "yes I have work" then submit again
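A passively stable consumer of the info system distrusts values that are "too good": a site whose info provider has failed and fallen back to defaults advertises a zero estimated response time, which would otherwise attract every job. A minimal sketch with hypothetical field names (not the real info-system schema):

```python
# Sketch of passive stability in matchmaking: treat the failure-default
# value (zero ERT) as implausible rather than as the best offer.
# Field names here are hypothetical, for illustration only.
def plausible(site):
    ert = site.get("estimated_response_time")
    # zero or missing ERT is the info-provider failure default,
    # not a real measurement, so exclude the site
    return ert is not None and ert > 0

sites = [
    {"name": "good", "estimated_response_time": 120},
    {"name": "black-hole", "estimated_response_time": 0},
]
candidates = [s for s in sites if plausible(s)]
print([s["name"] for s in candidates])   # only "good" survives
```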
Plan for Failure
Hard for Physicists : “Maxwell’s Equations were down today”
Right now (20:36, 24 April 2007) : 25 of 218 LCG sites are reporting an error condition
Last week : a single disk enclosure at NIKHEF pinned about one bit in 10^9 for files > 64 MB
File size was fine
ROOT could open the file!!!
Some fool (so, genius) decided to record md5 checksums
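Recorded checksums catch exactly this class of silent corruption: size and openability both look fine, only the content differs. A minimal verifier using Python's standard `hashlib` (the chunked read is so large files need not fit in RAM):

```python
import hashlib

# Minimal md5 verification, the kind of check that caught the
# pinned-bit corruption described above.
def md5_of(path, chunk_size=1 << 20):
    """md5 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify(path, recorded_md5):
    """True if the file on disk still matches its recorded checksum."""
    return md5_of(path) == recorded_md5
```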
Log-ging instead of log-in
“Can you please give me login access to the WN so I can see why my job is stuck?” No.
“Can you please send me the log file named athlog.txt associated with the job running on wn-bull-023.farm.nikhef.nl?” Yes.
Heterogeneity
D0 Validation : "You guys have those new Athlons don't you? With some regular Pentiums as well?"
They could see this in the histograms!!
Same binaries, different results
Is that a Higgs or a Woodcrest?
[Info system reports 252 (???) distinct CPU model strings, e.g.: AMD Opteron(tm) Processor 146, AMD Opteron(tm) Processor 248, AMD Opteron(tm) Processor 250, ATHLON, ATHLON2000+, Athlon, Athlon 64 X2 4600+, Athlon64, AthlonMP2600, Dual Core AMD Opteron(tm) Processor 275, Dual Core Opteron, EM64T, Elonex_2800, IA32, IA64, IBMe326m, Intel, Intel Pentium D 840, Intel(R) Pentium(R) 4 CPU 1.70GHz, Intel(R) Xeon(R) 5130 @ 2.00GHz, Intel(R) Xeon(R) CPU 5130 @ 2.00GHz, Intel(R) Xeon(TM) CPU 3.00GHz, Intel(R) Xeon(TM) CPU 3.06GHz, ... 27 more]
What to do? Deal with it ... fact of life
Measure first, then optimize -> prototype! It is all too easy to be too clever
"Premature optimization is the root of all evil" -- Knuth
Find the scaling problems early on, before your code investment is huge (The Practice of Programming, Kernighan and Pike)
Logging frameworks : can always turn off logging
N_wrong >> N_right : it's probably a bug in your code! [ wasting others' time ]
Think of us poor admin types …
Have fun ... at least as much as we did getting it ready for you! (or come work with us ... two open positions: 1 PhD, 1 postdoc)
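The logging-framework advice above ("can always turn off logging") can be sketched with Python's standard `logging` module, shown here as a generic illustration rather than a grid-specific framework:

```python
import logging

# Job code logs freely at several levels; the deployment decides what
# is actually recorded, down to nothing at all.
log = logging.getLogger("analysis")
logging.basicConfig(level=logging.INFO)

log.debug("per-event detail")            # suppressed at INFO level
log.info("processed 20 input files")     # recorded

log.setLevel(logging.CRITICAL)           # effectively "off" for this logger
log.info("this line is not recorded")
```

The win for admins is that verbosity becomes a configuration choice, not a recompile, so the same binary can run quietly in production and chattily when debugging a stuck job.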