Grids : Interaction and Scaling
(Analysis on the Grid)
Jeff Templon
NIKHEF
ACAT’07 Conference
Amsterdam, 25 April 2007
Jeff Templon – Grids : Interaction and Scaling, ACAT’07 Amsterdam, 2007.04.25 - 2
Roadmap
Introduction to problem
Some history
Current Hot Topics
Some recommendations
(BTW, for those of you who don't know what I'm doing these days -- I'm managing the Hotmail development team. The Hotmail system serves hundreds of millions of users and runs on thousands of machines. While the Hotmail application is not particularly complex (hey, it's just email), the system itself is very complex. ALL of the hard bugs are "weird interaction" bugs, and we've seen many, many bugs that occur only at scale.)
The thing that makes building software difficult is the complex, unpredictable interactions of software systems. (This is also the thing that makes software estimates so darn hard.) People try to compare writing software with building a bridge or a skyscraper. But, generally speaking, buildings and skyscrapers don't interact with one another, or at least, not in very complex ways. Software does. A bug in one piece of code can cause all kinds of weird behavior in other parts of the system.
Interviewing at Google
Lots of back-of-envelope computation and the like, too. A friend of mine thought he was doing well in his second Google phone interview when asked to sketch a way to compute bigram statistics for a corpus of a hundred million documents—he had started discussing std::map<std::string> and the like, and didn’t get why the interviewer seemed distinctly unimpressed, until I pointed out even if documents are only a couple thousand words each, where are you going to STORE those two hundred billion words—in memory?! That’s a job for an enterprise-scale database engine!
So, at least as far as the interviewing process goes, it seems designed for people with a vast array of interests related to programming, computation, modeling, data processing, networking, and good problem-rough-sizing abilities—I guess Google routinely faces problems that may not be hugely complex but are made so by the sheer scale involved.
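The back-of-envelope step in that interview story is worth making explicit. A minimal sketch of the arithmetic, using the numbers from the anecdote (one hundred million documents, a couple thousand words each; the 6-bytes-per-word figure is my own rough assumption):

```python
# Rough sizing of the interview problem above. The document counts come
# from the anecdote; avg_bytes_per_word is an assumed ballpark figure.
docs = 100_000_000
words_per_doc = 2_000
total_words = docs * words_per_doc            # 2e11: "two hundred billion words"

avg_bytes_per_word = 6                        # assumption: short word + separator
corpus_bytes = total_words * avg_bytes_per_word

# ~1.2 TB of raw text: nowhere near fitting in one machine's RAM,
# which is why the in-memory std::map answer fell flat.
print(f"{corpus_bytes / 1e12:.1f} TB of raw text")
```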
Old-style analysis & interactions
% paw
PAW[1] > exec ana.kumac
New style analysis & interactions
[Workflow diagram, ALICE central services <-> Site: the User submits a job via the ALICE Job Catalogue. The Optimizer splits jobs by input file and updates the TQ, e.g. Job 1 (lfn1, lfn2, lfn3, lfn4) -> Job 1.1 (lfn1), Job 1.2 (lfn2), Job 1.3 (lfn3, lfn4). The ComputingAgent matchmakes on close SE's and installed software, and submits a job agent through the RB (LCG) to the site's CE. On the WN the job agent checks "Env OK?": if yes, it asks for, retrieves, and executes the workload; if no, it dies with grace. The job result is sent back, and output is registered as (lfn, guid, {se's}) entries in the ALICE File Catalogue; packman handles software on the VO-Box. Thanks Fed.]
Scaling in Grid Era
Sites : 5 -> 200 (x 40)
Users : 5 -> 350 (x 70)
Cores : 25 -> 50 000 (x 2000)
Concurrent Jobs : 5 -> 50 000 (x 10 000)
Bytes : 100 GB -> 5 PB (x 50 000)
Factor 10^3 - 10^4 in six years
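The parenthesised numbers on the slide are the growth factors; a quick check of the arithmetic (using decimal units for the 100 GB -> 5 PB step):

```python
# Verify the growth factors quoted on the slide.
before = {"sites": 5, "users": 5, "cores": 25, "jobs": 5, "bytes": 100e9}
after = {"sites": 200, "users": 350, "cores": 50_000, "jobs": 50_000, "bytes": 5e15}

factors = {k: after[k] / before[k] for k in before}
# sites x40, users x70, cores x2000, jobs x10000, bytes x50000:
# three to four orders of magnitude in six years.
print(factors)
```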
Metacomputing Directory Service
Observations
Globus MDS : if more than a few (5-10) sites subscribe to a higher-level MDS server (GIIS), populating the higher-level GIIS starts to take a very long time
MDS servers have a tendency to fail if the amount of data they handle exceeds a limit (not a hard limit, but we could see the failure rate increase rapidly with the amount of data)
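A toy model (not the real MDS code; the latency and timeout values are illustrative assumptions) shows why population time blows up past a handful of sites: if the GIIS polls children sequentially with a per-query timeout, every hung site adds the full timeout to each refresh cycle.

```python
# Toy model of hierarchical info-system population time.
# Assumptions: sequential child queries, a fixed per-query timeout,
# and some fraction of sites hung at any moment.
def population_time(n_sites, normal_latency=0.5, timeout=30.0, hung_fraction=0.1):
    """Seconds to refresh a top-level GIIS polling n_sites children."""
    hung = int(n_sites * hung_fraction)
    return hung * timeout + (n_sites - hung) * normal_latency

for n in (5, 10, 50):
    print(n, population_time(n))   # 5 -> 2.5 s, 10 -> 34.5 s, 50 -> 172.5 s
```

With 5 sites no one happens to be hung and refresh is quick; past 10 sites the timeouts dominate, matching the observed behaviour.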
New architecture, same interface
Concept : one short lunch
Implementation : one long day
Why? We had seen it ...
C = Grid software service (like an http server)
[Diagram: the Information System at the centre, connected to every C]
Information System is Central Nervous System of Grid
Info system defines grid
C = Grid software service
Computing Task Submission
[Diagram: the user sends proxy + command (+ data) to the W.M.S.; the W.M.S. passes coarse requirements to the I.S. and gets back candidate clusters among the C's]
WMS Scaling

Job Submission
==============
The basic job submission system works, and has done so in previous releases, in that a single job can be submitted and run successfully. However, if large numbers of jobs are submitted the performance is found to degrade rapidly. Potentially the facility for resubmission of failed jobs, and the plan to check for dying daemons and restart them, should help performance with respect to the 1.1 testbed, but we don't think that they should be considered a substitute for finding and fixing the underlying problems. (WP Managers, June 2002)
Problems:
o Underdimensioning (max open files)
o Memory Leaks
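The "max open files" flavour of underdimensioning is cheap to guard against: a service can check the limit at startup instead of failing mysteriously under load. A minimal sketch using Python's standard `resource` module (the 10 000 figure is a hypothetical expected load, not from the source):

```python
# Check the per-process open-file limit at service startup.
# expected_load is a hypothetical sizing number for illustration.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
expected_load = 10_000   # assumed: sockets + log files at peak concurrency

if soft < expected_load:
    print(f"warning: soft nofile limit {soft} < expected load {expected_load}")
else:
    print(f"nofile limit {soft} looks sufficient")
```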
In addition, Stephen Burke submitted 9 jobs, all of which went to NIKHEF and immediately failed with a Globus error. Further tests were impossible because the resource broker became unreachable.
CE Scaling
[Diagram: a Job from the WMS arrives at the CE; a Jobmanager on the CE "babysits" the Job on a WN and reports status back to mother]
CE Scaling
[Diagram: Jobs from the WMS now arrive in bulk, and the CE runs one Jobmanager per Job on a WN: dozens of Jobmanager / Job-on-WN pairs crowd the slide]
C = Grid software service (like an http server)
[Diagram, repeated from earlier: the Information System at the centre, connected to every C]
Information System is Central Nervous System of Grid
Info system defines grid
Scaling : LRMS operations (counting)
OpenPBS worked fine for 10 WN
Worked almost fine for 70 WN
Definitely doesn’t work fine for 100 WN
Almost no site uses OpenPBS now
Torque 1 : could not handle load from queries by job manager!
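The Torque 1 failure mode is a simple counting argument: one jobmanager per job, each polling the batch server for status, means the query rate grows linearly with the number of running jobs. A toy model (the poll interval is an assumed figure, not from the source):

```python
# Toy model: status-query load on the LRMS from per-job polling.
# poll_interval is an assumption for illustration.
def lrms_query_rate(n_jobs, poll_interval=30.0):
    """Status queries per second hitting the batch server."""
    return n_jobs / poll_interval

print(lrms_query_rate(10))     # 10-WN era: well under 1 query/s
print(lrms_query_rate(3000))   # grid era: 100 queries/s of pure overhead
```

At small scale the load is invisible; at grid scale the bookkeeping traffic alone can swamp the batch server, which is the effect described above.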
Current Hot Topics
Castor (LSF “jobs”)
Data Cleanup : can take a very long time (5 sec per file x 20 000 files = ~ 1 day)
Sometimes delete doesn't really delete ("advisory" delete)
Production team : staff spending ~ 5% of their time doing similar stuff
ATLAS : disk space is full (event inflation, "private" replicas, "dark storage")
LHCb Denial of Service Scenario
Analysis Job Needs to process ~ 20 input files
Retrieves these from local storage at start of job
Production Manager drinks coffee, starts to submit
~ 200 jobs arrive at site in burst
~ 4000 “get” requests arrive at local storage system in burst
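The burst size is just multiplication: 200 jobs times ~20 input files each. One common mitigation, shown here purely as an illustration (not LHCb's actual fix), is to add random jitter before each job's first storage request, spreading the burst over time:

```python
import random

# The burst arithmetic from the scenario above.
jobs, files_per_job = 200, 20
burst = jobs * files_per_job          # 4000 simultaneous "get" requests

# Illustrative mitigation: random start delay per job, so 4000 gets
# arrive over ~10 minutes instead of all at once. The delay window
# is an assumed parameter.
def staggered_start(max_delay_s=600):
    """Random jitter (seconds) before a job's first storage request."""
    return random.uniform(0, max_delay_s)

print(burst, "gets; peak rate drops to ~", burst / 600, "gets/s with jitter")
```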
D0 Mirror Image
Stager is off-site (IN2P3 or FNAL)
Jobs wait for files : burn farm allocation waiting ...
Day / Night effect (best between 4 and 7 our time ...)
Scaling Challenges for Analysis
# of users (? -> ??)
Amount of data (factor 10 in next year?)
Number of files (?)
Lack of organization : maybe statistical mechanics will save us
Number of assumptions : VM techniques may save us
Some Things to Consider When Developing Your Systems
Employ Passive Stability
Black Hole Site : prints default values to the info system if the info-production program fails
Default values are "zero"
"I can run your job in zero seconds from now" -> 2 x 10^9
Job Submission Loop : unless stop signal seen, keep submitting (bad dog, no bone)
If sys responds with "yes I have work" then submit again
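A passively stable consumer of the info system distrusts values that are "too good": a site whose info provider has failed and fallen back to defaults advertises a zero estimated response time, which would otherwise attract every job. A minimal sketch with hypothetical field names (not the real info-system schema):

```python
# Sketch of passive stability in matchmaking: treat the failure-default
# value (zero ERT) as implausible rather than as the best offer.
# Field names here are hypothetical, for illustration only.
def plausible(site):
    ert = site.get("estimated_response_time")
    # zero or missing ERT is the info-provider failure default,
    # not a real measurement, so exclude the site
    return ert is not None and ert > 0

sites = [
    {"name": "good", "estimated_response_time": 120},
    {"name": "black-hole", "estimated_response_time": 0},
]
candidates = [s for s in sites if plausible(s)]
print([s["name"] for s in candidates])   # only "good" survives
```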
Plan for Failure
Hard for Physicists : “Maxwell’s Equations were down today”
Right now (20:36, 24 April 2007) : 25 of 218 LCG sites are reporting an error condition
Last week : a single disk enclosure at NIKHEF pinned about one bit in 10^9 for files > 64 MB
File size was fine
ROOT could open the file!!!
Some fool (so, genius) decided to record md5 checksums
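Recorded checksums catch exactly this class of silent corruption: size and openability both look fine, only the content differs. A minimal verifier using Python's standard `hashlib` (the chunked read is so large files need not fit in RAM):

```python
import hashlib

# Minimal md5 verification, the kind of check that caught the
# pinned-bit corruption described above.
def md5_of(path, chunk_size=1 << 20):
    """md5 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify(path, recorded_md5):
    """True if the file on disk still matches its recorded checksum."""
    return md5_of(path) == recorded_md5
```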
Log-ging instead of log-in
“Can you please give me login access to the WN so I can see why my job is stuck?” No.
“Can you please send me the log file named athlog.txt associated with the job running on wn-bull-023.farm.nikhef.nl?” Yes.
Heterogeneity
D0 Validation : "You guys have those new Athlons don't you? With some regular Pentiums as well?"
They could see this in the histograms!!
Same binaries, different results
Is that a Higgs or a Woodcrest?
[Info system reports 252 (???) distinct CPU model strings, e.g.: AMD Opteron(tm) Processor 146, AMD Opteron(tm) Processor 248, AMD Opteron(tm) Processor 250, ATHLON, ATHLON2000+, Athlon, Athlon 64 X2 4600+, Athlon64, AthlonMP2600, Dual Core AMD Opteron(tm) Processor 275, Dual Core Opteron, EM64T, Elonex_2800, IA32, IA64, IBMe326m, Intel, Intel Pentium D 840, Intel(R) Pentium(R) 4 CPU 1.70GHz, Intel(R) Xeon(R) 5130 @ 2.00GHz, Intel(R) Xeon(R) CPU 5130 @ 2.00GHz, Intel(R) Xeon(TM) CPU 3.00GHz, Intel(R) Xeon(TM) CPU 3.06GHz, ... 27 more]
What to do? Deal with it ... fact of life
Measure first, then optimize -> prototype! It is all too easy to be too clever
"Premature optimization is the root of all evil" -- Knuth
Find the scaling problems early on, before your code investment is huge (The Practice of Programming, Kernighan and Pike)
Logging frameworks : can always turn off logging
N_wrong >> N_right : it's probably a bug in your code! [ wasting others' time ]
Think of us poor admin types …
Have fun ... at least as much as we did getting it ready for you! (or come work with us ... two open positions: 1 PhD, 1 postdoc)
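The logging-framework advice above ("can always turn off logging") can be sketched with Python's standard `logging` module, shown here as a generic illustration rather than a grid-specific framework:

```python
import logging

# Job code logs freely at several levels; the deployment decides what
# is actually recorded, down to nothing at all.
log = logging.getLogger("analysis")
logging.basicConfig(level=logging.INFO)

log.debug("per-event detail")            # suppressed at INFO level
log.info("processed 20 input files")     # recorded

log.setLevel(logging.CRITICAL)           # effectively "off" for this logger
log.info("this line is not recorded")
```

The win for admins is that verbosity becomes a configuration choice, not a recompile, so the same binary can run quietly in production and chattily when debugging a stuck job.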