SAM Job Submission
• What is SAM?
• sam submit ……
• Data Management
• Details.
• Conclusions.
Rod Walker, 10th May, GridPP, Manchester.
What is SAM?
• SAM is Sequential data Access via Meta-data
• Project started in 1997 to handle D0’s needs for Run II data system.
• Current SAM team includes:
  – Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)
• http://d0db.fnal.gov/sam
SAM is a Distributed System
[Diagram: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 to Station n Servers, Mass Storage System(s). Regions labelled Shared Globally, Local, and Shared Locally. Arrows indicate control and data flow.]
Job Submission
• Executable
  – Runtime environment
    • Executable & assoc. files (user specific).
    • Experiment environment.
• Data
  – Dataset definition
    • Select by metadata.
    • Converted to LFNs at submit time, i.e. datasets change.
    • Build SQL query…then…execute query.
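The dataset-definition step can be sketched as below. This is only an illustration: the `files` table, its columns (`data_tier`, `run_number`) and all function names are hypothetical, and SAM's real Oracle schema is not shown in these slides. The point is that a definition is a set of metadata constraints, resolved into a list of LFNs only at submit time, so the same definition can return more files later.

```python
import sqlite3

# Hypothetical metadata columns, for illustration only (not SAM's schema).
ALLOWED_COLUMNS = {"data_tier", "run_number"}

def make_catalogue():
    """Create an in-memory stand-in for the metadata database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (lfn TEXT PRIMARY KEY, data_tier TEXT, run_number INTEGER)"
    )
    return db

def add_file(db, lfn, data_tier, run_number):
    db.execute("INSERT INTO files VALUES (?, ?, ?)", (lfn, data_tier, run_number))

def build_query(definition):
    """Build SQL query: turn metadata constraints into parameterised SQL."""
    columns = sorted(definition)
    unknown = set(columns) - ALLOWED_COLUMNS
    if not columns or unknown:
        raise ValueError(f"bad dataset definition: {definition}")
    where = " AND ".join(f"{col} = ?" for col in columns)
    sql = f"SELECT lfn FROM files WHERE {where} ORDER BY lfn"
    return sql, [definition[col] for col in columns]

def resolve(db, definition):
    """...then execute query: return the LFNs matching the definition now."""
    sql, params = build_query(definition)
    return [row[0] for row in db.execute(sql, params)]
```

Calling `resolve` twice with the same definition, before and after new files are catalogued, gives two different LFN lists, which is the "datasets change" behaviour noted above.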
Job Running & Job Control
[Diagram: Client, Local SM (Station Master), Batch System, Process Manager (SAM wrapper script), User Task, Job Manager (Project Master). Numbered arrows:]
1. sam submit –defname=mydata –script=myexe
(Run this exe | on this data)
2. submit to SM
3. invoke
4. submit to BS
5. Submission ok
6. start job
7. Started
8. invoke
9. setJobCount/stop
10. resubmit
jobEnd
Job control
[Diagram: User exe, Stager, Replica Catalogue, BS. Labels: getNextFile(); "Here's the path to a local file: /sam/cache1/boo/mydata1.dat"; Wait; Finished; LFN, PFN; Fetch PFN; Release; Physics & wrapper; steps numbered 1 to 4.]
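The getNextFile() exchange can be pictured as a simple consumer loop. The sketch below is an assumption-laden illustration, not SAM's API: `FakeProject`, `consume`, and the `WAIT` / `FINISHED` sentinels are hypothetical names, and it assumes each request yields either a local path, a "wait" reply, or a "finished" reply.

```python
from collections import deque

WAIT = object()      # hypothetical reply: next file is not staged yet
FINISHED = object()  # hypothetical reply: no files left in the project

class FakeProject:
    """Stand-in for the job manager; replies are scripted in advance."""

    def __init__(self, replies):
        self._replies = deque(replies)

    def get_next_file(self):
        # Once the scripted replies run out, report that the project is done.
        return self._replies.popleft() if self._replies else FINISHED

def consume(project, process, sleep=lambda seconds: None, poll_seconds=30):
    """Ask for files until FINISHED; pause and ask again on WAIT."""
    processed = []
    while True:
        reply = project.get_next_file()
        if reply is FINISHED:
            return processed
        if reply is WAIT:
            sleep(poll_seconds)  # in a real job this would be time.sleep
            continue
        process(reply)           # the physics code reads the local file
        processed.append(reply)
```

The user code never sees LFNs or remote locations here; it is only handed a path to a file already in the local cache.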
Replica Catalogue
• Combined with Metadata in an Oracle database, although logically distinct
  – Query on metadata to create a dataset
    • list of LFNs
    • Experiment specific (D0/CDF).
  – Query on LFN to locate physical file.
    • Generic replica catalogue.
    • node:/path/to/cache/myfile.dat
Replica Catalogue
600,000 files increasing at 3000/day, 120TB.
150,000 in cache
5000 files per day replicated, 5000 destroyed.
½ million queries per day, (90% SELECT).
Cache Management
• 13.6TB, in several hundred individually managed caches.
• 1TB in and out/day (10k files)
• Cache lifetime ~10 days
• Various prescriptions for cache replacement, e.g. 1st in, 1st out, last use.
70% hit rate (~6000 files/day)
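To make the replacement prescriptions concrete, here is a minimal sketch of a fixed-size cache with two policies. It reads "1st in, 1st out" as FIFO and "last use" as least-recently-used eviction, which is an interpretation of the slide rather than a description of SAM's code; the class and method names are hypothetical, and capacity is counted in files rather than bytes for brevity.

```python
from collections import OrderedDict

class FileCache:
    """Fixed-capacity cache (counted in files) with FIFO or LRU replacement."""

    def __init__(self, capacity, policy="fifo"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("fifo", "lru"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self._files = OrderedDict()  # front of the dict is evicted first
        self.hits = 0
        self.misses = 0

    def request(self, lfn):
        """Return True on a cache hit, False if the file had to be fetched."""
        if lfn in self._files:
            self.hits += 1
            if self.policy == "lru":
                self._files.move_to_end(lfn)  # mark as most recently used
            return True
        self.misses += 1
        if len(self._files) >= self.capacity:
            self._files.popitem(last=False)   # evict the entry at the front
        self._files[lfn] = True
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Replaying the same request sequence through both policies shows how the choice changes the hit rate when some files are reused often.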
Replication
• Easy – use your favourite ftp.
• BUT……what could go wrong.
  – Cache space – Cache Management.
  – network, dead node, corrupted file – retries.
  – dead disk, uncached – fail-over.
  – sluggish robot, slow delivery – hold job.
• A stroll through my log file.
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
Code: delivery error (Category SAM Internal)
Severity level: ERROR
Generated on 07 May 16:01:51 by eworker
In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
EXIT CODE:256
STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
STDERR: kshd: Logins currently disabled.
trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION!
d0cs015.fnal.gov: Connection refused, method name: samcp
Recommended action: Please contact [email protected]
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds
Retry
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
Code: delivery error (Category SAM Internal)
Severity level: ERROR
Generated on 07 May 16:02:35 by eworker
In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
EXIT CODE:256
STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
STDERR: kshd: Logins currently disabled.
trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION!
d0cs015.fnal.gov: Connection refused, method name: samcp
Recommended action: Please contact [email protected]
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations: (cab:d0cs015.fnal.gov:/sam/cache/boo)
05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all (prl733.24)
Give up on this source.
Avoid this location. Get another location from RC, and retry.
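The retry-then-fail-over behaviour seen in the log can be sketched as below. This is a hedged illustration, not the station's actual logic: `deliver` and its parameters are hypothetical, the copy step is a caller-supplied function, and the defaults (one retry, 3 second delay) are simply read off the log excerpt above.

```python
def deliver(lfn, locations, copy, max_retries=1, retry_delay=3,
            sleep=lambda seconds: None):
    """Try each replica location in turn, retrying before giving up on it.

    `copy(location, lfn)` is supplied by the caller and must raise OSError
    on failure. Returns (location_used, avoided_locations).
    """
    avoided = []
    for location in locations:
        for attempt in range(1 + max_retries):
            try:
                copy(location, lfn)
                return location, avoided
            except OSError:
                if attempt < max_retries:
                    sleep(retry_delay)   # "scheduling retry in 3 seconds"
        # Retries exhausted: avoid this source and move to the next one.
        avoided.append(location)
    raise RuntimeError(f"no usable location for {lfn}; avoided {avoided}")
```

In this sketch the list of candidate locations is passed in up front; in the log, the next location is obtained from the replica catalogue only after the first source has been marked as one to avoid.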
05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
Code: OK (Category Enstore)
Severity level: SUCCESS
Generated on 07 May 16:10:53 by eworker
In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
EXIT CODE:0
STDOUT:
INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
OUTFILE=/sam/cache20/lancs/boo
FILESIZE=1369320147
LABEL=PRL859
LOCATION=0000_000000000_0000067
DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
DRIVE_SN=4560020042
TRANSFER_TIME=160.38
SEEK_TIME=73.47
MOUNT_TIME=25.36
QWAIT_TIME=65.79
TIME2NOW=329.78
STATUS=ok
STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec. Overall rate = 3.96 MB/sec. Drive rate = 8.14 MB/sec. Network rate = 8.13 MB/sec. Exit status
Got it
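As a quick check on the numbers in that log entry, the KEY=VALUE lines can be parsed and the rates recomputed. The helper names below are hypothetical; the input values are copied from the log. The results match the logged 3.96 and 8.14 if "MB" is taken to mean 2^20 bytes, and show that mount, seek and queue wait cost about as much time as the tape read itself.

```python
def parse_stats(lines):
    """Parse KEY=VALUE lines into a dict, ignoring lines without '='."""
    stats = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

def rate_mib_per_s(size_bytes, seconds):
    """Transfer rate in units of 2**20 bytes per second."""
    return size_bytes / seconds / 2**20

stats = parse_stats([
    "FILESIZE=1369320147",
    "TRANSFER_TIME=160.38",
    "SEEK_TIME=73.47",
    "MOUNT_TIME=25.36",
    "QWAIT_TIME=65.79",
])
size = int(stats["FILESIZE"])

# Time spent reading from the drive only.
drive_rate = rate_mib_per_s(size, float(stats["TRANSFER_TIME"]))
# Total elapsed time as reported on the STDERR line.
overall_rate = rate_mib_per_s(size, 329.720216036)

print(f"drive {drive_rate:.2f}, overall {overall_rate:.2f}")
# prints "drive 8.14, overall 3.96"
```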
05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
--------------------------
05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
Code: OK (Category Enstore)
Severity level: SUCCESS
Generated on 07 May 16:00:56 by eworker
In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
EXIT CODE:0
STDOUT:
INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
OUTFILE=/sam/cache20/lancs/boo
FILESIZE=788805399
LABEL=PRL829
LOCATION=0000_000000000_0000025
DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
DRIVE_SN=4560020042
TRANSFER_TIME=90.08
SEEK_TIME=45.05
MOUNT_TIME=27.14
QWAIT_TIME=225.50
TIME2NOW=392.28
STATUS=ok
STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec. Overall rate = 1.92 MB/sec. Drive rate = 8.35 MB/sec. Network rate = 8.35 MB/sec. Exit status = 0., method name: samcp
Recommended action: Please contact [email protected]
05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk
Hold in queue until 1st file delivered.
Release
File arrives
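The hold-and-release pattern in this last log excerpt can be sketched as a small bookkeeping class. Everything here is illustrative: `HeldJobs` and its methods are hypothetical names, and the command runner is injected so the sketch can be exercised without a PBS installation. Only the `qrls <job id>` command itself is taken from the log.

```python
class HeldJobs:
    """Remember held batch jobs and release each on its first delivered file."""

    def __init__(self, run_command):
        # run_command takes an argv list; with real PBS it could be
        # subprocess.run, here it is injected for testing.
        self._run = run_command
        self._held = {}  # project id -> batch job id

    def remember_held(self, project, job_id):
        """Record that the job for this project sits held in the queue."""
        self._held[project] = job_id

    def file_delivered(self, project):
        """Release the project's job if it is still held.

        Returns True if a release was issued, False if the job was
        already released or never held.
        """
        job_id = self._held.pop(project, None)
        if job_id is None:
            return False
        self._run(["qrls", job_id])
        return True
```

Holding the job until its first file is in cache means a batch slot is not occupied while the tape robot is still fetching data; later deliveries for the same project do not trigger a second release.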