Batch system usage
The Zeuthen Farm
J. Bazo
10.11.2010
General stuff
● Computing wikipage: http://dvinfo.ifh.de
● Central email address for questions & requests: [email protected]
● Data storage:
  ● AFS (/afs/ifh.de/group/amanda/scratch/ your scratch area)
  ● Lustre: fast parallel storage, use in batch jobs
  ● dCache (acs): mass storage, files on tape
The Zeuthen Farm
Resources:
● Batch farm: ~700 cores, all Intel Xeon 2.33-3.0 GHz, RAM: 2-4 GB/core
  ● All nodes run 64-bit SL5
  ● SUN Grid Engine (SGE) 6.2
  ● Shared between all groups (nic, that, amanda, z_nuastr, cta, pitz, atlas)
● Cluster: 1024 cores, 2.8 GHz, 3 GB/core (accessible for theory groups)
Other farms:
● GridZn: atlas, lhcb, cta (9%), icecube (6%), etc. (you need a grid certificate)
● NAF (National Analysis Facility; Physics at the Terascale, Strategic Helmholtz Alliance): cms, atlas, lhcb, ilc
Using the SGE Batch Farm
WorkingGroupServers are only for small jobs. If more computing power is needed, use the farm.
Usage:
● 1. Split the task into small jobs
● 2. Script them
● 3. Submit the job scripts (qsub your_farm_script.sh)
Script to submit:
● Lines starting with #$ ... are interpreted by the batch system
● The rest is an ordinary shell script
Commands:
● qrsh: interactive access, for heavy ROOT sessions, moving data, ...
● qsub: batch jobs, for simulation, data processing, ...
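The three usage steps above can be sketched as a small driver that cuts a task into chunks and prints one `qsub` call per chunk. This is only an illustration: the helper names `print_chunks` and `submit_chunks`, the event counts, and the idea of passing the chunk boundaries as script arguments are assumptions, not part of the original slides. The calls are printed rather than executed so the sketch also runs on a machine without SGE.

```shell
#!/bin/sh
# Sketch: split a range of events into fixed-size chunks, one farm job each.
# print_chunks and submit_chunks are hypothetical helper names.

# Print "start end" (inclusive, 0-based) for every chunk.
print_chunks() {
    total=$1
    per_job=$2
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + per_job - 1))
        # The last chunk may be shorter than the others.
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# Dry run: print the qsub command for each chunk instead of executing it.
# Remove the "echo" to really submit on a host where qsub is available.
submit_chunks() {
    print_chunks "$1" "$2" | while read -r start end; do
        echo qsub your_farm_script.sh "$start" "$end"
    done
}

# 2500 events in jobs of 1000 events each gives three jobs.
submit_chunks 2500 1000
```

Inside `your_farm_script.sh` the two boundaries would then arrive as `$1` and `$2`.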
What every script needs
| Batch job line | Comment |
| --- | --- |
| `#!/bin/zsh` | |
| `#$ -S /bin/zsh` | shell to be used |
| `#$ -l h_cpu=00:29:00` | CPU time for this job |
| `#$ -l h_vmem=850M` | maximum memory usage of this job |
| `#$ -j y` | stderr and stdout are merged |
| `#$ -m bea` | send mail on job's begin, end and abort (bea); discouraged, only for testing |
| `#$ -o /afs/ifh.de/.../FarmMessages` | redirect output (-o); regularly delete the files, because if the directory is full the job will crash |
| `#$ -cwd` | execute job from current directory |
| `#$ -l os=sl5` | operating system; obsolete since all systems have SL5 (64-bit) |
| `#$ -P amanda` | project name; amanda has less priority than z_nuastr |
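As a rough sketch, the batch-system lines listed above could be generated by a small helper so that every job script starts with the same resource requests. The helper name `job_header` and its three arguments are assumptions made for illustration. The mail and operating-system options are left out here because the notes above call them discouraged or obsolete, and the output directory is omitted because its path is abbreviated in the source.

```shell
#!/bin/sh
# Sketch: print the batch-system part of a job script.
# job_header is a hypothetical helper, not a farm tool.
job_header() {
    h_cpu=$1     # CPU time request, e.g. 00:29:00
    h_vmem=$2    # memory request, e.g. 850M
    project=$3   # project name, e.g. amanda
    # "\$" keeps a literal dollar sign in the generated "#$" lines.
    cat <<EOF
#!/bin/zsh
#\$ -S /bin/zsh
#\$ -l h_cpu=$h_cpu
#\$ -l h_vmem=$h_vmem
#\$ -j y
#\$ -P $project
EOF
}

# Print a header with the values used in the example above.
job_header 00:29:00 850M amanda
```

The ordinary shell part of the job would be appended after this header.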
What every script needs: advice on requesting resources
CPU time, 3 queues:
● Short: 30 min
● Medium: 12 hours
● Long: 48 hours
The exact time requested within a queue's range makes no difference, so always give the upper limit (e.g. 29 min).
If a job lasts longer than requested, it will crash. Time your scripts before submitting!
Use the short queue, it is usually empty!
Memory:
● Requesting less memory gives your job higher priority.
● If you request less memory than needed, your job will crash.
Always test!
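The queue advice above can be turned into a small lookup, sketched here under some assumptions: the helper name `pick_h_cpu` is hypothetical, the run time is taken in whole minutes from a timed test run, and requesting one minute below each queue limit simply generalizes the "29 min" example to the medium and long queues.

```shell
#!/bin/sh
# Sketch: choose the h_cpu request from a measured run time in minutes.
# Queue limits (30 min, 12 h, 48 h) are from the text; the values just
# below each limit follow the "29 min" example and are an assumption.
pick_h_cpu() {
    minutes=$1
    if [ "$minutes" -le 29 ]; then
        echo "00:29:00"      # short queue
    elif [ "$minutes" -le 719 ]; then
        echo "11:59:00"      # medium queue
    elif [ "$minutes" -le 2879 ]; then
        echo "47:59:00"      # long queue
    else
        echo "job too long, split it into smaller jobs" >&2
        return 1
    fi
}

pick_h_cpu 12     # fits the short queue
pick_h_cpu 200    # needs the medium queue
```

The result would go into the `#$ -l h_cpu=...` line of the job script.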
Shell script part
| Shell script | Comment |
| --- | --- |
| `SIG_EVT=$1 ...` | input parameters |
| `hostname; date` | some info you want in the stdout file |
| `cd $TMPDIR` | always $TMPDIR, NOT /tmp! |
| `cp .../infile ./` | fetch input file(s) |
| `your_program` | run the actual job, output to $TMPDIR |
| `cp outfile /lustre/...` | store output file(s) |
your_program example:
${WORK}/IceRec_v03/build64/env-shell.sh my_analysis $START $END $SIG_EVT $TIME
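The job body described above might look roughly like the following sketch. Several parts are assumptions so that it can be tried outside the farm: the function name `run_job`, the fallback to `mktemp -d` when `$TMPDIR` is not set, the use of `uname -n` in place of `hostname`, and the one-line stand-in for fetching input and running `your_program`.

```shell
#!/bin/sh
# Sketch of the shell-script part of a job.
# run_job is a hypothetical name; the body runs in a subshell "( ... )"
# so that the cd does not change the caller's directory.
run_job() (
    sig_evt=$1                          # input parameter
    outdir=$2                           # absolute path for the results
    uname -n; date                      # node name and time for the stdout file

    # On the farm the batch system sets $TMPDIR; the fallback is an assumption.
    workdir=${TMPDIR:-$(mktemp -d)}
    cd "$workdir" || exit 1

    # Stand-in for "cp .../infile ./" followed by your_program.
    echo "processed $sig_evt events" > outfile

    # Copy the result back once, at the end of the job.
    cp outfile "$outdir/outfile_$sig_evt"
)

# Try it with a temporary directory standing in for the Lustre target.
demo_out=$(mktemp -d)
run_job 1000 "$demo_out"
ls "$demo_out"
```

All intermediate files stay on the local disk, and only the final copy touches the shared file system.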
Batch commands
● qsub: submit a job
● qstat: show running/waiting jobs
  ● qstat -u jlbazo
  ● qstat -ext -u jlbazo (extended information, e.g. project)
● qhost: show status of execution hosts
● qdel: delete submitted jobs
  ● qdel job_ID (delete job)
  ● qdel -u jlbazo (delete all jobs from user)
Example:
[oreade30] ~ % qstat -u jlbazo
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------
4095495 0.50003 JobAnalysi jlbazo r 11/08/2010 11:38:32 [email protected] 1
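For a quick overview of many jobs, the `qstat` listing can be summarized per job state. The sketch below assumes the column layout of the `qstat -u` example above (two header lines, state in the fifth column). The helper name `count_states` is hypothetical, and the sample rows, including the two waiting jobs and the `q@host` queue name, are made up to stand in for real output.

```shell
#!/bin/sh
# Sketch: summarize "qstat -u <user>" output by job state.
# Assumes two header lines and the state in column 5.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the farm this would be:  qstat -u jlbazo | count_states
# Here an invented sample stands in for real qstat output.
count_states <<'EOF'
job-ID  prior   name       user   state submit/start at     queue  slots
------------------------------------------------------------------------
4095495 0.50003 JobAnalysi jlbazo r     11/08/2010 11:38:32 q@host 1
4095496 0.50003 JobAnalysi jlbazo qw    11/08/2010 11:40:01        1
4095497 0.50003 JobAnalysi jlbazo qw    11/08/2010 11:40:02        1
EOF
```

For the sample this reports two jobs in state `qw` (waiting) and one in state `r` (running).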
Monitoring the farm activity
Useful script from Adam: myjobs.awk (you can copy it from http://www-zeuthen.desy.de/~jlbazo/myjobs.awk)
alias myjobs="qstat -g d|awk -f ~/myjobs.awk"
It gives a fast look at the farm usage by others and at your own jobs.
Monitoring and ACcounting in the SGE BATch Farm: https://www-zeuthen.desy.de/dv-bin/batch/stat/sge
[Plots: farm usage in October and in 2010]
Batch Scheduler Strategy
Scheduling order
● SGE assigns tickets to each job.
● Job with most tickets is sent first.
● If requested resources are not available, next job in turn is tried.
● Project has big influence on scheduling policy.
● Number of tickets depends on the resources requested (mem, cpu).
● cpu parameter has a much bigger weight than mem parameter.
Request fewer resources for faster scheduling (short queue, less memory).
When in need, use project z_nuastr, but think about others.
Batch final recommendations
● Always send a few test jobs first
● Make sure you have sufficient filesystem quota for all job output
● Avoid:
  ● jobs writing the same file
  ● too many jobs working in the same directory
  ● writing too much to stdout/stderr
● Usually, transfer data at beginning/end of job only
● Most of the time, work on the local disk ($TMPDIR)
● Avoid mass failures, they cause mail storms
● Most common mistake: failure to request resources
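One way to keep jobs from writing the same file is to put the job number into every output name. The sketch below relies on `JOB_ID` and `SGE_TASK_ID`, which SGE sets in the environment of a running job. The helper name `unique_name`, its arguments, and the fallback values used outside the farm are assumptions.

```shell
#!/bin/sh
# Sketch: build an output file name that is unique per job and array task.
# unique_name is a hypothetical helper; the fallbacks only matter when the
# script is tried outside the batch system.
unique_name() {
    base=$1
    ext=$2
    echo "${base}_${JOB_ID:-nojob}_${SGE_TASK_ID:-0}.${ext}"
}

# Inside a job this prints something like result_<job id>_<task id>.root
unique_name result root
```

The same name can be used for the copy back to shared storage at the end of the job.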
Resources
● Farm info: https://dvinfo.ifh.de/Batch_System_Usage
● Introduction to DESY computing:
  ● http://www-zeuthen.desy.de/~wiesand/intro/intro10p1.pdf
  ● http://www-zeuthen.desy.de/~wiesand/intro/intro10p2.pdf