
HPC Documentation, Release 0.0

HPC group

September 28, 2016


Contents

1 News and Events

2 Getting help

3 General information

4 Development of applications on Stallo

5 Applications

CHAPTER 1

News and Events

1.1 News and notifications

News about planned/unplanned downtime, changes in hardware, and important changes in software will be published on the HPC UiT Twitter account https://twitter.com/hpc_uit and on the login screen of Stallo. For more information on the different situations, see below.

1.1.1 System status and activity

You can get a quick overview of the system load on Stallo on the Notur hardware page. More information on the system load, queued jobs, and node states can be found on the jobbrowser page (only visible from within the UiT network).

1.1.2 Planned/unplanned downtime

In the case of a planned downtime, a reservation will be made in the queuing system so that new jobs that would not finish before the downtime will not start. If the system goes down, running jobs will be restarted if possible. We apologize for any inconvenience this may cause.

1.2 Status of migration to the SLURM resource manager

Our plan is to replace the current resource management system with SLURM during this autumn. This is the same system that is currently used on Abel, the UiO supercomputer.

1.2.1 Reason

One reason for the switch is that the resource management system we currently use, the Torque/Maui scheduling system, is no longer actively maintained. The other reason is to make the user experience across Norwegian and Nordic HPC systems more uniform and homogeneous.

1.2.2 Timeframe

The migration process depends on the integration of the accounting database with the new SLURM scheduling engine.


1.2.3 Migration process

Details of the migration process will be outlined here well in advance of the planned switch.


CHAPTER 2

Getting help

2.1 Frequently asked questions

2.1.1 General

CPU vs. core / processor core.

How should I interpret the term CPU in this documentation?

In this documentation we frequently use the term CPU, which in most cases is equivalent to the more precise term processor core / core. The multi-core age is here now :-)

I am getting hundreds of e-mails in a few minutes, what do I do?

Help - I am getting the same e-mail every 30 seconds. I have already received more than 100 of them!!!

If you are receiving a lot of e-mails looking something like this:

PBS Job Id: 52237.stallo.uit.no
Job Name:   Ni_II_PP_PPhMe2_C2_S00
job deleted
Job deleted at request [email protected]
MOAB_INFO:  job exceeded wallclock limit

You can stop the job by running the command:

qsig -s NULL <job_id>

2.1.2 Account usage and logging in

How can I check my disk quota and disk usage?

How large is my disk quota, and how much of it have I used?

To check how large your disk quota is, and how much of it you have used, you can use the following command:

quota -s

Only the $HOME and $PROJECT disks have quotas.


How can I get information about my account?

How can I get information on how many CPU hours are used and left on my account?

For a simple summary, you can use the command ‘cost‘.

For more details, you can use the command ‘gstatement‘ :

gstatement --hours --summarize -p PROSJEKT -s YYYY-MM-DD -e YYYY-MM-DD

For a detailed overview over usage you can use:

gstatement --hours -p PROSJEKT -s YYYY-MM-DD -e YYYY-MM-DD

For more options see

gstatement --help

How do I change my password on stallo?

The passwd command does not seem to work. My password is reset back to the old one after a while. Why is this happening?

The Stallo system is using a centralised database for user management. This will override the password changes done locally on Stallo.

The password can be changed on the password metacenter page; log in using your username on Stallo and the NOTUR domain.

Exporting the display from a compute node to my desktop?

How can I export the display from a compute node to my desktop?

If you need to export the display from a compute node to your desktop, you should

1. first login to Stallo with display forwarding,

2. then reserve a node, with display forwarding, through the queuing system.

Below is an example on how you can do this:

ssh -Y stallo.uit.no                    1) Log in on Stallo with display forwarding.
qsub -lnodes=1,walltime=1:0:0 -I -X     2) Reserve and log in on a compute node with display forwarding.

This example assumes that you are running an X server on your local desktop, which should be available for most users running Linux, Unix and Mac OS X. If you are using Windows you must install an X server on your local PC.

How can I access a compute node from the login node?

How can I access a compute node in the cluster from the login node?

Log in to stallo.uit.no and type e.g.:

ssh compute-1-3

or use the shorter version:

ssh c1-3


My ssh connections are dying / freezing.

How to prevent your ssh connections from dying / freezing.

If your ssh connections more or less randomly are dying / freezing, try to add the following to your local ~/.ssh/config file:

ServerAliveCountMax 3
ServerAliveInterval 10

(Local means that you need to make these changes on your own computer, not on Stallo.)

The above config is for OpenSSH. If you are using PuTTY you can take a look at this page explaining keepalives for a similar solution.

2.1.3 Queue system and running jobs

Where can I find an example of job script?

You can find an example with comments in Example job script

Relevant application-specific examples (also for beginning users) for a few applications can be found in Application guides.

When will my job start?

How can I find out when my job will start?

To find out approximately when the job scheduler thinks your job will start, use the command:

showstart <job_id>

This command will give you information about how many CPUs your job requires, for how long, as well as when approximately it will start and complete. It must be emphasized that this is just a best guess; queued jobs may start earlier because running jobs finish before they hit the walltime limit, and jobs may start later than projected because new jobs are submitted that get higher priority.

How can I see the queuing situation?

How can I see how my jobs are doing in the queue, if my jobs are idle, blocked, running?

On the webpage http://stallo-login1.uit.no/jobbrowser/ you can find information about the current load on Stallo, some information about the nodes, and the information you would get from the showq command, which is described below. You can also find information about your job, and if the job is running, you can find graphs about its usage of the CPUs, memory and so on.

If you prefer to use the command line, to see the job queue use the command 'showq':

showq

This command gives you a list of running jobs, idle jobs and blocked jobs. Each line in the list gives you the job id, which user is running the job, the number of CPUs it is using, time remaining and start time of the job. The list is sorted by remaining time for the jobs.

To get more information about the command:


showq -h

You can also use the command

qstat

which gives you a list of jobs on Stallo, sorted by jobid. This command also has man-pages:

man qstat

Example of the ‘qstat’ output:

Job id      Name        User   Time Use  S  Queue
58.mangas   rc5des.sh   royd   00:07:46  R  normal
59.mangas   rc5des.sh   royd   00:07:46  R  normal

where:

• Job id - The name of the job in the queueing system

• Name - The name of the script you give to qsub

• User - The job owner

• Time Use - The walltime of the job

• S - The state of the job, R - running, Q - waiting, S - suspended

• Queue - The queue where the job is running

How can I view the output of my job?

Use the 'qpeek' command.

The 'qpeek' command can list your job output while it is running. It behaves much like the 'tail' command, so using '-f' will display the output as it is written:

$ qpeek -f <job_id>
1.0
2.0
3.0
4.0

See 'qpeek -h' for more info. You can also submit your job with the '-k oe' flag; then the standard error and standard output of your job will be put in your home directory. See 'man qsub' for more details.

What is the maximum memory limit for a job?

What is the absolute maximum memory I can set for a job on Stallo?

Stallo has 50 nodes with 32 GB memory each, while the rest have 16 GB.

So 32 GB is the maximum memory limit you can ask for in a job running on a single node.

(You can of course have a job running on multiple nodes, using up to 32 GB of memory on each node.)

For instance, if you want to use 4 CPUs with 4 GB of memory:

-lnodes=1:ppn=4,pmem=4gb

If you want to use only one CPU, you can ask for all its memory this way:


-lnodes=1:ppn=1,pmem=16gb
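As a sketch of how this fits into a complete job script (the walltime value here is only an illustration, not a recommendation), the memory request is simply combined with the other resource directives:

#PBS -lnodes=1:ppn=4,pmem=4gb
#PBS -lwalltime=01:00:00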

How much memory have I reserved, and how much am I using?

How can I see how much memory my jobs have reserved, and how much they are using?

You can check the memory of your job with the command 'checkjob your_jobnumber'. Then you will get out something like this:

[torj@stallo-2 ~]$ checkjob 24614

checking job 24614

State: Running
Creds:  user:torj  group:torj  account:uit-staff-000  class:default  qos:default
WallTime: 23:42:21 of 17:02:00:00
SubmitTime: Wed Sep  6 10:16:31
  (Time Queued  Total: 00:16:42  Eligible: 00:16:42)

StartTime: Wed Sep  6 10:33:13
Total Tasks: 4

Req[0]  TaskCount: 4  Partition: DEFAULT
Network: [NONE]  Memory >= 1000M  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 1000M
Allocated Nodes: [compute-3-3.local:4]

IWD: [NONE]  Executable: [NONE]
Bypass: 10  StartCount: 1
PartitionMask: [ALL]
Flags:       BACKFILL RESTARTABLE

Reservation '24614' (-23:42:32 -> 16:02:17:28  Duration: 17:02:00:00)
PE:  4.00  StartPriority:  384946

MEM: 1000M shows that the job has reserved 1000MB of memory per “task”.

[compute-3-3.local:4] means that the job runs on compute-3-3 and is using 4 CPUs. You can then log in to this node (with the command 'ssh -Y c3-3') and see how much memory this job is using. Use the command 'top'; the column RES shows the memory usage.

How can I get more information about my job?

How can I see when my queued job will start, when it will finish, why it is blocked and so on?

On the webpage http://stallo-login1.uit.no/jobbrowser/showq you can find a list of running jobs, idle jobs and blocked jobs (the same information you would get from the showq command). Each line in the list gives you the job id, which user is running the job, the number of CPUs it is using, time remaining and start time of the job. The list is sorted by remaining time for the jobs. If you find your job there and click on the id number, you can also find more information about your job. And if the job is running, you can find graphs about its usage of the CPUs, memory and so on.

If you prefer the command line, you should use this command:

checkjob <job_id>

You can also get more info about your job parameters using this command:


qstat -f <job_id>

How can I kill my job?

How can I delete my job from the queueing system?

To terminate one of your jobs use qdel, for instance:

qdel 58

This will delete job number 58 in the queueing list that you get when using the command showq.

How can I kill all my jobs?

I want to get rid of all my jobs in a hurry.

Use a little shell trickery:

qselect -u $USER | xargs qdel

qselect prints out a job list based on specific criteria; xargs takes multi-line input and runs the command you give to it repeatedly until it has consumed the input list.

Delete all your running jobs:

qselect -u $USER -s R | xargs qdel

Delete all your queued jobs:

qselect -u $USER -s Q | xargs qdel

(xargs is your friend when you want to do the same thing to a lot of items.)

I am not able to kill my job!!!


If you want to kill a job, and the normal way of doing it, qdel <job_id>, does not work, you should try the following command:

qsig -sNULL <job_id>

This should always work. However, if it doesn't work you have to contact the support personnel: [email protected]

Which nodes are busy, and which are idle?

In order to get information about the status of the nodes in the cluster, use the following command:

diagnose -n

However, the list is rather long... :-)

You can also have a look at http://stallo-login1.uit.no/jobbrowser/nodes


How can I run a job on specific nodes?

If you want to run a job on specific nodes you can queue for only these nodes.

Example: to apply for nodes c2-1 and c2-2:

qsub -lnodes=c2-1:ppn=8+c2-2:ppn=8

How do I exclude a node from running a job?

Sometimes one needs to be able to exclude a node from running a job.

Sometimes it is useful to exclude a specific node from running your jobs. This can be due to hardware or software problems on that node, for instance if the node seems to have problems with the interconnect.

The simplest way to do this is to submit a dummy job to this node:

echo sleep 600 | qsub -lnodes=cX-Y:ppn=8,walltime=1000

This job will then run a sleep job for 600 seconds, and you can submit your real job afterwards; it will run on other nodes. This will cost you some cpu hours off your quota, but let us know and we will refund this to you later.

Please let us know if you think some nodes have problems.

How can I run in parallel without using MPI

If you have many independent jobs which you wish to run simultaneously, you can use pbsdsh.

Here is an example of how to use one parallel job to launch 4 sequential jobs (job0, job1, job2, job3). The "MasterJob.sh" launches "SingleJobCaller.sh" which runs "jobN".

MasterJob.sh:

#!/bin/bash

#PBS -lnodes=4
#PBS -lwalltime=00:05:00

pbsdsh $HOME/test/SingleJobCaller.sh

The executable script SingleJobCaller.sh ($PBS_VNODENUM is an environment variable which returns the rank of the job):

#!/bin/bash
$HOME/test/job${PBS_VNODENUM}.sh

The executable script job0.sh (and similar for job1.sh, job2.sh, and job3.sh):

#!/bin/bash
echo I am job 0

The output of MasterJob.sh would be like:

I am job 0
I am job 1
I am job 2
I am job 3

The four jobs have been run in parallel (simultaneously).


How can I see the processes belonging to my job?

Use the 'qps' command.

The qps command can give you the process list belonging to your job:

$ qps 12345

*** 12345 user procs=5 name=dalpar.pbs
compute-9-1  08:12:17  RLl  dalpar.x-4(mpi:[email protected])
compute-9-5  08:16:39  RLl  dalpar.x-2(mpi:[email protected])
compute-9-5  08:14:08  RLl  dalpar.x-3(mpi:[email protected])
compute-9-7  08:14:12  RLl  dalpar.x-1(mpi:[email protected])
compute-9-9  07:45:55  RLl  dalpar.x-0(mpi:[email protected])

See qps -h for more info.

Remark: The qps command assumes you have only one job per node, so if you have multiple jobs on a node it will show all of the processes belonging to you on that node.

Faster access for testing and debugging

A trick that may give you faster access when performing small test runs in the queuing system.

Removing the parameter "ib" (if present) from the following line in your job script

#PBS -lnodes=2:ppn=8:ib (Remove ":ib")

will most probably make your test jobs start immediately, since it is most likely that some of the "non-Infiniband nodes" are available. If your program has a demanding communication pattern, this will of course slow down your program.

The "ib" option above tells the system that this job needs to use the Infiniband interconnect. If ":ib" is omitted, the job may run on either the Infiniband interconnect or the Gigabit Ethernet, but you don't know in advance. If your application doesn't need Infiniband you should omit this option, to avoid being stuck in the queuing system waiting for Infiniband-powered nodes to become available.

Creating dependencies between jobs

See the description of the -Wdepend option in the qsub manpage.

How can I submit many jobs in one command?

Use job arrays:

qsub -t 1-16 Myjob

will send Myjob 16 times into the queue. The instances can be distinguished by the value of the environment variable

$PBS_ARRAYID
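As a minimal sketch (the program name and input file naming are hypothetical), a job array script could use this variable to give each array member its own input and output files:

#!/bin/bash
#PBS -lnodes=1:ppn=1
#PBS -lwalltime=01:00:00

cd $PBS_O_WORKDIR

# Each array member processes its own input, e.g. input_1.dat ... input_16.dat
./myprogram input_${PBS_ARRAYID}.dat > output_${PBS_ARRAYID}.log

Submitting this with 'qsub -t 1-16 Myjob' then starts 16 instances, one per input file.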

2.1.4 Running many short tasks

Recommendations on how to run a lot of short tasks on the system. The overhead in the job start and cleanup makes it impractical to run thousands of short tasks as individual jobs on Stallo.


Background

The queueing setup on Stallo, or rather the accounting system, generates overhead in the start and finish of a job of about 1 second at each end of the job. This overhead is insignificant when running large parallel jobs, but creates scaling issues when running a massive amount of shorter jobs. One can consider a collection of independent tasks as one large parallel job, and the aforementioned overhead becomes the serial or unparallelizable part of the job. This is because the queuing system can only start and account one job at a time. This scaling problem is described by Amdahl's Law.

Without going into any more details, let’s look at the solution.

Running tasks in parallel within one job.

By using some shell trickery one can spawn and load-balance multiple independent tasks running in parallel within one node: just background the tasks and poll to see when some task is finished before you spawn the next:

#!/usr/bin/env bash

# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: [email protected]
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#PBS -lnodes=1:ppn=20

# We assume we will be done in 10 minutes:
#PBS -lwalltime=00:10:00

# Let us use all CPUs:
maxpartasks=$PBS_NP

# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)

cd $PBS_O_WORKDIR

for t in $tasks; do
    # Do the real work, edit this section to your liking.
    # remember to background the task or else we will
    # run serially
    ./dowork.sh $t &

    # You should leave the rest alone...
    #
    # count the number of background tasks we have spawned
    # the jobs command prints one line per task running so we only need
    # to count the number of lines.
    activetasks=$(jobs | wc -l)

    # if we have filled all the available cpu-cores with work we poll
    # every second to wait for tasks to exit.
    while [ $activetasks -ge $maxpartasks ]; do
        sleep 1
        activetasks=$(jobs | wc -l)
    done
done

# Ok, all tasks spawned. Now we need to wait for the last ones to
# be finished before we exit.
echo "Waiting for tasks to complete"
wait
echo "done"

And here is the dowork.sh script:

#!/usr/bin/env bash

# fake some work, $1 is the task number
# change this to whatever you want to have done

# sleep between 0 and 10 secs
let sleeptime=10*$RANDOM/32768

echo "task $1 is sleeping for $sleeptime seconds"
sleep $sleeptime
echo "task $1 has slept for $sleeptime seconds"

2.1.5 MPI

My MPI application runs out of memory.

The OpenMPI library sometimes consumes a lot of memory for its communication buffers. This seems to happen when an application does a lot of small sends and receives. How can this be limited?

You can limit the communication buffer counts and sizes using the mca parameters; this has helped for some applications that have experienced this problem.

For instance to limit the number of buffers used to 1000 you set this on the command line:

mpirun --mca btl_openib_free_list_max 1000 mpi-application.x

It is recommended to experiment with this to obtain optimal values, as setting it too low will decrease performance.

For more information about openmpi tunables see the open-mpi page.

My mpi job crashes with a retry limit exceeded error.

Sometimes an MPI job will fail saying that its retry limit has been exceeded.

If you get something like this from an mpi job:

[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] from c12-3 to: c13-3 error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 57073528 opcode 42

This means that you have hit a problem with the communication over the Infiniband fabric between two nodes. This may or may not be related to problems with the Infiniband network.

To work around this problem you can try to avoid the problematic node(s). In the error message above it seems to be the receiver, c13-3, that is causing the problem, so in this case I would run a dummy job on that node and try to


resubmit the failing job. See How do I exclude a node from running a job? to find out how to run a dummy job on a specific node.

If this does not help, send us a problem report.

Tuning OpenMPI performance - non default settings on Stallo.

We have changed one setting on Stallo to limit the memory consumption of the OpenMPI library. The following is set as the global default:

btl_openib_free_list_max = 1000

This setting can be changed on the mpirun command-line:

mpirun --mca btl_openib_free_list_max 1000 ..........

or by setting it in the user's personal config file, $HOME/.openmpi/mca-params.conf. See /opt/openmpi/1.2.4/etc/openmpi-mca-params.conf for the syntax.

The value of 1000 might be too low for some applications; you can try to set it to -1 to remove this limit if your application runs into problems. Please let us know if this influences your application.
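As a sketch of what such a personal config file could contain (the parameter is the one discussed above; the comment and value are just illustrations), the file uses simple 'parameter = value' lines:

# $HOME/.openmpi/mca-params.conf
# Limit the number of OpenIB communication buffers (same effect as the --mca option above)
btl_openib_free_list_max = 1000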

My MPI application hangs on startup.

We have discovered that some applications don't work well with the default OpenMPI version, 1.3.2. Reverting back to v1.2.4 might solve the problem.

Symptoms:

If your application has both of the following symptoms:

1. It works on a low number of nodes, but hangs on startup on higher node counts

2. It works with ethernet (by setting --mca btl ^openib on the mpirun command), but not with infiniband

Then try to revert back to OpenMPI 1.2.4 by setting this before recompiling and in your job script:

module unload openmpi
module load openmpi/1.2.4
module load intel-compiler

Please tell us if this helps for your application. We are trying to get an idea of the extent of this problem. Send a report to [email protected].

2.2 Contact

If you need help, please file a support request via [email protected], and our local team of experts will try to assist you as soon as possible.


Credit: http://xkcd.com/293/

Alternatively you can use http://support.notur.no to submit and track your requests. The necessary user name and password are generated the first time you send an email to [email protected].

2.2.1 About support requests

• Requests will be handled as soon as possible by the support staff.

Please be as detailed as possible about your problem when submitting a support request, as this will help the support personnel to solve your problem faster.

Please do not send support requests directly to staff members. Response time is shorter, and your request does not risk getting lost in a huge mailbox, if you use the support system.

• Working hours for the support staff are Monday-Friday 08:00-16:00, except holidays.


2.3 Support staff

2.3.1 Our vision

Give our users a competitive advantage in the international science race.

Goal

Contribute to making our scientists the most efficient users of High Performance Computing (HPC), by creating the best support environment for any HPC user, and presenting them with the most efficient solutions to support their highest demands in HPC.

Services

The HPC-Group works within three areas:

• Technical and Operational computing resource support to the national program, and local initiatives.

• Basic and Advanced user support to projects and research groups that utilize HPC or are expecting to do so in the future.

• Involvement in national development projects and initiatives, for instance in GRID Computing and new technology areas.

National resource site

Since 2000 UiT has been a Resource Site in the National High Performance Computing Consortium (NOTUR), a joint mission between the four Norwegian Universities in Bergen, Oslo, Trondheim and Tromsø. On January 1st 2005 a new limited company within Uninett (Sigma) was established to administrate the national HPC program. The four Universities will act as Resource Sites as before.

Regional resource site

The HPC group at UiT also has a role to play as a regional resource site within HPC, offering HPC services to other institutions in Northern Norway. A strategic collaboration between UiT and the Norwegian Polar Institute within HPC has been ongoing since 1998.

2.3.2 Our team

• Radovan Bast

• Roy Dragseth

• Steinar Henden

• Dan Jonsson

• Elena Malkin

• Espen Tangen

• Giacomo Tartari


2.4 Where to find us on the campus


CHAPTER 3

General information

3.1 About Stallo

3.1.1 Resource description

Key numbers about the Stallo cluster: compute nodes, node interconnect, operating system, and storage configuration.

                      Aggregated                            Per node
Peak performance      516 Teraflop/s                        332 Gigaflop/s / 448 Gigaflop/s
# Nodes               304 x HP BL460 gen8 blade servers     1 x HP BL460 gen8 blade server
                      376 x HP SL230 gen8 servers           1 x HP SL230 gen8 server
                      144 x HP Apollo 8000 gen8 servers     1 x HP Apollo 8000 gen8 server
# CPUs / # Cores      608 / 4864                            2 / 16
                      752 / 7520                            2 / 20
                      288 / 5760                            2 / 20
Processors            608 x 2.60 GHz Intel Xeon E5 2670     2 x 2.60 GHz Intel Xeon E5 2670
                      752 x 2.80 GHz Intel Xeon E5 2680     2 x 2.80 GHz Intel Xeon E5 2680
                      288 x 2.80 GHz Intel Xeon E5 2680     2 x 2.80 GHz Intel Xeon E5 2680
Total memory          26.2 TB                               32 GB (32 nodes with 128 GB)
Internal storage      155.2 TB                              500 GB (32 nodes with 600 GB raid)
Centralized storage   2000 TB                               2000 TB
Interconnect          Gigabit Ethernet + Infiniband (1)     Gigabit Ethernet + Infiniband (1)
Compute racks         13
Infrastructure racks  2
Storage racks         3

(1) All nodes in the cluster are connected with Gigabit Ethernet and QDR Infiniband.


3.1.2 Stallo - a Linux cluster

This is just a quick and brief introduction to the general features of Linux Clusters.

A Linux Cluster - one machine, consisting of many machines

On one hand you can look at large Linux Clusters as rather large and powerful supercomputers; on the other hand you can look at them as just a large bunch of servers and some storage system(s) connected with each other through a (high speed) network. Both of these views are fully correct, and it is therefore important to be aware of the strengths and the limitations of such a system.

Clusters vs. SMP’s

Until July 2004, most of the supercomputers available to Norwegian HPC users were more or less large Symmetric Multi Processing (SMP) systems, like the HP Superdomes at UiO and UiT, the IBM Regatta at UiB, and the SGI Origin and IBM p575 systems at NTNU.

On SMP systems most of the resources (CPU, memory, home disks, work disks, etc.) are more or less uniformly accessible for any job running on the system. This is a rather simple picture to understand; it is nearly like your desktop machine – just more of everything: more users, more CPUs, more memory, more disks, etc.

On a Linux Cluster the picture is quite different. The system consists of several independent compute nodes (servers) connected with each other through some (high speed) network and maybe hooked up to some storage system. So the HW resources (like CPU, memory, disk, etc.) in a cluster are in general distributed and only locally accessible at each server.

3.1.3 Linux operating system (Rocks): http://www.rocksclusters.org/

Since 2003, the HPC group at UiT has been one of five international development sites for the Linux operating system Rocks. Together with people in Singapore, Thailand, Korea and USA, we have developed a tool that has won international recognition, such as the prize for "Most important software innovation" both in 2004 and 2005 in HPCWire. Now Rocks is a de-facto standard for cluster management in Norwegian supercomputing.

3.1.4 Stallo - Norse mythology

In the folklore of the Sami, a Stallo (also Stallu or Stalo) is a Sami wizard. "The Sami traditions up North differ a bit from other parts of Norwegian traditions. You will find troll and draug and some other creatures as well, but the Stallo is purely Sami. He can change into all kinds of beings; animals, human beings, plants, insects, birds – anything. He can also "turn" the landscape so you miss your direction, or change it so you don't recognise familiar surroundings. Stallo is very rich and smart, he owns silver and reindeer galore, but he always wants more. To get what he wants he tries to trick the Samis into traps, and the most popular Sami stories tell how people manage to fool Stallo." NB! Don't mix Stallo up with the noaide! He is a real wizard whom people still believe in.

3.2 Getting an account

3.2.1 Getting started on the machine (account, quota, password)

Before you start using Stallo, please read the UiT The Arctic University of Norway's Guidelines for use of computer equipment at the UiT The Arctic University of Norway.

A general introduction to Notur can be found at https://www.notur.no/use-notur-resources


3.2.2 How to get an account and a CPU quota on Stallo

To be able to work on Stallo you must have an account and you must have been granted CPU time on the system. There are two ways to achieve this:

Research Council Quota

National users (including users from UiT) may apply for an account and a CPU quota from the Research Council's share of Stallo. If you want to apply for such a quota please use this National quota form.

UiT Quota

"Local users" (i.e. users from UiT and users affiliated with UiT in some way) can apply for an account and a quota from UiT's share of Stallo. If you want to apply for such a quota please use this UiT quota form.

Please note that most users who are allowed to apply for a UiT quota are also allowed to apply for a quota from the Research Council – at the same time!

Before you start using Stallo, please read the UiT The Arctic University of Norway's Guidelines for use of computer equipment at the UiT The Arctic University of Norway.

3.3 Logging in for the first time

3.3.1 Log in with SSH

An SSH client (Secure SHell) is the required tool to connect to Stallo. An SSH client provides secure encrypted communication between two hosts over an insecure network.

If you already have ssh installed on your UNIX-like system, and have a user account and password on a Notur system, logging in may be as easy as typing

ssh <machine name> (for instance: ssh njord.hpc.ntnu.no)

into a terminal window.

If your user name on the machine differs from your user name on the local machine, use the -l option to specify the user name on the machine to which you connect. For example:

ssh <machine name> -l [user name]

And if you need X-forwarding (for instance, if you like to run Emacs in its own window) you must log in like this:

ssh -X -Y <machine name>

Log in with an ssh-key

SSH clients for Windows and Mac

At the OpenSSH page you will find several SSH alternatives for both Windows and Mac.

Please note that Mac OS X comes with its own implementation of OpenSSH, so you don't need to install any third-party software to take advantage of the extra security SSH offers. Just open a terminal window and jump in.


Learning more about SSH

To learn more about using SSH, please also consult the OpenSSH page and take a look at the manual page on your system (man ssh).

3.3.2 Obtain a new password

When you have been granted an account on stallo.uit.no, your username and password are sent to you separately: the username by email and the password by SMS. The password you receive by SMS is temporary and only valid for 7 days.

The passwd command does not seem to work. My password is reset back to the old one after a while. Why is this happening?

The Stallo system is using a centralised database for user management. This will override the password changes done locally on Stallo.

The password can be changed here, log in using your username on stallo and the NOTUR domain.

3.3.3 Logging on the compute nodes

Information on how to log in on a compute node.

Sometimes you may want to log in to a compute node (for instance to check out output files on the local work area on the node), and this is also done using SSH. From stallo.uit.no you can log in to compute-x-y the following way:

ssh -Y compute-x-y (for instance: ssh compute-5-8)

or short

ssh -Y cx-y (for instance: ssh c5-8)

If you don’t need display forwarding you can omit the “-Y” option above.

If you, for instance, want to run a program interactively on a compute node, and with display forwarding to your desktop, you should instead do something like this:

1. first login to Stallo with display forwarding,

2. then reserve a node, with display forwarding, through the queuing system.

Below is an example on how you can do this:

1) Log in on Stallo with display forwarding:

   $ ssh -Y stallo.uit.no

2) Reserve and log in on a compute node with display forwarding (start an interactive job):

   $ qsub -lnodes=1,walltime=1:0:0 -I -X

This example assumes that you are running an X server on your local desktop, which should be available for most users running Linux, Unix and Mac OS X. If you are using Windows you must install an X server on your local PC.


3.3.4 Graphical logon to Stallo

If you want to run applications with graphical user interfaces we recommend using the remote desktop service on Stallo.

Important:

If you are connecting from outside the networks of UNINETT and partners you need to log into stallo-gui.uit.no with ssh to verify that you have a valid username and password on the Stallo system. After a successful login the service will automatically allow you to connect from the IP address your client currently has. The connection will be open for at least 5 minutes after you log in. There is no need to keep the ssh connection open after you have connected to the remote desktop; in fact, you will be automatically logged off after 5 seconds.

3.4 CPU-hour quota and accounting

3.4.1 CPU quota.

To use the batch system you have to have a CPU quota, either local or national. For every job you submit we check that you have sufficient quota to run it, and you will get a warning if you do not have sufficient cpu-hours to run the job. The job will be submitted to the queue, but will not start until you have enough cpu-hours to run it.

3.4.2 Resource charging.

We charge for used resources, both CPU and memory. The accounting system charges for used processor equivalents (PE) times used walltime, so if you ask for more than 2 GB of memory per CPU you will get charged for more than the actual number of CPUs you use.

Processor equivalents.

The best way to describe PE is maybe by example (a small worked example also follows after the list). Assume that you have a node with 8 cpu-cores and 16 GB memory (as most nodes on Stallo are):

if you ask for less than 2GB memory per core then PE will equal the cpu count.

if you ask for 4GB memory per core then PE will be twice the cpu-count.

if you ask for 16 GB memory then PE=8, as you can only run one CPU per compute node.
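To make this concrete, here is a small worked example based only on the rules above (the request string is just an illustration): on such an 8-core / 16 GB node the baseline is 2 GB per core, so a request like -lnodes=1:ppn=4,pmem=4gb asks for 4 cores but 4 x 4 GB = 16 GB of memory, which corresponds to 16 GB / 2 GB = 8 core equivalents. The job would therefore be charged with PE = 8 rather than 4.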

3.4.3 Inspecting your quota.

You can use the cost command to check how much cpu-hours are left on your allocation:

[user@stallo-1 ~]$ cost
Id   Name     Amount     Reserved  Balance   CreditLimit  Deposited  Available  Percentage
---  -------  ---------  --------  --------  -----------  ---------  ---------  ----------
272  nnXXXXk  168836.65  96272.00  72564.65  0.00         290000.00  72564.65   58.22
10   nnYYYYk  4246.04    0.00      4246.04   0.00         150000.00  4246.04    2.83

The column meaning is

Amount: The number of hours available for new jobs.

Reserved: The number of hours reserved for currently running jobs.


Balance: Amount - Reserved.

CreditLimit: Allocated low-pri quota.

Deposited: Allocated normal quota

Inspecting historic use.

You can view the accounting history of your projects using:

gstatement --hours --summarize -s YYYY-MM-DD -e YYYY-MM-DD -p nnXXXXk

for more detail see:

gstatement --man

3.4.4 Live status information

From our monitoring tool Ganglia, you can watch live status information on Stallo:

• Load situation

• Job queue

3.5 Linux command line

3.5.1 New on Linux systems?

This page contains some tips on how to get started using the Stallo cluster at UiT if you are not too familiar with Linux/Unix. The information is intended both for users that are new to Stallo and for users that are new to Linux/UNIX-like operating systems. Please consult the rest of the user guide for information that is not covered in this chapter.

For details about the hardware and the operating system of Stallo, and a basic explanation of Linux clusters, please see the About Stallo part of this documentation.

To find out more about the storage and file systems on the machines, your disk quota, how to manage data, and how to transfer files to/from Stallo, please read the storage section.

ssh

The only way to connect to Stallo is by using ssh. Check the Logging in for the first time in this documentation tolearn more about ssh.

scp

The machines are stand-alone systems. The machines do not (NFS-)mount remote disks. Therefore you must explicitly transfer any files you wish to use to the machine by scp. For more info, see Transferring files to/from Stallo.


Running jobs on the machine

You must execute your jobs by submitting them to the batch system. There is a dedicated section Jobs, scripts and queues in our user guide that explains how to use the batch system. Those pages explain how to write job scripts, and how to submit, manage, and monitor jobs. It is not allowed to run long or large-memory jobs interactively (i.e., directly from the command line).

Common commands

pwd Provides the full pathname of the directory you are currently in.

cd Change the current working directory.

ls List the files and directories which are located in the directory you are currently in.

find Searches through one or more directory trees of a file system, locates files based on some user-specified criteria and applies a user-specified action on each matched file, e.g.:

find . -name 'my*' -type f

The above command searches in the current directory (.) and below it, for files and directories with names starting with my. "-type f" limits the results of the above search to only regular files, therefore excluding directories, special files, pipes, symbolic links, etc. my* is enclosed in single quotes (apostrophes) as otherwise the shell would replace it with the list of files in the current directory starting with "my".

grep Find a certain expression in one or more files, e.g:

grep apple fruitlist.txt

mkdir Create new directory.

rm Remove a file. Use with caution.

rmdir Remove a directory. Use with caution.

mv Move or rename a file or directory.

vi/vim or emacs Edit text files, see below.

less/more View (but do not change) the contents of a text file one screen at a time, or, when combined with other commands (see below), view the result of the command one screen at a time. Useful if a command prints several screens of information on your screen so quickly that you don't manage to read the first lines before they are gone.

| Called “pipe” or “vertical bar” in English. Group 2 or more commands together, e.g.:

ls -l | grep key | less

This will list files in the current directory (ls), retain only the lines of ls output containing the string “key” (grep),and view the result in a scrolling page (less).

More info on manual pages

If you know the UNIX command that you would like to use but not the exact syntax, consult the manual pages on the system to get a brief overview. Use 'man [command]' for this. For example, to get the right options to display the contents of a directory, use 'man ls'. To choose the desired options for showing the current status of processes, use 'man ps'.


Text editing

Popular tools for editing files on Linux/UNIX-based systems are 'vi' and 'emacs'. Unfortunately the commands within both editors are quite cryptic for beginners. It is probably wise to spend some time understanding the basic editing commands before starting to program the machine.

vi/vim Full-screen editor. Use ‘man vi’ for quick help.

emacs Comes by default with its own window. Type 'emacs -nw' to invoke emacs in the active window. Type 'Control-h i' or follow the menu 'Help->manuals->browse-manuals-with-info' for help. 'Control-h t' gives a tutorial for beginners.

Environment variables

The following variables are automatically available after you log in:

$USER   your account name
$HOME   your home directory
$PWD    your current directory

You can use these variables on the command line or in shell scripts by typing $USER, $HOME, etc. For instance: 'echo $USER'. A complete listing of the defined variables and their meanings can be obtained by typing 'printenv'.

You can define (and redefine) your own variables by typing:

export VARIABLE=VALUE
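As a small sketch (the variable name and path are only illustrations), you could for instance define a shortcut to a work directory and reuse it:

export PROJDIR=/global/work/$USER/myproject
mkdir -p $PROJDIR
cd $PROJDIR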

Aliases

If you frequently use a command that is long and has, for example, many options to it, you can put an alias (abbreviation) for it in your ~/.bashrc file. For example, if you normally prefer a long listing of the contents of a directory with the command 'ls -laF | more', you can put the following line in your ~/.bashrc file:

alias ll='ls -laF | more'

You must run 'source ~/.bashrc' to update your environment and to make the alias effective, or log out and in :-). From then on, the command 'll' is equivalent to 'ls -laF | more'. Make sure that the chosen abbreviation is not already an existing command, otherwise you may get unexpected (and unwanted) behavior. You can check the existence and location of a program, script, or alias by typing:

which [command]
whereis [command]

~/bin

If you frequently use a self-made or self-installed program or script that you use in many different directories, you can create a directory ~/bin in which you put this program/script. If that directory does not already exist, you can do the following. Suppose your favorite little program is called 'myscript' and is in your home ($HOME) directory:

mkdir -p $HOME/bin
cp myscript $HOME/bin
export PATH=$PATH:$HOME/bin

PATH is a colon-separated list of directories that are searched in the order in which they are specified whenever you type a command. The first occurrence of a file (executable) in a directory in this PATH variable that has the same name as the command will be executed (if possible). In the example above, the 'export' command adds the ~/bin directory to


the PATH variable, and any executable program/script you put in the ~/bin directory will be recognized as a command. To add the ~/bin directory permanently to your PATH variable, add the above 'export' command to your ~/.bashrc file and update your environment with 'source ~/.bashrc'.

Make sure that the names of the programs/scripts are not already existing commands, otherwise you may get unexpected (and unwanted) behaviour. You can check the contents of the PATH variable by typing:

printenv PATH
echo $PATH

3.6 Storage and backup

3.6.1 Available file system

Stallo has a three-fold file system:

• global accessible home area (user area): /home (64 TB)

• global accessible work or scratch area: /global/work (1000 TB)

• local accessible work or scratch area on each node: /local/work (~450 GB)

3.6.2 Home area

The file system for user home directories on Stallo is a 64 TB global file system, which is accessible from both the login nodes and all the compute nodes. The default size of the home directory for each user is 300 GB. If more space is needed for permanent storage, users have to apply for it. Please contact the system administrators, [email protected], for more information about this.

The home area is for "permanent" storage only, so please do not use it for temporary storage during production runs. Jobs using the home area for scratch files while running may be killed without any warning.

3.6.3 Work/scratch areas

There are two different work/scratch areas available on Stallo:

• There is a 1000 TB globally accessible work area on the cluster. This is accessible from both the login nodes and all the compute nodes as /global/work. This is the recommended work area, both because of size and performance! Users can stripe files themselves as this file system is a Lustre file system.

• In addition, each compute node has a small work area of approximately 450 GB, only locally accessible on each node. This area is accessible as /local/work on each compute node. In general we do not recommend using /local/work, both because of (the lack of) size and performance, however for some users this may be the best alternative.

These work areas should be used for all jobs running on Stallo.

After a job has finished, old files should be deleted; this is to ensure that there is plenty of available space for new jobs. Files left on the work areas after a job has finished may be removed without any warning.

There is no backup of files stored on the work areas. If you need permanent storage of large amounts of data, please contact the system administrators: [email protected]


3.6.4 Disk quota and accounting

Each user's home directory has a default quota of 300 GB. If more space is needed for permanent storage, users have to apply for it. Please contact the system administrators, [email protected], for more information about this. Disk quota is not supported on the work/scratch areas. Please use common courtesy and keep your work/scratch partitions clean. Move all files you do not need on Stallo elsewhere or delete them. Since overfilled work/scratch partitions can cause problems, files older than 14 days are subject to deletion without any notice.

3.6.5 What area to use for what data

/home should be used for storing tools, like application sources, scripts, or any relevant data which must have a backup.

/work/ should be used for running jobs, as the main storage during data processing. All data after processing must be moved out of the machine or deleted after use.

3.6.6 Policies for deletion of temporary data

/global/work/ has no backup, and files older than 14 days are subject to deletion without any notice. /local/work/ has no backup, and files belonging to users other than the one that runs a job on the node will be deleted.

Since this deletion process (as well as a high disk usage percentage) will take away disk performance from the running jobs, the best solution is of course for you to remember to clean up after each job.

3.6.7 Backup

There is no real backup of the data on Stallo. However, we do keep daily snapshots of /home and /project for the last 7 days. The /home snapshots are kept at /global/hds/.snapshot/

There is no backup of files stored on the /work areas. If you need permanent storage of large amounts of data, or if you need to restore some lost data, please contact the system administrators: [email protected]

3.6.8 Archiving data

Archiving is not provided. However you may apply for archive space on Norstore.

3.6.9 Closing of user account

User accounts on Stallo are closed on request from Uninett Sigma or the project leader. The account is closed in a way so that the user no longer can log in to Stallo.

If the user has data needed by other people in the group all data on /home/ is preserved.

3.6.10 Privacy of user data

General privacy

There are a couple of things you as a user can do to minimize the risk of your data and account on Stallo being read/accessed from the outside world.

1. Your account on Stallo is personal, do not give away your password to anyone, not even the HPC staff.

2. If you have configured ssh-keys on your local computer, do not use passphrase-less keys for accessing Stallo.


By default a new account on Stallo is readable for everyone on the system. That is both /home/ and /global/work/

This can easily be changed by the user using the command chmod. The chmod command has a lot of "cryptic" combinations of options (click here for a more in-depth explanation). The most commonly used are:

• only the user can read their home directory:

chmod 700 /home/$USER

• User and their group can read and execute files on the home directory:

chmod 750 /home/$USER

• User and all others including the group can read and execute the files:

chmod 755 /home/$USER

• everybody can read, execute, and WRITE to the directory:

chmod 777 /home/$USER

3.6.11 Management of large files (> 200 GB)

Some special care needs to be taken if you want to create very large files on the system. By large we mean file sizes over 200 GB.

The /global/work file system (and /global/home too) is served by a number of storage arrays that each contain smaller pieces of the file system; the size of the chunks is 2 TB (2000 GB) each. In the default setup each file is contained within one storage array, so the default file size limit is thus 2 TB. In practice the file limit is considerably smaller, as each array contains a lot of files.

Each user can change the default placement of the files they create by striping files over several storage arrays. This is done with the following command:

lfs setstripe -c 4 .

After this has been done, all new files created in the current directory will be spread over 4 storage arrays, each holding 1/4th of the file. The file can be accessed as normal; no special action needs to be taken. When the striping is set this way it will be defined on a per-directory basis, so different directories can have different stripe setups in the same file system; new subdirectories will inherit the striping from their parent at the time of creation.

We recommend users to set the stripe count so that each chunk will be approx. 200-300 GB, for example:

File size      Stripe count   Command
500-1000 GB    4              lfs setstripe -c 4 .
1 TB - 2 TB    8              lfs setstripe -c 8 .

Once a file is created, the stripe count cannot be changed. This is because the physical bits of the data are already written to a certain subset of the storage arrays. However, the following trick can be used after one has changed the striping as described above:

$ mv file file.bu
$ cp -a file.bu file
$ rm file.bu

The use of the -a flag ensures that all permissions etc. are preserved.
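If you want to check what striping a directory or file actually has, the Lustre tools also provide a query command (a sketch; the path is just an example):

lfs getstripe /global/work/$USER/mydata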


3.6.12 Management of many small files (> 10000)

The file system on Stallo is designed to give good performance for large files. This has some impact if you have many small files.

If you have thousands of files in one directory, basic operations like 'ls' become very slow; there is nothing to do about this. However, directories containing many files may cause the backup of the data to fail. It is therefore highly recommended that if you want backup of the files, you use 'tar' to create one archive file of the directory.
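As a minimal sketch (the directory name is just an illustration), such an archive can be created, and later unpacked again, with:

tar -czf mydata.tar.gz mydata/
tar -xzf mydata.tar.gz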

3.6.13 Compression of data

Infrequently accessed files must be compressed to reduce file system usage.

Tools like gzip, bzip2 and zip are in the PATH and are available on all nodes. The manual pages for these tools are very detailed; use them for further help:

$ man gzip

3.6.14 Binary data and endianness

Stallo is like all desktop PCs a little endian computer.

At the moment the only big endian machine in NOTUR is njord.hpc.ntnu.no, so Fortran sequential unformatted files created on Njord cannot be read on Stallo.

The best workaround for this is to save your files in a portable file format like netCDF or HDF5.

Both formats are supported on Stallo, but you have to load their modules to use them:

$ module load netcdf

Or:

$ module load hdf5

3.7 Transferring files to/from Stallo

All Notur systems are stand-alone systems, and in general we do not (NFS-)mount remote disks on them. Therefore you must either explicitly transfer any files you wish to use by using sftp or scp, or you can mount your home directory(s) on the Notur systems on your own computer.

3.7.1 Transferring data to/from the system

Only ssh type of access is open to stallo. Therefore to upload or download data only scp and sftp can be used.

To transfer data to and from stallo use the following address:

stallo.uit.no

This address has nodes with 10Gb network interfaces.

Basic tools (scp, sftp)

Standard scp command and sftp clients can be used:


ssh stallo.uit.no
ssh -l <username> stallo.uit.no

sftp stallo.uit.no
sftp <username>@stallo.uit.no
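For copying individual files a plain scp invocation also works (a sketch; the file name and target directory are just illustrations):

scp myfile.dat <username>@stallo.uit.no:/global/work/<username>/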

Mounting the file system on your local machine using sshfs

For linux users:

sshfs [user@]stallo.uit.no:[dir] mountpoint [options]

eg.:

sshfs [email protected]: /home/steinar/stallo-fs/

Windows users may buy and install expandrive.

3.7.2 High-performance tools

OpenSSH with HPN

The default ssh client and server on the Stallo login nodes is the OpenSSH package with applied HPN patches. By using an HPN-patched ssh client on the other end of the data transfer, throughput will be increased.

To use this feature you must have an HPN-patched OpenSSH version. You can check if your ssh client has HPN patches by issuing:

ssh -V

If the output contains the word "hpn" followed by version and release, then you can make use of the high performance features.

Transfers can then be sped up either by disabling data encryption AFTER you have been authenticated or logged into the remote host (NONE cipher), or by spreading the encryption load over multiple threads (using the MT-AES-CTR cipher).

NONE cipher

This cipher has the highest transfer rate. Keep in mind that data after authentication is NOT encrypted, therefore the files can be sniffed and collected unencrypted by an attacker. To use it, add the following to the client command line:

-oNoneSwitch=yes -oNoneEnabled=yes

Anytime the None cipher is used a warning will be printed on the screen:

"WARNING: NONE CIPHER ENABLED"

If you do not see this warning then the NONE cipher is not in use.
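Put together, a transfer using the NONE cipher could look like the following sketch (the file name and target directory are just illustrations):

scp -oNoneSwitch=yes -oNoneEnabled=yes bigfile.tar <username>@stallo.uit.no:/global/work/<username>/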

MT-AES-CTR

If for some reason (e.g. high confidentiality) the NONE cipher can't be used, the multithreaded AES-CTR cipher can be used; add the following to the client command line (choose one of the numbers):

-oCipher=aes[128|192|256]-ctr


or:

-c aes[128|192|256]-ctr

Subversion and rsync

The tools subversion and rsync are also available for transferring files.
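For example, rsync can be used to synchronize a local directory with your work area on Stallo (directory names are placeholders):

rsync -avz mydata/ <username>@stallo.uit.no:/global/work/<username>/mydata/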

3.8 Jobs, scripts and queues

3.8.1 TORQUE vs. SLURM

If you are already used to Torque/Maui (the previous queue system used on Stallo), but not SLURM, you might find this torque_slurm_table useful.

3.8.2 Batch system

The Stallo system is a resource that is shared between many users, so to ensure fair use everyone must do their computations by submitting jobs through a batch system that will execute the applications on the available resources.

The batch system on Stallo is SLURM (Simple Linux Utility for Resource Management).

3.8.3 Create a job

To run a job on the system one needs to create a job script. A job script is a regular shell script (bash) with some directives specifying the number of cpus, memory, etc. that will be interpreted by the batch system upon submission.

For a quick feel for how to create and run batch jobs, see the example job script below.

Example job script

#!/bin/bash
# First example of running a parallel job under the batch system.
#
# allocate a total of 80 processes, since this is both dividable by 16 and 20.
#-----------------------
#SBATCH --job-name=Hello_World
#SBATCH --ntasks=80
# run for ten minutes
#SBATCH --time=0-00:10:00
#SBATCH --mem-per-cpu=1GB
#SBATCH --mail-type=ALL
#-----------------------

module load notur

mkdir -p $SCRATCH
cd $SUBMITDIR

cp hello_c_mpi.x $SCRATCH

cd $SCRATCH

time mpirun ./hello_c_mpi.x > hello.output

cp hello.output $SUBMITDIR
cd $SUBMITDIR

rm -rf $SCRATCH

3.8.4 Manage a job

A job's lifecycle can be managed with as few as three commands, as shown in the example session below:

1. Submit the job with sbatch jobscript.sh.

2. Check the job status with squeue (to limit the display to only your jobs, use squeue -u <user_name>).

3. (optional) Delete the job with scancel <job_id>.
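A typical session could look like this (the job ID shown is just an illustration):

$ sbatch jobscript.sh
Submitted batch job 123456
$ squeue -u <username>
$ scancel 123456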

3.8.5 List of useful commands

Managing jobs

See the man page for each command for details.

sbatch <file_name> Submit jobs. All job parameters can be specified on the command line or in the job script. Command line arguments take precedence over directives in the script.

scancel <job_id> Delete a job.

scontrol hold <job_id> Put a hold on the job. A job on hold will not start or block other jobs from starting until you release the hold.

scontrol release <job_id> Release the hold on a job.

Getting job info

For details run the command with the --help option.

squeue List all jobs. This command can show you a lot of information, including the expected start time of a job. See squeue --help for more information.

squeue -u <username> List all current jobs for a user.

scontrol show jobid -dd <jobid> List detailed information for a job (useful for troubleshooting).

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed Get statistics on completed jobs by job ID. Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.

Get queue and account info

sinfo List partitions. What in Torque/Maui was called queues is now called partitions in Slurm.

sbank balance statement -u Information on available CPU-hours in your accounts.


3.8.6 Useful job script parameters

See the Example job script for a list of relevant parameters.

3.8.7 Recommended job parameters

Walltime

We recommend you to be as precise as you can when specifying the parameters, as they influence how fast your jobs will start to run. We generally have these two rules for prioritizing jobs:

1. Large jobs, that is jobs with high cpucounts, are prioritized.

2. Short jobs take precedence over long jobs.

Cpucount

We strongly advise all users to ask for all cpus on a node when running multinode jobs, that is, submit the job with -lnodes=X:ppn=16. This will make the best use of the resources and give the most predictable execution times. For single node jobs, e.g. jobs that ask for less than 16 cpus, we recommend to use -lnodes=1:ppn={1,2,4,8 or 16}; this will make it possible for the batch system scheduler to fill up the compute nodes completely. A warning will be issued if these recommendations are not followed. See the examples below.
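For example (the walltime and script name are placeholders):

# multinode job, asking for full nodes:
qsub -lnodes=4:ppn=16,walltime=24:00:00 jobscript.sh

# single node job using 8 cpus:
qsub -lnodes=1:ppn=8,walltime=24:00:00 jobscript.sh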

Scalability

You should run a few tests to find the best balance between minimizing runtime and maximizing your allocated cpu-quota. That is, you should not ask for more cpus for a job than you really can utilize efficiently. Try to run your job on 1, 2, 4, 8, 16 cpus and so on to see when the runtime for your job starts tailing off. When you start to see less than 30% improvement in runtime when doubling the cpu-count you should probably not go any further. Recommendations for a few of the most used applications can be found in Application guides.

3.8.8 Queues

In general it is not necessary to specify a specific queue for your job; the batch system will route your job to the right queue automatically based on your job parameters. There are two exceptions to this, the express and the highmem queue.

express: Jobs will get higher priority than jobs in other queues. Submit with qsub -q express .... Limits: Max walltime is 8 hours, no other resource limits, but there are very strict limits on the number of jobs running etc.

highmem: Jobs will get access to the nodes with large memory (128GB). Submit with qsub -q highmem .... Limits: Restricted access, send a request to [email protected] to get access to this queue. Jobs will be restricted to the 32 nodes with 128GB memory.

Other queues

default: The default queue. Routes jobs to the queues below.

short: Jobs in this queue are allowed to run on any nodes, also the highmem nodes. Limits: walltime < 48 hours.

singlenode: Jobs that will run within one compute node will end up in this queue.

multinode: Contains jobs that span multiple nodes.

Again, it is not necessary to ask for any specific queue unless you want to use express or highmem.


3.8.9 Use of large memory nodes

Large memory nodes

Stallo has 32 compute nodes with 128GB memory each (the 272 others have 32GB memory).

To use the large memory nodes you should ask for access to the highmem queue; just send a mail to [email protected]. After being granted access to the highmem queue you can submit directly to the queue:

qsub -q highmem .........

Remark: You only need to apply for access to the large memory nodes if you want to run jobs that have more than a 48 hours walltime limit on these nodes.

Short jobs requiring less than 48 hours runtime can get assigned to the highmem nodes without running in the highmem queue. This can happen if you submit requiring more than 2gb memory per process:

qsub -lnodes=2:ppn=16,pmem=4gb,walltime=12:00:00 .........

3.8.10 Interactive job submission

You can run an interactive job by using the -I flag to qsub:

qsub -I .......

The command prompt will appear as soon as the job starts. If you also want to run a graphical application you must also use the -X flag, as in the example below. Interactive jobs have the same policies as normal batch jobs; there are no extra restrictions.
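For example, an interactive job on one full node with X-forwarding could be requested like this (the walltime value is a placeholder):

qsub -I -X -lnodes=1:ppn=16,walltime=02:00:00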

General job limitations

The following limits are the default per user in the batch system. Users can ask for increased limits by sending a request to [email protected].

Limit                         Value
Max number of running jobs    1024
Maximum cpus per job          2048
Maximum walltime              No limit
Maximum memory per job        No limit (1)

(1) There is a practical limit of 128GB per compute node used.

Remark: Even if we do not impose any limit on the length of jobs on Stallo, we only give a week's warning of system maintenance. Jobs with more than 7 days walltime will be terminated and restarted if possible.

3.8.11 Scheduling policy on the machine

Priority

The scheduler is set up to

1. prioritize large jobs, that is, jobs that request a large number of cpus.

2. prioritize short jobs. The priority is calculated as proportional to the expansion factor: (queue-time + walltime)/walltime.

3. use fairshare, so a user with a lot of jobs running will get a decreased priority compared to other users.


Resource Limits

No user will be allowed to have more than 168 000 cpu-hours allocated for running jobs at any time. This means that a user can at most allocate 1000 cpus for a week for concurrently running jobs (or 500 cpus for two weeks, or 2000 cpus for half a week).

No single user will be allowed to run more than 500 jobs at any time. (You may well submit more, but you cannot have more than 500 running at the same time.)

Users can apply for exceptions to these rules by contacting [email protected].

The Stallo architecture

Before we dive into the details we need to say a few things about the Stallo architecture.

• The Stallo cluster has 304 compute nodes with 16 cpu-cores each, totalling 4864 cpu-cores (hereafter denoted as cpus).

• The Stallo cluster has two different memory configurations: 272 nodes have 32GB memory and 32 nodes have 128GB memory.

• The Stallo cluster has all nodes connected with a high speed network which gives very high throughput and low latency. The network is split into islands with 128 nodes/2048 cpus each, and jobs will run within one single island. This is done automatically by the scheduler.

See About Stallo chapter of the documentation for more details.

Job to node mapping

The basic philosophy for the job to node mapping is to run the job on the nodes best suited for the task.

• Short jobs are allowed to run anywhere. Short jobs are defined as jobs with walltime < 48 hours.

• Large memory jobs with walltime > 48 hours should run in the highmem queue. This queue has restricted access, so the user will need to notify the support team if access to these nodes is needed. Memory usage in this queue will be monitored to prevent misuse.

Examples.

Short jobs:

qsub -lnodes=1:ppn=16,walltime=48:00:00 ........

Will be allowed to run anywhere.

Long jobs:

qsub -lnodes=8:ppn=16,walltime=240:00:00 .........

Will run within one island, but not on the highmem nodes.

Highmem jobs:

qsub -q highmem -lnodes=1:ppn=16,pmem=8gb,walltime=240:00:00 ........

This job will run on the highmem nodes if the user is granted access by the administrators; otherwise it will never start. pmem is memory per process.


3.8.12 Express queue for testing job scripts and interactive jobs.

By submitting a job to the express queue you can get higher throughput for testing and shorter start-up time for interactive jobs. Just use the -q express flag to submit to this queue:

qsub -q express jobscript.sh

or for an interactive job:

qsub -q express -I

This will give you faster access if you have special needs during development, testing of job script logic or interactive use.

3.8.13 Priority and limitations

Jobs in the express queue will get higher priority than any other jobs in the system and will thus have a shorter queue delay than regular jobs. To prevent misuse the express queue has the following limitations:

• Only one running job per user.

• Maximum 8 hours walltime.

• Maximum 8 nodes per job. This allows for jobs requesting up to 64 cpus to run in the queue, -lnodes=8:ppn=8. Remark: -lnodes=64 will NOT work; the node number must be less than or equal to 8.

• Only one job queued at any time; remark, this is for the whole queue. This is to prevent express jobs from delaying large regular jobs.

So, it is more or less pointless to try to use the express queue to sneak regular production jobs past the other regular jobs. Submitting a large number of jobs to the express queue will most probably decrease the overall throughput of your jobs. Also remark that large jobs get prioritized anyway, so they will most probably not benefit from using the express queue.

3.9 Using software installed on Stallo

More information and guides to a few of the installed applications can be found in Application guides.

3.9.1 Which software is installed on Stallo

To find out what software packages are available to you, type

module avail

Missing / new SW?

If there is any SW missing from this list that you would like to have installed on Stallo, or you need help to install your own SW, please feel free to contact the support personnel about this: [email protected].


Changes in application software

News about planned/unplanned downtime, changes in hardware, and important changes in software will be published on the HPC UiT twitter account https://twitter.com/hpc_uit and on the login screen of Stallo. For more information on the different situations see news.

The easiest way to check which software and versions are available is to use the module command. List all software available:

module avail

List all version of a specific software:

module avail software-name

3.9.2 Modules

Software installations on the clusters span many applications, and many different version numbers of the same application. It is not possible (nor desirable) to use them all at the same time, since different versions of the same application may conflict with each other. In order to simplify the control of which application versions are available in a specific session, there is a system for loading and unloading 'modules' which activate and deactivate the relevant parts of your user session.

The main command for using this system is the module command. You can find a list of all its options by typing

module --help

Below we list the most commonly used options.

Which modules are currently loaded?

To see the modules currently active in your session, use the command

module list

Which modules are available?

In order to see a complete list of available modules, issue the command

module avail

The resulting list will contain module names conforming to the following pattern:

• name of the module

• /

• version

• (default) if default version

How to load a module

In order to make, for instance, the NetCDF library available issue the command


module load netcdf

This will load the default NetCDF version. To load a specific version, just include the version after a slash:

module load netcdf/3.6.2

How to unload a module

Keeping with the above example, use the following command to unload the NetCDF module again:

module unload netcdf

Which version of a module is the default?

In order to see, for instance, which version of netcdf is loaded by the module, use:

module show netcdf

This will show the version and path of the module.

How do I switch to a different version of a module?

Switching to another version is similar to loading a specific version. As an example, if you want to switch from the default (current) Intel compilers version to version 14.0, type

module switch intel intel/14.0

3.10 Guidelines for use of computer equipment at the UiT The Arctic University of Norway

3.10.1 Definitions

Explanation of words and expressions used in these guidelines.

users: Every person who has access to and who uses the University's computer equipment. This includes employees, students and others who are granted access to the computer equipment.

user contract: Agreement between users and department(s) at the University which regulates the user's right of access to the computer equipment. The user contract is also a confirmation that these guidelines are accepted by the user.

data: All information that is on the computer equipment. This includes both the contents of data files and software.

computer network: Hardware and/or software which makes it possible to establish a connection between two or more computer terminals. This includes private, local, national and international computer networks which are accessible through the computer equipment.

breakdown (stoppage): Disturbances and abnormalities which prevent the user from maximum utilization of the computer equipment.

computer equipment: This includes hardware, software, data, services and computer network.

hardware: Mechanical equipment that can be used for data processing.


private data: Data found in reserved or private areas or that is marked as private. The data in a user's account is to be regarded as private irrespective of the rights attached to the data.

resources: Resources refers to the computer equipment, including the time and capacity available for the persons who are connected to the equipment.

3.10.2 Purpose

The purpose of these guidelines is to contribute towards the development of a computer environment in which the potential provided by the computer equipment can be utilized in the best possible way by the University and by society at large. This is to promote education and research and to disseminate knowledge about scientific methods and results.

3.10.3 Application

These guidelines apply to the use of the University's computer equipment and apply to all users who are granted access to the computer equipment. The guidelines are to be part of a user contract and are otherwise to be accessible at suitable places such as the terminal room. The use of the computer equipment also requires that the user knows any possible supplementary regulations.

3.10.4 Good Faith

Never leave any doubt as to your identity and give your full name in addition to explaining your connection to the University. A user is always to identify him/herself by name, his/her own user identity, password or in another regular way when using services on the computer network. The goodwill of external environments is not to be abused by the user accessing information not intended for the user, or by use of the services for purposes other than those for which they are intended. Users are to follow the instructions from system administrators about the use of the computer equipment. Users are also expected to familiarize themselves with user guides, manuals, documentation etc. in order to reduce the risk of breakdowns or loss of data or equipment (through ignorance). On termination of employment or studies, it is the user's responsibility to ensure that copies of data owned or used by the University are secured on behalf of the University.

3.10.5 Data Safety

Users are obliged to take the necessary measures to prevent the loss of data etc. by taking back-up copies, careful storage of media, etc. This can be done by ensuring that the systems management takes care of it. Your files are in principle personal but should be protected so that they cannot be read by others. Users are obliged to protect the integrity of their passwords or other safety elements known to them, in addition to preventing unauthorized people from obtaining access to the computer equipment. Introducing data involves the risk of unwanted elements such as viruses. Users are obliged to take measures to protect the computer equipment from such things. Users are obliged to report circumstances that may have importance for the safety or integrity of the equipment to their closest superior or to the person who is responsible for data safety.

3.10.6 Respect for Other Users Privacy

Users may not try to find out another person's password, etc., nor try to obtain unauthorized access to another person's data. This is true independent of whether or not the data is protected. Users are obliged to familiarize themselves with the special rules that apply to the storage of personal information (on others). If a user wishes to register personal information, the user concerned is obliged to ensure that there is permission for this under the law for registration of information on persons, or rules authorized by the law, or with acceptance of rights given to the University. In cases where registration of such information is not permitted by these rules, the user is obliged to apply for (and obtain) the necessary permission. Users are bound by the oaths of secrecy concerning personal relationships of which the user acquires knowledge through use of the computer equipment, ref. the definition in section 13 second section of the Administration Law (forvaltningslovens section 13 annet ledd).

3.10.7 Proper Use

The computer equipment of the University may not be used to advance slander or discriminating remarks, nor to distribute pornography or spread secret information, or to violate the peace of private life, or to incite or take part in illegal actions. Apart from this, users are to refrain from improper communication on the network.

The computer equipment is to be used in accordance with the aims of the University. This excludes direct commercial use.

3.10.8 Awareness of the Purposes for Use of Resources

The computer equipment of the University is to strengthen and support professional activity, administration, research and teaching. Users have a co-responsibility in making the best possible use of the resources.

3.10.9 Rights

Data is usually linked to rights which make its use dependent on agreements with the holder of the rights. Users commit themselves to respecting other people's rights. This applies also when the University makes data accessible. The copying of programs in violation of the rights of use and/or license agreements is not permitted.

3.10.10 Liability

Users themselves are responsible for the use of data which is made accessible via the computer equipment. The University disclaims all responsibility for any loss that results from errors or defects in the computer equipment, including, for example, errors or defects in data, use of data from accessible databases or other data that has been obtained through the computer network etc. The University is not responsible for damage or loss suffered by users as a consequence of insufficient protection of their own data.

3.10.11 Surveillance

The systems manager has the right to seek access to the individual user's reserved areas on the equipment for the purpose of ensuring the equipment's proper functioning or to control that the user does not violate or has not violated the regulations in these guidelines. It is presupposed that such access is only sought when it is of great importance to absolve the University from responsibility or bad reputation. If the systems manager seeks such access, the user should be warned about it in an appropriate way. Ordinarily such a warning should be given in writing and in advance. If the use of a workstation, terminal or other end user equipment is under surveillance because of operational safety or other considerations, information about this must be given in an appropriate way. The systems managers are bound by oaths of secrecy with respect to information about the user or the user's activity which they obtain in this way, the exception being that circumstances which could represent a violation of these guidelines may be reported to superior authorities.


3.10.12 Sanctions

Breach of these guidelines can lead to the user being denied access to the University's data services, in addition to which there are sanctions that the University can order, applying other rules. Breach of privacy laws, oaths of secrecy etc. can lead to liability or punishment. The usual rules for dismissal or (forced) resignation of employees or disciplinary measures against students apply to users who misuse the computer equipment. The reasons for sanctions against a user are to be stated, and can be ordered by the person who has authority given by the University. Disciplinary measures against students are passed by the University Council, ref. section 47 of the University law.

3.10.13 Complaints

Complaints about sanctions are to be directed to the person(s) who ordered the sanctions. If the complaint is not complied with, it is sent on to the University Council for final decision. Complaints about surveillance follow the same procedure as for sanctions. The procedure for complaints about dismissal or resignation of employees follows the usual rules for the University, and rules otherwise valid in Norwegian society. Decisions about disciplinary measures against students cannot be appealed, see section 47 of the University law.

3.11 Do’s and do-not-do’s

• Never run calculations on the home disk

• Always use the queueing system

• The login nodes are only for editing files and submitting jobs

• Do not run calculations interactively on the login nodes


CHAPTER 4

Development of applications on Stallo

4.1 Compilers

The default development environment on Stallo is provided by Intel Cluster Studio XE. In general users are advised to use the Intel compilers and MKL performance libraries, since they usually give the best performance.

4.1.1 Fortran compilers

The recommended Fortran compiler is the Intel Fortran Compiler: ifort. The gfortran compiler is also installed, but we do not recommend it for general usage.

4.1.2 Usage of the Intel ifort compiler

For plain Fortran codes (all Fortran standards) the general form for usage of the Intel ifort compiler is as follows:

$ ifort [options] file1 [file2 ...]

where options represents zero or more compiler options, and fileN is a Fortran source (.f .for .ftn .f90 .fpp .F .FOR .F90 .i .i90), assembly (.s .S), object (.o), static library (.a), or another linkable file. Commonly used options may be placed in the ifort.cfg file.

The form above also applies for Fortran codes parallelized with OpenMP (www.openmp.org, Wikipedia); you only have to select the necessary compiler options for OpenMP.

For Fortran codes parallelized with MPI the general form is quite similar:

$ mpif90 [options] file1 [file2 ...]

The wrapper mpif90 uses the Intel ifort compiler and invokes all the necessary MPI machinery automatically for you. Therefore, everything else is the same for compiling MPI codes as for compiling plain Fortran codes.

4.1.3 C and C++ compilers

The recommended C and C++ compilers are the Intel compilers: icc (C) and icpc (C++). The gcc and g++ compilers are also installed, but we do not recommend them for general usage due to performance issues.


4.1.4 Usage of the Intel C/C++ compilers

For plain C/C++ codes the general form for usage of the Intel icc/icpc compilers are as follows:

$ icc [options] file1 [file2 ...]   # for C
$ icpc [options] file1 [file2 ...]  # for C++

where options represents zero or more compiler options, fileN is a C/C++ source (.C .c .cc .cpp .cxx .c++ .i .ii), assembly (.s .S), object (.o), static library (.a), or other linkable file. Commonly used options may be placed in the icc.cfg file (C) or the icpc.cfg file (C++).

The form above also applies for C/C++ codes parallelized with OpenMP (www.openmp.org, Wikipedia); you only have to select the necessary compiler options for OpenMP.

For C/C++ codes parallelized with MPI the general form is quite similar:

$ mpicc [options] file1 [file2 ...]  # for C when using OpenMPI
$ mpiCC [options] file1 [file2 ...]  # for C++ when using OpenMPI

Both mpicc and mpiCC use the Intel compilers; they are just wrappers that invoke all the necessary MPI machinery automatically for you. Therefore, everything else is the same for compiling MPI codes as for compiling plain C/C++ codes.

4.2 Environment modules

The user's environment, and which programs are available for immediate use, is controlled by the module command. Many development libraries are dependent on a particular compiler version, and at times a specific MPI library. When loading and/or unloading a compiler, module automatically unloads incompatible modules and, if possible, reloads compatible versions.

Currently, not all libraries in all combinations of all compilers and MPI implementations are supported. By using the default compiler and MPI library, these problems can be avoided. In the future, we aim to automate the process so that all possible (valid) permutations are allowed.

Read the Using software installed on Stallo section for an introduction on how to use modules.

4.3 Optimization

top, time command, lsof, strace, valgrind cache profiler, compiler flags, IPM, Vtune

In general, in order to reach performance close to the theoretical peak, it is necessary to write your algorithms in a form that allows the use of scientific library routines, such as BLACS/LAPACK. See General software and libraries for available and recommended libraries.

4.3.1 Performance tuning by Compiler flags.

Quick n’dirty.

Use ifort/icc -O3. We usually recommend that you use the ifort/icc compilers as they give superior performance on Stallo. Using -O3 is a quick way to get reasonable performance for most applications. Unfortunately, sometimes the compiler breaks the code with -O3, making it crash or give incorrect results. Try a lower optimization, -O2 or -O1; if this doesn't help, let us know and we will try to solve this or report a compiler bug to Intel. If you need to use -O2 or -O1 instead of -O3, please remember to add the -ftz flag too; this will flush small values to zero. Doing this can have a huge impact on the performance of your application.
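As a minimal sketch (source and executable names are placeholders):

$ ifort -O3 -o myprog.x myprog.f90
# if -O3 miscompiles or gives wrong results, fall back to:
$ ifort -O2 -ftz -o myprog.x myprog.f90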


Profile based optimization

The Intel compilers can do something called profile based optimization. This uses information from the execution of the application to create more effective code. It is important that you run the application with a typical input set, or else the compiler will tune the application for another usage profile than you are interested in. With a typical input set one means for instance a full spatial input set, but using just a few iterations for the time stepping.

1. Compile with -prof_gen.

2. Run the app (might take a long time as optimization is turned off in this stage).

3. Recompile with -prof_use. The simplest case is to compile/run/recompile in the same directory, or else you need to use the -prof_dir flag; see the manual for details. A sketch of the sequence is shown below.
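A sketch of the sequence (file names and the input set are placeholders; check the compiler manual for the exact spelling of the profiling options for your compiler version):

$ ifort -prof_gen -o myprog.x myprog.f90
$ ./myprog.x typical_input          # run with a typical input set
$ ifort -prof_use -O3 -o myprog.x myprog.f90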

IPM: MPI performance profiling

IPM is a tool which quickly gives an overview of the time spent in the different MPI calls. It is very simple to use and can also give HTML output with a graphical representation of the results.

In your script or on the command line just write

module load ipm

and run your application as usual.

You can stop ipm with ipm_stop, and restart it with ipm_start.

At the end of the run you will get an overview (standard output):

##IPMv0.982####################################################################
#
# command   : a.out (completed)
# host      : stallo-1/x86_64_Linux        mpi_tasks : 1 on 1 nodes
# start     : 04/30/10/13:09:19            wallclock : 0.002069 sec
# stop      : 04/30/10/13:09:19            %comm     : 0.02
# gbytes    : 8.27026e-02 total            gflop/sec : 0.00000e+00 total
#
###############################################################################
# region    :            [ntasks] = 1
#
#                 [total]        <avg>          min            max
# entries         1              1              1              1
# wallclock       0.002069       0.002069       0.002069       0.002069
# user            0.011998       0.011998       0.011998       0.011998
# system          0.024996       0.024996       0.024996       0.024996
# mpi             3.96278e-07    3.96278e-07    3.96278e-07    3.96278e-07
# %comm                          0.0191531      0.0191531      0.0191531
# gflop/sec       0              0              0              0
# gbytes          0.0827026      0.0827026      0.0827026      0.0827026
#
#                 [time]         [calls]        <%mpi>         <%wall>
# MPI_Comm_size   2.31201e-07    1              58.34          0.01
# MPI_Comm_rank   1.65077e-07    1              41.66          0.01
###############################################################################

This will also produce a file like:

MyName.1272624855.201741.0


You can then run the command (on the front-end, stallo-1 or stallo-2, not a compute-node):

ipm_parse -html MyName.1272624855.201741.0

Which produces a new directory with html files that you can visualize in your browser:

firefox a.out_1_MyName.1272624855.201741.0_ipm_unknown/index.html

Note that the use of hardware performance counters is not implemented yet on Stallo. Therefore IPM cannot give information about floating point operations, cache use, etc.

For more details refer to:

http://ipm-hpc.sourceforge.net/userguide.html

4.3.2 Vtune

Basic use of vtune

module unload openmpi
module unload intel-compiler
module load intel-compiler/12.0.4
module load intel-mpi
module load intel-tools
amplxe-gui

<new project>, <new analysis> (choose Hotspots for example), <get command line> and edit it. For a parallel run you will have something like:

mpirun -np 32 amplxe-cl -collect hotspots -follow-child -mrte-mode=auto -target-duration-type=short -no-allow-multiple-runs -no-analyze-system -data-limit=100 -slow-frames-threshold=40 -fast-frames-threshold=100 -r res -- /My/Path/MyProg.x

4.3.3 Compilers, libraries and tools

4.3.4 HPCToolkit

HPCToolkit is a measurement tool for profiling applications using statistical sampling of the system timer or hardware performance counters.

HPCToolkit is installed on Stallo, see http://hpctoolkit.org/

Example of basic use

On the compute-node:

module load hpctoolkit
mpiexec hpcrun-flat Myprog.x

This will produce files such as "Myprog.x.hpcrun-flat.compute-24-5.local.3310.0x0". Each process produces a separate file.

hpcstruct Myprog.x > Myprog.psxml
hpcprof-flat -I '/MyPath/To/Source/Code/' -S Myprog.psxml Myprog.x.hpcrun-flat.compute-24-5.local.3310.0x

One or more files can be included in the profile.

The results can be looked at from the front-end (stallo-2) with:


module load hpctoolkit
hpcviewer experiment-db/experiment.xml

The profiling information is given down to line numbers.

PAPI (Performance Application Programming Interface)

HPCToolkit makes use of some hardware performance counters.

You can read directly the counters if you include some calls to PAPI routines into your code.

See http://icl.cs.utk.edu/papi/ for details.

The PAPI Library is installed on the compute-nodes only.

Here is a simple Fortran example to measure the number of FLOP/s using one of the high-level PAPI functions:

program testpapi

  real*4 :: rtime, ptime, mflops
  integer*8 :: flpops

  call PAPIF_flops(rtime, ptime, flpops, mflops, ierr)

  call my_calc

  call PAPIF_flops(rtime, ptime, flpops, mflops, ierr)

  write (*,90) rtime, ptime, flpops, mflops

90 format(' Real time (secs) :', f15.3, &
        /' CPU time (secs) :', f15.3, &
        /' Floating point instructions :', i15, &
        /' MFLOPS :', f15.3)

end program testpapi

subroutine my_calc
  real :: x
  x=0.5
  do i=1,100000000
    x=x*x-0.8
  enddo
  if(x==1000) write(*,*) x
end subroutine my_calc

Compile with

ifort -I/usr/include -L/usr/lib64 -lpapi papi.f90

4.3.5 Using google-perftools

Overview

Perf Tools is a collection of a high-performance multi-threaded malloc() implementation, plus some pretty nifty performance analysis tools.

For more information visit http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools


Example

Note: this is by no means complete documentation, but simply gives you an idea of what the API is like.

No recompilation is necessary to use these tools.

TC Malloc:

gcc [...] -ltcmalloc

Heap Checker:

gcc [...] -o myprogram -ltcmalloc
HEAPCHECK=normal ./myprogram

Heap Profiler:

gcc [...] -o myprogram -ltcmalloc
HEAPPROFILE=/tmp/netheap ./myprogram

Cpu Profiler:

gcc [...] -o myprogram -lprofiler
CPUPROFILE=/tmp/profile ./myprogram
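The collected CPU profile can afterwards be inspected with the pprof script that ships with google-perftools, assuming it is in your PATH:

pprof --text ./myprogram /tmp/profile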

4.4 Parallel applications

MPI, OpenMP and Hybrid codes

4.4.1 MPI based Fortran, C or C++ codes

For compiling Fortran, C or C++ codes parallelized with MPI, the form is quite similar:

mpif90 [options] file1 [file2 ...] - for Fortran when using OpenMPI
mpicc [options] file1 [file2 ...]  - for C when using OpenMPI
mpiCC [options] file1 [file2 ...]  - for C++ when using OpenMPI

mpif90, mpicc and mpiCC use the Intel compilers; they are just wrappers that invoke all the necessary MPI machinery automatically for you. Therefore, everything else is the same for compiling MPI codes as for compiling plain Fortran/C/C++ codes. Please check also the section about MPI libraries for more information about using MPI.

To see exactly what the wrappers do, you can use the "--showme" option, for example:

$ mpif90 --showme
ifort -I/global/apps/openmpi/1.3.3/include -I/global/apps/openmpi/1.3.3/lib -L/opt/torque/lib64 -Wl,--rpath -Wl,/opt/torque/lib64 -L/global/apps/openmpi/1.3.3/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -lrdmacm -libverbs -ltorque -ldl -lnsl -lutil

To start the MPI application:

mpirun MyApplication

This will use all the CPUs available

4.4.2 OpenMP based codes

You only have to select the necessary compiler options for OpenMP. For intel compilers:

ifort -openmp MyApplication.f90
icc -openmp MyApplication.c


and for hybrid MPI/OpenMP:

mpif90 -openmp MyApplication.f90
mpicc -openmp MyApplication.c

4.4.3 Launch hybrid MPI/OpenMP applications:

For example 12 MPI processes with 8 threads each (i.e. uses 96 cores):

export OMP_NUM_THREADS=8
mpirun -np 12 -npernode 1 MyHybridApplication

OMP_NUM_THREADS=8 is the default, so it would normally not be necessary to specify it.

Note that you must ensure that you have reserved 12 nodes (96 cores) for this application:

#PBS -lnodes=12:ppn=8
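A minimal job script sketch for this hybrid run, assuming a Torque/PBS-style submission as in the directive above (the walltime value is a placeholder):

#!/bin/bash
#PBS -lnodes=12:ppn=8
#PBS -lwalltime=01:00:00

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
mpirun -np 12 -npernode 1 ./MyHybridApplication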

4.4.4 Checkpoint and restart of applications

In normal conditions your job will not be suspended and resumed on Stallo. Once your job is started it will run until it is finished or the walltime limit is reached.

In special cases (system/power/hardware failure or reached walltime limit), jobs may be interrupted. It is the user's responsibility to ensure that the necessary data is stored and to provide the necessary environment if the job is supposed to be restarted.

4.4.5 Notes on MPI

RDMA capabilities using OpenMPI

This section presents notes about different tests performed to investigate the possibilities of RDMA (Remote Direct Memory Access) and one-sided communication using MPI on Stallo.

OpenMPI

A lot of info can be found at http://www.open-mpi.org, but here we will focus on an end user point of view.

Test: send while the receiving process is not in an MPI call (i.e. eager send)

We have first tested how large chunks of data can be sent eagerly with MPI_Send (i.e. which in effect does not block), even if the receiver process is busy (not within an MPI call).

In order to obtain meaningful results the code must include a successful communication between the actual processes before the eager send. This is necessary to initialize MPI for eager send; otherwise the send cannot be eager. It is also after this initialization that resources are reserved in memory (MPI_Init is not sufficient). A barrier is not enough: for example, assume ranks 0,1,2,3,4,5,6,7 are on one node and ranks 8,9,10,11,12,13,14,15 on another node; a call to MPI_Barrier will initialize the connection between rank 0 and 8, or between rank 1 and 9, but not between 0 and 9.

For sends between InfiniBand nodes the default limit for eager send is 12 kB and is determined by the MCA parameter btl_openib_eager_limit. This value can be changed by the user, for example:

mpirun --mca btl_openib_eager_limit 65000 -pernode a.out


btl_openib_eager_limit must be <= btl_openib_max_send_size (64kB default). But this value can also be set differently, for example:

mpirun --mca btl_openib_eager_limit 1000000 --mca btl_openib_max_send_size 1000000 -pernode a.out

For eager send within a node using MPI shared memory (sm), the default limit is 4 kB; it can be changed with, for example:

mpirun --mca btl_sm_eager_limit 1000000 -pernode a.out

For a send to itself, the default limit is 128kB; it can be changed with, for example:

mpirun --mca btl_self_eager_limit 2000000000 -np 2 a.out

Important limitations

OpenMPI will reserve about 600 times more memory than the value of btl_openib_eager_limit. That means that for sending 10 MB eagerly between nodes, about 6GB must be reserved exclusively for OpenMPI buffers. (This does, however, not seem to increase proportionally with the number of nodes.)

btl_openib_free_list_max must be large (>260), and this number times btl_openib_eager_limit x 2 is reserved.

Non-blocking send

In the case of non-blocking send (MPI_Isend), the data will not necessarily be sent before the next MPI call is reached in the caller process. It will be sent properly between nodes, but not necessarily within a node. The connection also has to be initialized (see above).

The limit of how much can be sent within a node with MPI_Isend is determined by btl_sm_eager_limit. This parameter can be increased without special memory penalty.

Conclusion

The best way to take advantage of RDMA is to use non-blocking send (MPI_Isend) in conjunction with a sufficiently high value of btl_sm_eager_limit.

Intel MPI

The Intel MPI library has different rules. The default limit for eager send is 64kB. This can be increased, with an important limitation, see below. Syntax for increasing the limits:

module unload openmpi/1.4
module unload intel-compiler/11.1
module load intel-compiler/12.1.0
module load intel-mpi
mpirun -np 16 -env I_MPI_EAGER_THRESHOLD 25600 -env I_MPI_INTRANODE_EAGER_THRESHOLD 2560 a.out

There are fundamental differences compared to OpenMPI:

The "eager send" does not work as long as the receiving process is not in an MPI call (not necessarily the corresponding receiving call).

Increasing the threshold does not have an important memory penalty as in OpenMPI.


4.4.6 MPI libraries

Information about the MPI libraries installed on Stallo.

OpenMPI

On Stallo the default MPI library is OpenMPI. You compile with mpicc/mpif90, which will use the Intel compilers (more info) for building the application. You can start your application with mpirun like this:

mpirun your-mpi-application

No need for extra flags; mpirun will find out the right nodes to run on directly from the queueing system. See the example runscript for a starting point.

More information about MPI

To find information and download various material, like courses, the standard, program examples, and free implementations of MPICH for Windows and most flavours of Unix, go to

• MPI on Netlib

• MPI Forum

• Open MPI

• MPI on Wikipedia

4.5 Scripting languages

A rich set of scripting tools and languages is available on Stallo, including Python, Bash, Lua, R, Perl and Ruby.

4.6 Debugging

4.6.1 Debugging with TotalView

TotalView is a graphical, source-level, multiprocess debugger. It is the primary debugger on Stallo and has excellent capabilities for debugging parallel programs. When using this debugger you need to turn on X-forwarding, which is done when you log in via ssh. This is done by adding -Y on newer ssh versions, and -X on older:

$ ssh -Y <username>@stallo.uit.no

The program you want to debug has to be compiled with the debug option. This is the "-g" option on Intel and most other compilers. The executable from this compilation will in the following examples be called "filename".

First, load the totalview module to get the correct environment variables set:

$ module load totalview

To start debugging run:

$ totalview MyProg


Which will start a graphical user interface.

Once inside the debugger, if you cannot see any source code and keep the source files in a separate directory, add the search path to this directory via the main menu item File->Search path.

Source lines where it is possible to insert a breakpoint are marked with a box in the left column. Click on a box to toggle a breakpoint.

Double clicking a function/subroutine name in a source file should open the source file. You can go back to the previous view by clicking on the left arrow at the top of the window.

The button "Go" runs the program from the beginning until the first breakpoint. "Next" and "Step" take you one line/statement forward. "Out" will continue until the end of the current subroutine/function. "Run to" will continue until the next breakpoint.

The value of variables can be inspected by right clicking on the name, then choosing "add to expression list". The variable will now be shown in a pop-up window. Scalar variables will be shown with their value; arrays with their dimensions and type. To see all values in the array, right click on the variable in the pop-up window and choose "dive". You can now scroll through the list of values. Another useful option is to visualize the array: after choosing "dive", open the menu item "Tools->Visualize" of the pop-up window. If you did this with a 2D array, use the middle button and drag the mouse to rotate the surface that popped up, shift+middle button to pan, and Ctrl+middle button to zoom in/out.

Running totalview inside the batch system (compute nodes):

$ qsub -I -lnodes=[#procs] -lwalltime=[time] -A [account] -q express -X
$ mkdir -p /global/work/$USER/test_dir
$ cp $HOME/test_dir/a.out /global/work/$USER/test_dir
$ cd /global/work/$USER/test_dir
$ module load totalview
$ totalview a.out

Replace [#procs] with the core-count for the job.

A window with the name "New Program" should pop up. Under "Program" write the name of the executable. Under "Parallel" choose "Open MPI", and "Tasks" is the number of cores you are using ([#procs]).

You can also start Totalview with:

$ mpirun -tv a.out

The user's guide and the quick start guide for TotalView can be found on the RogueWave documentation page.

4.6.2 Debugging with Valgrind

• A memory error detector

• A thread error detector

• An MPI error detector

• A cache and branch-prediction profiler

• A call-graph generating cache profiler

• A heap profiler

• Very easy to use

Valgrind is the best thing for developers since the invention of pre-sliced bread!

Valgrind works by emulating one or more CPUs. Hence it can intercept and inspect your unmodified, running program in ways which would not otherwise be possible. It can for example check that all variables have actually been assigned before use, and that all memory references are within their allowed space, even for static arrays and arrays on the stack.


What makes Valgrind so extremely powerful is that it will tell exactly where in the program a problem, error or violation occurred. It will also give you information on how many allocates/deallocates the program performed, and whether there is any unreleased memory or memory leaks at program exit. In fact, it goes even further and will tell you on what line the unreleased/leaking memory was allocated. The cache profiler will give you information about cache misses and where they occur.

The biggest downside to Valgrind is that it will make your program run much slower. How much slower depends on what kind of, and how much, information you request. Typically the program will run 10-100 times slower under Valgrind.

Simply start Valgrind with the path to the binary program to be tested:

$ module load valgrind
$ valgrind /path/to/prog

This runs Valgrind with the default "tool", which is called memcheck and checks memory consistency. When run without any extra flags, Valgrind will produce a balanced, not overly detailed but informative output. If you need a more detailed (but slower) report, run Valgrind with:

$ valgrind --leak-check=full --track-origins=yes --show-reachable=yes /path/to/prog

Of course, if you want to get all possible information about where in the program something was inconsistent, you must compile the program with debugging flags switched on.
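For example, with the Intel compilers (the program name is a placeholder):

$ ifort -g -O0 -o myprog.x myprog.f90
$ valgrind --leak-check=full ./myprog.x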

If you have a multi-threaded program (e.g. OpenMP, pthreads), and you are unsure if there might be possible deadlocks or data races lurking in your program, the Valgrind thread checker is your best friend. The thread checking tool is called helgrind:

$ export OMP_NUM_THREADS=2
$ valgrind --tool=helgrind /path/to/prog

For more information on using Valgrind please refer to the man pages and the Valgrind manual, which can be found on the Valgrind website: http://www.valgrind.org


CHAPTER 5

Applications

5.1 Application guides

For a general explanation on how to make an application available for use, the module system, and information about changes in application software, see Using software installed on Stallo.

5.1.1 The GAUSSIAN program system

Gaussian is a computational chemistry software program system initially released in 1970. The name comes from the, then groundbreaking, use of Gaussian-type basis functions – which later more or less has become the "trade standard". Gaussian has a rather low user threshold and a tidy user setup, which, together with a broad range of possibilities and a graphical user interface (GaussView), might explain its popularity in academic institutions.

• For information about the latest Gaussian release, see http://www.gaussian.com/g_prod/g09b.htm.

• For a list of functionality: see http://www.gaussian.com/g_prod/g09_glance.htm

• For information about Gaussian 09 job types, see http://www.gaussian.com/g_tech/g_ur/m_jobtypes.htm

GAUSSIAN on NOTUR installations

Currently there are two available main release versions in the NOTUR system: Gaussian 03 and Gaussian 09. Gaussian 09 is the latest in the Gaussian series of programs and Gaussian 03 is its predecessor. For differences between the two, please check the Gaussian web page http://www.gaussian.com/g_tech/g_ur/a_gdiffs09.htm.

Generally, we advise users to run the latest version of the software. This rule has some important exceptions:

1. As a rule of thumb you should always complete a project using the same software as you started with.

2. If you by experience know that your kind of problem behaves better (runs faster, uses less memory or scales better) on older versions of the program, we of course encourage you to run your jobs the more efficient way.

For more information on using Gaussian, please look at the following:

LINKS

Software home-page: http://www.gaussian.com/

Technical support from software vendor: http://www.gaussian.com/g_email/em_help.htm

Technical support for Stallo: [email protected]


5.1.2 VASP (Vienna Ab initio Simulation Package)

VASP is a package for performing ab-initio quantum-mechanical molecular dynamics (MD) using pseudopotentials and a plane wave basis set.

To use VASP you need to be a member of the vasp group; contact the HPC staff to get access. VASP is commercial software that requires a license for everyone who wants to run it on Stallo; to get access you need to prove that the group you are a member of holds a valid licence.

You load the application by typing:

$ module load vasp

This command will give you the default version, which is currently 5.3.2.

For more information on available versions, type:

$ module avail vasp

For more information, see: http://www.vasp.at

5.1.3 Amsterdam Density Functional program system

A description of the quantum chemistry and material science package ADF/BAND

General Information:

The ADF software is a DFT-only first-principles electronic structure calculation program system, and consists of a rich variety of packages.

Online info from vendor:

Homepage: http://www.scm.com
Documentation: http://www.scm.com/Doc

The support people in NOTUR do not provide troubleshooting guides anymore, due to a national agreement that it is better for the community as a whole to add to the community info/knowledge pool where such is made available. For ADF/BAND we advise to search in the general documentation, sending emails to support (either NOTUR or SCM), or trying the ADF mailing list (see http://www.scm.com/Support for more info).

Citation

When publishing results obtained with the referred software, please check the developers' web page in order to find the correct citation(s).

License information:

The license of ADF/Band is commercial.

NOTUR holds a national license of the ADF program system, making usage of ADF/BAND available for all academic scientists in Norway.

We have a national license for the following packages:

• ADF & ADFGUI


• BAND &BANDGUI

• CRS

• DFTB & DFTBGUI

• GUI

• REAXFF & REAXFFGUI

• NBO6 (on Stallo only, but machine license available for all users of Stallo).

Please note that this is an academic type license, meaning that research institutes not being part of Norwegian universities must provide their own license to be able to use and publish results obtained with this code on NOTUR installations.

ADF/BAND is software for first-principles electronic structure calculations. The parallel version has been tested for scaling and most problems scale well up to 16 cpus; larger problems scale beyond 64 cpus. Over 100 cpus is normally not recommended due to limited scaling.

(link to general info about license policy notur?)

Usage

(Link to general info about scaling of software on notur systems; get feedback from users?)

We generally advise to run ADF on more than one node. The parallel version has been tested for scaling and most problems scale well up to 4 nodes.

Use

$ module avail adf

to see which versions of ADF are available. Use

$ module load adf

or

$ module load adf/<version> # e.g. 2014.08

to get access to ADF. Note that ADF contains its own MPI implementation, so there is no need to load any openmpi module to use ADF.

First run of ADF/BAND on Stallo:

To download the jobscript example(s), type the following:

Copy the contents of this folder to your home directory.

Then, submit the job by typing:

Compare the energy of adf_example.out with the Bond Energy = -156.77666698 eV in adf_example_correct_stallo.output. The energy should ideally be identical or close to identical. When this is the case, you may alter variables in the shell script as much as you like to adapt them to your own jobs. Good luck.

• NB: ADF is on Stallo set up with Intel MPI; if not using this script, remember to do module swap openmpi impi before loading the adf module.

• On Stallo, we hold the nbo6 plug-in license that allows users of ADF to produce files that can be read with the nbo6 software.

