Unix Basics and Cluster Computing
TRANSCRIPT
1/13/15
Next-Generation Sequencing Analysis Series
January 14, 2015
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist, BCBB/OCICB/NIAID/NIH
BCBB instructors for this NGS series: Andrew Oler, Vijay Nagarajan, Mariam Quiñones
Bioinformatics and Computational Biosciences Branch
NIH/NIAID/OD/OSMO/OCICB
Contact BCBB at [email protected]
Contact HPC Cluster team at: [email protected]
Bioinformatics and Computational Biosciences Branch
§ Bioinformatics Software Developers
§ Computational Biologists
§ Project Managers & Analysts
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
Objectives
When you leave today, I hope you will be able to
1. Open a terminal and know how to navigate
2. Do basic file manipulation and create files and directories from the command line
3. Submit a job to the HPC cluster
To accomplish these goals, we will
1. Learn the most useful Unix terminal commands
2. Practice a few of these commands
3. Practice preparing and submitting some scripts to the NIAID HPC Cluster
Caveat:
1. You may not be a Unix expert when you leave today (and that’s okay).
Anatomy of the Terminal, “Command Line”, or “Shell”
[Screenshot callouts: Prompt (computer_name:current_directory username), Cursor, Command, Argument, Window, Output]
Mac: Applications -> Utilities -> Terminal
Windows: download open source software:
  PuTTY  http://www.chiark.greenend.org.uk/~sgtatham/putty/
  Other SSH clients  (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
File Manager/Browser by Operating System
Windows: Explorer
Mac OSX: Finder
Unix: Shell
Typical UNIX directory structure
/ “root”
/bin essential binaries
/etc system config
/home user directories
/home/USER1 USER1 home
/home/USER2 USER2 home
/mnt network drives
/sbin system binaries
/usr shared, read-only
/usr/bin other binaries
/usr/local installed packages
/usr/local/bin installed binaries
/var variable data
/var/tmp program caches
pwd “print working directory”; tells where you are
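To see how pwd and cd interact with this tree, here is a small sketch you can try yourself; the paths under /tmp are invented for the demo and are not part of the course materials:

```shell
# Build a small scratch tree and navigate it with cd/pwd.
rm -rf /tmp/demo_home               # start clean (scratch path, made up for the demo)
mkdir -p /tmp/demo_home/projects
cd /tmp/demo_home/projects
pwd                                 # /tmp/demo_home/projects
cd ..                               # move up one level
pwd                                 # /tmp/demo_home
cd /                                # jump to the root directory
pwd                                 # /
```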
How to execute a command
[Screenshot: a command with its argument typed at the prompt, followed by the command’s output]
Some basic Unix commands
§ pwd
§ ls
§ mkdir
§ cd
§ wget
§ curl
§ cp
§ wc
§ head
§ tail
§ less
§ cat
§ **See Pre-lecture worksheet.**
Tips to make life easier!
Tab completion: hit Tab to make the computer guess your filename.
  type: ls unix[Tab]
  result: ls unix_hpc
  If nothing happens on the first Tab, press Tab again…
Up Arrow: recall the previous command(s)
Ctrl+a  go to beginning of line
Ctrl+e  go to end of line
Ctrl+c  kill the current running process in the terminal
Aliases (put in ~/.bashrc file … see handout):
  alias ls='ls -AFG'
  alias ll='ls -lrhT'
history  show every command issued during the session
!ls  repeat the previous “ls” command
!!  repeat the previous command
man [command]  read the manual for the command
  man ls  read the manual for the ls command
Accessing the NIAID HPC
§ Login to the HPC “submit node,” which is the computer from which you submit jobs.
  ssh  secure shell, remote login
  ssh [email protected]  (fill in XXX with a number)
§ Copy files to/from HPC
  scp  secure copy to remote location
  scp -r ~/data/dir [email protected]:~/data/
§ ssh and scp will prompt you to enter your password
mv (“move file”)
mv file1 temp/  move “file1” to the “temp” directory
mv file1 file2  rename “file1” to “file2”
mv -i file2 temp/file3  move “file2” to the “temp” directory and rename it “file3”; ask before overwriting. *Without -i, mv will overwrite an existing file!*
mv *.fastq ~  move all “.fastq” files to the home directory
Exercise 1:
mv *.fastq temp/  (move all “.fastq” files to the “temp” directory)
ls temp  (check that the files are there)
Note: the syntax for mv and cp is similar
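The mv forms above can be sketched end to end in a throwaway directory; the path and file names here are invented for the demo:

```shell
rm -rf /tmp/mv_demo                 # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/mv_demo/temp
cd /tmp/mv_demo
touch file1 sample.fastq
mv file1 file2                      # rename file1 to file2
mv file2 temp/file3                 # move into temp/ and rename to file3
mv *.fastq temp/                    # move every .fastq file into temp/
ls temp                             # file3 and sample.fastq are now in temp/
```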
rm (“remove file”)
rm file1  delete “file1”
rm -i file2  delete “file2”, but ask first
rm *.pdb  delete all “.pdb” files
rm -r temp  delete the “temp” directory
rm -rf temp  delete the “temp” directory, no questions asked!
Be careful!
rm -r *
File and system information
wc file1  “word count”; output is lines, words, characters
wc *.fastq  “word count” of all fastq files, including a summary line
du -h temp  “disk usage” (size) of each file in the “temp” directory (outputs a list)
top  report on the processes using the most system resources (memory, CPU, etc.) on the local machine; press “q” to exit
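For example, wc on a small file (the file is created here just for the demonstration):

```shell
# 2 lines, 3 words, 14 characters
printf 'one two\nthree\n' > /tmp/wc_demo.txt
wc /tmp/wc_demo.txt                 # prints: 2  3 14 /tmp/wc_demo.txt
wc -l /tmp/wc_demo.txt              # lines only
```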
File compression
gzip temp/*  compress every file in “temp”; adds a .gz extension
gunzip temp/*.gz  expand every gzipped file in “temp”
tar -zcvf myfiles.tar.gz temp/*  create a single archive (“tarball”) of every file in “temp”
tar -xvf test_data.tar.gz  copy every file out of the archive
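A full compress/archive/extract round trip, using a scratch directory and file names invented for the demo:

```shell
rm -rf /tmp/tar_demo                # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/tar_demo/temp
cd /tmp/tar_demo
echo hello > temp/a.txt
gzip temp/a.txt                     # temp/a.txt becomes temp/a.txt.gz
gunzip temp/a.txt.gz                # back to temp/a.txt
tar -zcvf myfiles.tar.gz temp       # archive the whole "temp" directory into a tarball
mkdir extract
cd extract
tar -xvf ../myfiles.tar.gz          # copy every file back out of the tarball
```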
File manipulation
cat file1 file2 > file3  write “file3”, containing first “file1”, then “file2”
cat file1 >> file2  append “file1” onto “file2”
sort file1  alphabetize “file1”
sort -n file1  sort “file1” numerically
sort -n -r -k 2 file1  sort “file1” by the second word or column, in reverse numerical order
Careful! “>” overwrites an existing file
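A short sketch of sort on a made-up three-line file:

```shell
printf 'b 2\nc 10\na 1\n' > /tmp/sort_demo.txt
sort /tmp/sort_demo.txt             # alphabetical: "a 1", "b 2", "c 10"
sort -n -r -k 2 /tmp/sort_demo.txt  # by column 2, reverse numeric: "c 10", "b 2", "a 1"
```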
grep (search within files)
grep key file*  report the file name and line where “key” appears in file*
grep -v key file*  report the lines that do not match “key” (prefixed with the file name when multiple files are searched)
man grep  see other functions of grep (lots! regular expressions!)
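A quick sketch of plain and inverted matching, on a file created just for the demo:

```shell
printf 'alpha key\nbeta\ngamma key\n' > /tmp/grep_demo.txt
grep key /tmp/grep_demo.txt         # lines containing "key": alpha key, gamma key
grep -v key /tmp/grep_demo.txt      # invert the match: beta
grep -c key /tmp/grep_demo.txt      # count matching lines: 2
```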
Linking files (making “shortcuts”)
ln -s ~/myapp/binary ~/bin  make a shortcut (“symbolic link”) in “~/bin” that points to “~/myapp/binary”
ln -s /usr/local data  make a shortcut in the current directory pointing to /usr/local
cd data  takes you to /usr/local
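The same idea in a scratch directory (paths invented for the demo), showing that cd through the link lands in the target:

```shell
rm -rf /tmp/ln_demo                 # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/ln_demo/target
cd /tmp/ln_demo
ln -s target data                   # "data" is now a symbolic link to "target"
ls -l data                          # shows: data -> target
cd data
pwd -P                              # the physical path resolves to the target directory
```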
Downloading Files
wget  download multiple files from an ftp or http address
curl  download single files from ftp, http, sftp, etc.
Comparison: http://curl.haxx.se/docs/comparison-table.html
Pipelining
ls | wc  count the number of files in a directory
grep key file* | sort > file1  pull out searched-for lines, sort them, and write a new file
Exercise 2:
head -n 2000 lymph1k.fastq | gzip > head2K.txt.gz
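The two pipelines above, run in a scratch directory with made-up file names:

```shell
rm -rf /tmp/pipe_demo               # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/pipe_demo
cd /tmp/pipe_demo
touch a b c
ls | wc -l                          # count entries in the directory: 3
printf 'z key\na key\nb\n' > f.txt
grep key f.txt | sort > sorted.txt  # matched lines, sorted, written to a new file
cat sorted.txt                      # "a key" then "z key"
```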
Loops
for  assign a variable to each item of a space-separated list of values
;  use to separate commands
do … done  marks the start and end of the loop body to repeat
Exercise 3:
for i in 1 13 200; do echo $i; done
  (prints 1, 13, 200 on separate lines)
ls
  (file1 file2)
for i in file*; do echo $i; mv "$i" "${i}.txt"; done
ls
  (file1.txt file2.txt)
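Exercise 3 can be run start to finish in a scratch directory (path invented for the demo):

```shell
rm -rf /tmp/loop_demo               # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/loop_demo
cd /tmp/loop_demo
for i in 1 13 200; do echo $i; done         # prints 1, 13, 200 on separate lines
touch file1 file2
for i in file*; do mv "$i" "${i}.txt"; done # append .txt to each file name
ls                                           # file1.txt  file2.txt
```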
Recommended Reading
Linux in a Nutshell, Sixth Edition Ellen Siever, Stephen Figgins, Robert Love, Arnold Robbins
Running Linux, 5th Edition Matthias Kalle Dalheimer, Matt Welsh
UNIX® Shells by Example, Fourth Edition Ellie Quigley
Take Away
Use mnemonics
Read “man” pages
Work on copies, make backups, and use “rm”, “mv”, and “>” carefully
Pick a text editor and master it (pico/nano, emacs, vi/vim, etc.)
Be clever!
Questions?
Using NIAID Grid Engine Cluster
High Performance Computing
§ “A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.” http://en.wikipedia.org/wiki/Cluster_%28computing%29
HPC Glossary
§ node  an individual workstation within a network or cluster; a collection of processors all accessing the same memory (RAM).
§ CPU  abbreviation for central processing unit; the processor of a node. Also referred to as a socket.
  • try: cat /proc/cpuinfo
  • Note that “processors” in the output are actually “cores” by the definition below.
§ core  a separate execution core for calculations; e.g., “dual-core” means the processor has two cores. Sometimes each core is referred to as a separate processor.
§ slot  a single core available for use within a node; e.g., if a node has 16 cores, it will have 16 slots.
§ hyper-threading technology (HTT)  a single execution core is treated as two virtual cores (or two logical processors) by the system. Some of the nodes in the cluster have HTT; e.g., if there are 16 physical cores, there would be 32 logical processors.
§ thread  a single process of a multi-process job; each thread runs on a separate logical processor. E.g., if you run tophat with -p 10, 10 threads will be created and run in parallel.
These definitions are somewhat flexible…
Accessing the NIAID HPC
§ Request an HPC account (for NIAID members and collaborators only)
  • https://hpcweb.niaid.nih.gov/#home
  • “Request Account”
§ Login to the HPC “submit node,” which is the computer from which you submit jobs.
  ssh  secure shell, remote login
  ssh [email protected]
§ Copy files to/from HPC
  scp  secure copy to remote location
  scp -r ~/data/dir [email protected]:~/data/
§ ssh and scp will prompt you to enter your password
Mounting HPC Drives
Mac: see the FAQ link below.
Windows:
1. Click "Start" > "Computer"
2. Click "Map Network Drive"
3. Choose an available drive letter
4. Enter \\ai-hpcfileserver.niaid.nih.gov\bcbb in the “Folder” field, replacing “bcbb” with your group name or your user name.
(For more details, see the link to the FAQ below.)
https://hpcweb.niaid.nih.gov/#support?type=Links&requestType=HPC%20FAQs&name=41
Cluster Architecture and Access
[Diagram, modified from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qrsh -q interactive.q    qsub -q memLong.q
Cluster Queue System: Sun Grid Engine
§ Computers run the Red Hat Linux operating system
§ Grid Engine is a batch queuing system
§ Other queuing systems (http://en.wikipedia.org/wiki/Job_scheduler):
  • Portable Batch System (PBS) (e.g., Biowulf)
  • TORQUE Resource Manager
  • Maui
  • Moab
  • others…
  • Each requires a slightly different syntax for scripts
§ Comes with a set of commands to communicate with the cluster
§ Monitors available resources and users’ workloads to start jobs at the appropriate time
Grid Engine jobs
§ Three types of jobs
• Batch/Serial (one node, one processor)
• Parallel (multiple processors or nodes)
• Interactive
[Diagram: a serial job runs a single Input → Process → Output stream; a parallel job runs several Process streams at once]
Grid Engine Jobs: Interactive
§ Log in to a node (similar to ssh):
  qrsh -l h_vmem=20G
§ Need to specify parameters:
  -l  requested resources in a comma-delimited list
  • For an interactive job: h_vmem=
§ For Biowulf (PBS) (http://biowulf.nih.gov/user_guide.html#interactive):
  qsub -I -V -l nodes=1
Cluster Architecture and Access
[Diagram from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qrsh -q interactive.q
Test TopHat Job in Interactive Session
§ TopHat is a short read aligner for RNA-seq data
§ Manual: http://ccb.jhu.edu/software/tophat/manual.shtml
1. Check dependencies (e.g., PATH)
2. Check command syntax and options
3. Run the command with a test dataset
Grid Engine Jobs: Batch / Serial
§ Single processor, one job
§ Submit a script to the cluster from the submit node, “submit-1”
Cluster Architecture and Access
[Diagram from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qsub -q memLong.q script.sh  OR  qsub script.sh  *No queue specification necessary*
Text Editors for Composing Scripts (batch jobs)
§ Not the same as a word processor (e.g., Microsoft Word)!
§ Try some, choose a favorite
§ Popular for Windows:
  • Notepad++ (nice color-coding)
  • EditPad Lite (can open large files, > 4 GB)
§ Popular for Mac: • TextWrangler
§ Popular for Terminal: • nano • vi • emacs
§ http://en.wikipedia.org/wiki/Comparison_of_text_editors
Quick Look at a Shell Script
Exercise 4:
cd ~/unix_hpc/test_data
cat test_serial.sh
§ A few things to notice:
  • #!/bin/bash – “shebang” or “hashbang,” used to specify the program that runs the script
  • qsub options (next slide)
  • export (used to set environment variables)
    PATH=/path/to/folder:/path/to/another/folder:$PATH
    – lets you type just the name of the executable instead of its full path, e.g., “tophat” instead of “/usr/local/bio_apps/tophat/bin/tophat”
  • Comments about when you ran the job
  • The command for the job
*There is a PBS script for Biowulf as well.*
SGE qsub options
qsub [options] script.sh  submit a job to the cluster
-S /bin/bash  shell to use (default is csh)
-N job_name  name for your job
-q queue.q  queue(s) to submit to, e.g., memLong.q,memRegular.q
-M [email protected]  email address to send alerts to
-m abe  when to send email (e.g., aborted, beginning, end)
-l resources  resources to request, e.g., h_vmem=20G,h_cpu=1:00:00,mem_free=10G
-cwd  run from the current working directory; output goes here
-j y  join stderr and stdout into one file
-pe threaded 10  parallel environment and number of processors/threads: “round” means processors could be on separate machines; “threaded” means all processors on the same machine
§ You can put these options on the command line or in your shell script
§ In a script, lines with these options should begin with #$
Submitting jobs with PBS (Biowulf)
§ PBS options and examples for Biowulf: • http://biowulf.nih.gov/user_guide.html#batchsamp
§ Examples • qsub -I -V -l nodes=1 • qsub -l nodes=1 myjob.bat • qsub -l nodes=8:o2800 myparalleljob • qsub -v np=3 -l nodes=2:g24:c24,mem=0 novompi.sh
§ Option lines start with #PBS instead of #$
§ There is application-specific usage documentation for Biowulf as well
Grid Engine Jobs: Batch / Serial
§ Submit a script to the cluster from the submit node
Exercise 5:
cd ~/unix_hpc/test_data  (remember to try tab completion ☺)
qsub test_serial.sh
It should say “Your job XXXXXX ("tophat_test") has been submitted”, where XXXXXX is the job number.
ls -al
Do you see a file called tophat_test.oXXXXXX, where XXXXXX is your job number?
cat tophat_test.oXXXXXX  (substitute your job number for XXXXXX)
Grid Engine Jobs: Parallel
§ -pe environments (threaded, single, etc.)
§ Basic use in script: #$ -pe threaded 8
§ Can also use advanced options, e.g.,
  • "-pe 12threaded 48" means use 12 cores per node, for a total of 48 cores. This will allocate the job to run on 4 nodes with 12 cores each. Your program must be able to support this.
  • "-pe threaded 5-10" means run the job with 10 cores if available, but down to 5 cores is fine too.
§ Do the math for memory!
  • h_vmem is not total, it’s per thread. E.g., if you have a job that needs 10G total, running on 5 processors, you’ll assign h_vmem=2G, not h_vmem=10G.
  • Let’s edit our script to make it run in parallel…
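The per-thread memory arithmetic above can be sketched in the shell; the variable names are invented for the illustration, and the numbers are the example from the slide:

```shell
TOTAL_GB=10                              # total memory the whole job needs (example value)
THREADS=5                                # cores requested with -pe threaded (example value)
PER_THREAD_GB=$((TOTAL_GB / THREADS))    # memory to request per thread
echo "request h_vmem=${PER_THREAD_GB}G"  # h_vmem=2G per thread, not 10G
```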
Edit Shell Script in the Terminal with nano
Navigation in nano:
§ use arrow keys for up, down, left, right
§ Ctrl+a for beginning of line; Ctrl+e for end of line
§ Other commands at bottom of screen, e.g., Ctrl+o, Ctrl+x
Exercise 6:
cd ~/unix_hpc/test_data
Make a new script for the parallel job and open it in nano:
  cp test_serial.sh test_parallel.sh
  nano test_parallel.sh
Add a line to the script with SGE options:
  #$ -pe threaded 4
Modify the tophat command:
  tophat -p 4 …
Save and close: Ctrl+o, [ENTER], Ctrl+x
Now submit the jobs:
  qsub test_serial.sh
  qsub test_parallel.sh
Monitoring Jobs
Exercise 7:
qsub test_tenminutes.sh
qstat  check on submitted jobs
echo $LOGNAME  check your username
qstat -u $LOGNAME  check the status of your jobs
qstat -u $LOGNAME -ext  check resource usage, including memory
qstat -u $LOGNAME -ext -g t  get extended details, including MASTER and SLAVE nodes for parallel jobs
qstat -j job-ID  get detailed information about your job status
qacct -j 999072  see info about a job after it has run
qalter [new qsub options] [job id]  change parameters while a job is in “qw” status
qdel -u username  delete all of your submitted jobs
qdel jobnumber  delete a single job
§ Websites
• Cluster status: http://hpcweb.niaid.nih.gov/#about?type=About%20Links&requestType=Cluster%20Status
• Current State: http://hpcwiki.niaid.nih.gov/index.php/Current_State • Ganglia toolkit: http://cluster.niaid.nih.gov/ganglia/
Contact Us
http://bioinformatics.niaid.nih.gov
Example Script For SGE

#!/bin/bash
## SGE options (see man qsub for more options)
#$ -S /bin/bash                 #type of shell; default is csh
#$ -N tophat_test               #name of job
#$ -q regular.q,memRegular.q    #which queue to submit job to
#$ -M [email protected]       #email address to send email to
#$ -m abe                       #when to send email: aborted, beginning, end
#$ -l h_vmem=5G,h_cpu=1:00:00   #resources (virtual memory, cpu time)
#$ -cwd                         #run the script from current working directory
#$ -j y                         #join stderr and stdout into one job_id.o file

## Script dependencies
#export the path for bowtie (tophat needs this)
export PATH=$PATH:/usr/local/bio_apps/bowtie
export PATH=$PATH:/usr/local/bio_apps/tophat/bin
export PATH=$PATH:/usr/local/bio_apps/samtools/

## Write comments (to make the future you happy)
# Ran tophat on the test dataset - andrew (111013)
#full path to tophat: /usr/local/bio_apps/tophat/bin/tophat
time tophat -r 20 test_ref reads_1.fq reads_2.fq