Unix Basics and Cluster Computing
TRANSCRIPT
1/13/15
Next-Generation Sequencing Analysis Series
January 14, 2015
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist, BCBB/OCICB/NIAID/NIH
BCBB instructors for this NGS series: Andrew Oler, Vijay Nagarajan, Mariam Quiñones
Bioinformatics and Computational Biosciences Branch
NIH/NIAID/OD/OSMO/OCICB
Contact BCBB at [email protected]
Contact HPC Cluster team at: [email protected]
Bioinformatics and Computational Biosciences Branch
§ Bioinformatics Software Developers
§ Computational Biologists
§ Project Managers & Analysts
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
Objectives
When you leave today, I hope you will be able to
1. Open a terminal and know how to navigate
2. Do basic file manipulation and create files and directories from the command line
3. Submit a job to the HPC cluster
To accomplish these goals, we will
1. Learn the most useful Unix terminal commands
2. Practice a few of these commands
3. Practice preparing and submitting some scripts to the NIAID HPC Cluster
Caveat:
1. You may not be a Unix expert when you leave today (and that’s okay).
Anatomy of the Terminal, “Command Line”, or “Shell”
[Screenshot callouts: Prompt (computer_name:current_directory username), Cursor, Command, Argument, Window, Output]
Mac: Applications -> Utilities -> Terminal
Windows: download open source software:
  PuTTY  http://www.chiark.greenend.org.uk/~sgtatham/putty/
  Other SSH clients  (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
File Manager/Browser by Operating System
Windows: Explorer
Mac OSX: Finder
Unix: Shell
Typical UNIX directory structure
/ “root”
/bin essential binaries
/etc system config
/home user directories
/home/USER1 USER1 home
/home/USER2 USER2 home
/mnt network drives
/sbin system binaries
/usr shared, read-only
/usr/bin other binaries
/usr/local installed packages
/usr/local/bin installed binaries
/var variable data
/var/tmp program caches
pwd “print working directory”; tells where you are
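To see how pwd and cd interact with this tree, here is a small sketch you can try yourself; the paths under /tmp are invented for the demo and are not part of the course materials:

```shell
# Build a small scratch tree and navigate it with cd/pwd.
rm -rf /tmp/demo_home               # start clean (scratch path, made up for the demo)
mkdir -p /tmp/demo_home/projects
cd /tmp/demo_home/projects
pwd                                 # /tmp/demo_home/projects
cd ..                               # move up one level
pwd                                 # /tmp/demo_home
cd /                                # jump to the root directory
pwd                                 # /
```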
How to execute a command
[Screenshot: a command with its argument typed at the prompt, followed by the command’s output]
Some basic Unix commands
§ pwd
§ ls
§ mkdir
§ cd
§ wget
§ curl
§ cp
§ wc
§ head
§ tail
§ less
§ cat
§ **See Pre-lecture worksheet.**
Tips to make life easier!
Tab completion: hit Tab to make the computer guess your filename.
  type: ls unix[Tab]
  result: ls unix_hpc
  If nothing happens on the first Tab, press Tab again…
Up Arrow: recall the previous command(s)
Ctrl+a  go to beginning of line
Ctrl+e  go to end of line
Ctrl+c  kill the current running process in the terminal
Aliases (put in ~/.bashrc file … see handout):
  alias ls='ls -AFG'
  alias ll='ls -lrhT'
history  show every command issued during the session
!ls  repeat the previous “ls” command
!!  repeat the previous command
man [command]  read the manual for the command
  man ls  read the manual for the ls command
Accessing the NIAID HPC
§ Login to the HPC “submit node,” which is the computer from which you submit jobs.
  ssh  secure shell, remote login
  ssh [email protected]  (fill in XXX with a number)
§ Copy files to/from HPC
  scp  secure copy to remote location
  scp -r ~/data/dir [email protected]:~/data/
§ ssh and scp will prompt you to enter your password
mv (“move file”)
mv file1 temp/  move “file1” to the “temp” directory
mv file1 file2  rename “file1” to “file2”
mv -i file2 temp/file3  move “file2” to the “temp” directory and rename it “file3”; ask before overwriting. *Without -i, mv will overwrite an existing file!*
mv *.fastq ~  move all “.fastq” files to the home directory
Exercise 1:
mv *.fastq temp/  (move all “.fastq” files to the “temp” directory)
ls temp  (check that the files are there)
Note: the syntax for mv and cp is similar
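The mv forms above can be sketched end to end in a throwaway directory; the path and file names here are invented for the demo:

```shell
rm -rf /tmp/mv_demo                 # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/mv_demo/temp
cd /tmp/mv_demo
touch file1 sample.fastq
mv file1 file2                      # rename file1 to file2
mv file2 temp/file3                 # move into temp/ and rename to file3
mv *.fastq temp/                    # move every .fastq file into temp/
ls temp                             # file3 and sample.fastq are now in temp/
```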
rm (“remove file”)
rm file1  delete “file1”
rm -i file2  delete “file2”, but ask first
rm *.pdb  delete all “.pdb” files
rm -r temp  delete the “temp” directory
rm -rf temp  delete the “temp” directory, no questions asked!
Be careful!
rm -r *
File and system information
wc file1  “word count”; output is lines, words, characters
wc *.fastq  “word count” of all fastq files, including a summary line
du -h temp  “disk usage” (size) of each file in the “temp” directory (outputs a list)
top  report on the processes using the most system resources (memory, CPU, etc.) on the local machine; press “q” to exit
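For example, wc on a small file (the file is created here just for the demonstration):

```shell
# 2 lines, 3 words, 14 characters
printf 'one two\nthree\n' > /tmp/wc_demo.txt
wc /tmp/wc_demo.txt                 # prints: 2  3 14 /tmp/wc_demo.txt
wc -l /tmp/wc_demo.txt              # lines only
```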
File compression
gzip temp/*  compress every file in “temp”; adds a .gz extension
gunzip temp/*.gz  expand every gzipped file in “temp”
tar -zcvf myfiles.tar.gz temp/*  create a single archive (“tarball”) of every file in “temp”
tar -xvf test_data.tar.gz  copy every file out of the archive
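A full compress/archive/extract round trip, using a scratch directory and file names invented for the demo:

```shell
rm -rf /tmp/tar_demo                # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/tar_demo/temp
cd /tmp/tar_demo
echo hello > temp/a.txt
gzip temp/a.txt                     # temp/a.txt becomes temp/a.txt.gz
gunzip temp/a.txt.gz                # back to temp/a.txt
tar -zcvf myfiles.tar.gz temp       # archive the whole "temp" directory into a tarball
mkdir extract
cd extract
tar -xvf ../myfiles.tar.gz          # copy every file back out of the tarball
```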
File manipulation
cat file1 file2 > file3  write “file3”, containing first “file1”, then “file2”
cat file1 >> file2  append “file1” onto “file2”
sort file1  alphabetize “file1”
sort -n file1  sort “file1” numerically
sort -n -r -k 2 file1  sort “file1” by the second word or column, in reverse numerical order
Careful! “>” overwrites an existing file
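A short sketch of sort on a made-up three-line file:

```shell
printf 'b 2\nc 10\na 1\n' > /tmp/sort_demo.txt
sort /tmp/sort_demo.txt             # alphabetical: "a 1", "b 2", "c 10"
sort -n -r -k 2 /tmp/sort_demo.txt  # by column 2, reverse numeric: "c 10", "b 2", "a 1"
```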
grep (search within files)
grep key file*  report the file name and line where “key” appears in file*
grep -v key file*  report the lines that do not match “key” (prefixed with the file name when multiple files are searched)
man grep  see other functions of grep (lots! regular expressions!)
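A quick sketch of plain and inverted matching, on a file created just for the demo:

```shell
printf 'alpha key\nbeta\ngamma key\n' > /tmp/grep_demo.txt
grep key /tmp/grep_demo.txt         # lines containing "key": alpha key, gamma key
grep -v key /tmp/grep_demo.txt      # invert the match: beta
grep -c key /tmp/grep_demo.txt      # count matching lines: 2
```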
Linking files (making “shortcuts”)
ln -s ~/myapp/binary ~/bin  make a shortcut (“symbolic link”) in “~/bin” that points to “~/myapp/binary”
ln -s /usr/local data  make a shortcut in the current directory pointing to /usr/local
cd data  takes you to /usr/local
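The same idea in a scratch directory (paths invented for the demo), showing that cd through the link lands in the target:

```shell
rm -rf /tmp/ln_demo                 # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/ln_demo/target
cd /tmp/ln_demo
ln -s target data                   # "data" is now a symbolic link to "target"
ls -l data                          # shows: data -> target
cd data
pwd -P                              # the physical path resolves to the target directory
```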
Downloading Files
wget  download multiple files from an ftp or http address
curl  download single files from ftp, http, sftp, etc.
Comparison: http://curl.haxx.se/docs/comparison-table.html
Pipelining
ls | wc  count the number of files in a directory
grep key file* | sort > file1  pull out searched-for lines, sort them, and write a new file
Exercise 2:
head -n 2000 lymph1k.fastq | gzip > head2K.txt.gz
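The two pipelines above, run in a scratch directory with made-up file names:

```shell
rm -rf /tmp/pipe_demo               # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/pipe_demo
cd /tmp/pipe_demo
touch a b c
ls | wc -l                          # count entries in the directory: 3
printf 'z key\na key\nb\n' > f.txt
grep key f.txt | sort > sorted.txt  # matched lines, sorted, written to a new file
cat sorted.txt                      # "a key" then "z key"
```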
Loops
for  assign a variable to each item of a space-separated list of values
;  use to separate commands
do … done  marks the start and end of the loop body to repeat
Exercise 3:
for i in 1 13 200; do echo $i; done
  (prints 1, 13, 200 on separate lines)
ls
  (file1 file2)
for i in file*; do echo $i; mv "$i" "${i}.txt"; done
ls
  (file1.txt file2.txt)
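Exercise 3 can be run start to finish in a scratch directory (path invented for the demo):

```shell
rm -rf /tmp/loop_demo               # start clean (scratch directory, made up for the demo)
mkdir -p /tmp/loop_demo
cd /tmp/loop_demo
for i in 1 13 200; do echo $i; done         # prints 1, 13, 200 on separate lines
touch file1 file2
for i in file*; do mv "$i" "${i}.txt"; done # append .txt to each file name
ls                                           # file1.txt  file2.txt
```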
Recommended Reading
Linux in a Nutshell, Sixth Edition Ellen Siever, Stephen Figgins, Robert Love, Arnold Robbins
Running Linux, 5th Edition Matthias Kalle Dalheimer, Matt Welsh
UNIX® Shells by Example, Fourth Edition Ellie Quigley
Take Away
Use mnemonics
Read “man” pages
Work on copies, make backups, and use “rm”, “mv”, and “>” carefully
Pick a text editor and master it (pico/nano, emacs, vi/vim, etc.)
Be clever!
Questions?
Using NIAID Grid Engine Cluster
High Performance Computing
§ “A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.” http://en.wikipedia.org/wiki/Cluster_%28computing%29
HPC Glossary
§ node  an individual workstation within a network or cluster; a collection of processors all accessing the same memory (RAM).
§ CPU  abbreviation for central processing unit; the processor of a node. Also referred to as a socket.
  • try: cat /proc/cpuinfo
  • Note that “processors” in the output are actually “cores” by the definition below.
§ core  a separate execution core for calculations; e.g., “dual-core” means the processor has two cores. Sometimes each core is referred to as a separate processor.
§ slot  a single core available for use within a node; e.g., if a node has 16 cores, it will have 16 slots.
§ hyper-threading technology (HTT)  a single execution core is treated as two virtual cores (or two logical processors) by the system. Some of the nodes in the cluster have HTT; e.g., if there are 16 physical cores, there would be 32 logical processors.
§ thread  a single process of a multi-process job; each thread runs on a separate logical processor. E.g., if you run tophat with -p 10, 10 threads will be created and run in parallel.
These definitions are somewhat flexible…
Accessing the NIAID HPC
§ Request an HPC account (for NIAID members and collaborators only)
  • https://hpcweb.niaid.nih.gov/#home
  • “Request Account”
§ Login to the HPC “submit node,” which is the computer from which you submit jobs.
  ssh  secure shell, remote login
  ssh [email protected]
§ Copy files to/from HPC
  scp  secure copy to remote location
  scp -r ~/data/dir [email protected]:~/data/
§ ssh and scp will prompt you to enter your password
Mounting HPC Drives
Mac: see the FAQ link below.
Windows:
1. Click "Start" > "Computer"
2. Click "Map Network Drive"
3. Choose an available drive letter
4. Enter \\ai-hpcfileserver.niaid.nih.gov\bcbb in the “Folder” field, replacing “bcbb” with your group name or your user name.
(For more details, see the link to the FAQ below.)
https://hpcweb.niaid.nih.gov/#support?type=Links&requestType=HPC%20FAQs&name=41
Cluster Architecture and Access
[Diagram, modified from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qrsh -q interactive.q    qsub -q memLong.q
Cluster Queue System: Sun Grid Engine
§ Computers run the Red Hat Linux operating system
§ Grid Engine is a batch queuing system
§ Other queuing systems (http://en.wikipedia.org/wiki/Job_scheduler):
  • Portable Batch System (PBS) (e.g., Biowulf)
  • TORQUE Resource Manager
  • Maui
  • Moab
  • others…
  • Each requires a slightly different syntax for scripts
§ Comes with a set of commands to communicate with the cluster
§ Monitors available resources and users’ workloads to start jobs at the appropriate time
Grid Engine jobs
§ Three types of jobs
• Batch/Serial (one node, one processor)
• Parallel (multiple processors or nodes)
• Interactive
[Diagram: a serial job runs a single Input → Process → Output stream; a parallel job runs several Process streams at once]
Grid Engine Jobs: Interactive
§ Log in to a node (similar to ssh):
  qrsh -l h_vmem=20G
§ Need to specify parameters:
  -l  requested resources in a comma-delimited list
  • For an interactive job: h_vmem=
§ For Biowulf (PBS) (http://biowulf.nih.gov/user_guide.html#interactive):
  qsub -I -V -l nodes=1
Cluster Architecture and Access
[Diagram from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qrsh -q interactive.q
Test TopHat Job in Interactive Session
§ TopHat is a short read aligner for RNA-seq data
§ Manual: http://ccb.jhu.edu/software/tophat/manual.shtml
1. Check dependencies (e.g., PATH)
2. Check command syntax and options
3. Run the command with a test dataset
Grid Engine Jobs: Batch / Serial
§ Single processor, one job
§ Submit a script to the cluster from the submit node, “submit-1”
Cluster Architecture and Access
[Diagram from http://ainkaboot.co.uk/: the Submit Node dispatches jobs to the queues regular.q, interactive.q, and memLong.q]
qsub -q memLong.q script.sh  OR  qsub script.sh  *No queue specification necessary*
Text Editors for Composing Scripts (batch jobs)
§ Not the same as a word processor (e.g., Microsoft Word)!
§ Try some, choose a favorite
§ Popular for Windows:
  • Notepad++ (nice color-coding)
  • EditPad Lite (can open large files, > 4 GB)
§ Popular for Mac: • TextWrangler
§ Popular for Terminal: • nano • vi • emacs
§ http://en.wikipedia.org/wiki/Comparison_of_text_editors
Quick Look at a Shell Script
Exercise 4:
cd ~/unix_hpc/test_data
cat test_serial.sh
§ A few things to notice:
  • #!/bin/bash – “shebang” or “hashbang,” used to specify the program that runs the script
  • qsub options (next slide)
  • export (used to set environment variables)
    PATH=/path/to/folder:/path/to/another/folder:$PATH
    – lets you type just the name of the executable instead of its full path, e.g., “tophat” instead of “/usr/local/bio_apps/tophat/bin/tophat”
  • Comments about when you ran the job
  • The command for the job
*There is a PBS script for Biowulf as well.*
SGE qsub options
qsub [options] script.sh  submit a job to the cluster
-S /bin/bash  shell to use (default is csh)
-N job_name  name for your job
-q queue.q  queue(s) to submit to, e.g., memLong.q,memRegular.q
-M [email protected]  email address to send alerts to
-m abe  when to send email (e.g., aborted, beginning, end)
-l resources  resources to request, e.g., h_vmem=20G,h_cpu=1:00:00,mem_free=10G
-cwd  run from the current working directory; output goes here
-j y  join stderr and stdout into one file
-pe threaded 10  parallel environment and number of processors/threads: “round” means processors could be on separate machines; “threaded” means all processors on the same machine
§ You can put these options on the command line or in your shell script
§ In a script, lines with these options should begin with #$
Submitting jobs with PBS (Biowulf)
§ PBS options and examples for Biowulf: • http://biowulf.nih.gov/user_guide.html#batchsamp
§ Examples • qsub -I -V -l nodes=1 • qsub -l nodes=1 myjob.bat • qsub -l nodes=8:o2800 myparalleljob • qsub -v np=3 -l nodes=2:g24:c24,mem=0 novompi.sh
§ Option lines start with #PBS instead of #$
§ There is application-specific usage documentation for Biowulf as well
Grid Engine Jobs: Batch / Serial
§ Submit a script to the cluster from the submit node
Exercise 5:
cd ~/unix_hpc/test_data  (remember to try tab completion ☺)
qsub test_serial.sh
It should say “Your job XXXXXX ("tophat_test") has been submitted”, where XXXXXX is the job number.
ls -al
Do you see a file called tophat_test.oXXXXXX, where XXXXXX is your job number?
cat tophat_test.oXXXXXX  (substitute your job number for XXXXXX)
Grid Engine Jobs: Parallel
§ -pe environments (threaded, single, etc.)
§ Basic use in script: #$ -pe threaded 8
§ Can also use advanced options, e.g.,
  • "-pe 12threaded 48" means use 12 cores per node, for a total of 48 cores. This will allocate the job to run on 4 nodes with 12 cores each. Your program must be able to support this.
  • "-pe threaded 5-10" means run the job with 10 cores if available, but down to 5 cores is fine too.
§ Do the math for memory!
  • h_vmem is not total, it’s per thread. E.g., if you have a job that needs 10G total, running on 5 processors, you’ll assign h_vmem=2G, not h_vmem=10G.
  • Let’s edit our script to make it run in parallel…
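The per-thread memory arithmetic above can be sketched in the shell; the variable names are invented for the illustration, and the numbers are the example from the slide:

```shell
TOTAL_GB=10                              # total memory the whole job needs (example value)
THREADS=5                                # cores requested with -pe threaded (example value)
PER_THREAD_GB=$((TOTAL_GB / THREADS))    # memory to request per thread
echo "request h_vmem=${PER_THREAD_GB}G"  # h_vmem=2G per thread, not 10G
```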
Edit Shell Script in the Terminal with nano
Navigation in nano:
§ use arrow keys for up, down, left, right
§ Ctrl+a for beginning of line; Ctrl+e for end of line
§ Other commands at bottom of screen, e.g., Ctrl+o, Ctrl+x
Exercise 6:
cd ~/unix_hpc/test_data
Make a new script for the parallel job and open it in nano:
  cp test_serial.sh test_parallel.sh
  nano test_parallel.sh
Add a line to the script with SGE options:
  #$ -pe threaded 4
Modify the tophat command:
  tophat -p 4 …
Save and close: Ctrl+o, [ENTER], Ctrl+x
Now submit the jobs:
  qsub test_serial.sh
  qsub test_parallel.sh
Monitoring Jobs
Exercise 7:
qsub test_tenminutes.sh
qstat  check on submitted jobs
echo $LOGNAME  check your username
qstat -u $LOGNAME  check the status of your jobs
qstat -u $LOGNAME -ext  check resource usage, including memory
qstat -u $LOGNAME -ext -g t  get extended details, including MASTER and SLAVE nodes for parallel jobs
qstat -j job-ID  get detailed information about your job status
qacct -j 999072  see info about a job after it has run
qalter [new qsub options] [job id]  change parameters while a job is in “qw” status
qdel -u username  delete all of your submitted jobs
qdel jobnumber  delete a single job
§ Websites
• Cluster status: http://hpcweb.niaid.nih.gov/#about?type=About%20Links&requestType=Cluster%20Status
• Current State: http://hpcwiki.niaid.nih.gov/index.php/Current_State • Ganglia toolkit: http://cluster.niaid.nih.gov/ganglia/
Contact Us
http://bioinformatics.niaid.nih.gov
Example Script For SGE

#!/bin/bash
## SGE options (see man qsub for more options)
#$ -S /bin/bash                 #type of shell; default is csh
#$ -N tophat_test               #name of job
#$ -q regular.q,memRegular.q    #which queue to submit job to
#$ -M [email protected]       #email address to send email to
#$ -m abe                       #when to send email: aborted, beginning, end
#$ -l h_vmem=5G,h_cpu=1:00:00   #resources (virtual memory, cpu time)
#$ -cwd                         #run the script from current working directory
#$ -j y                         #join stderr and stdout into one job_id.o file

## Script dependencies
#export the path for bowtie (tophat needs this)
export PATH=$PATH:/usr/local/bio_apps/bowtie
export PATH=$PATH:/usr/local/bio_apps/tophat/bin
export PATH=$PATH:/usr/local/bio_apps/samtools/

## Write comments (to make the future you happy)
# Ran tophat on the test dataset - andrew (111013)
#full path to tophat: /usr/local/bio_apps/tophat/bin/tophat
time tophat -r 20 test_ref reads_1.fq reads_2.fq