UNIX Basics and Cluster Computing


Page 1: UNIX Basics and Cluster Computing

1/13/15  


Next-Generation Sequencing Analysis Series

January 14, 2015

Andrew Oler, PhD, High-throughput Sequencing Bioinformatics Specialist, BCBB/OCICB/NIAID/NIH

BCBB instructors for this NGS series: Andrew Oler, Vijay Nagarajan, Mariam Quiñones


Bioinformatics and Computational Biosciences Branch

NIH/NIAID/OD/OSMO/OCICB

Contact BCBB at [email protected]

Contact HPC Cluster team at: [email protected]

Page 2: UNIX Basics and Cluster Computing


Bioinformatics and Computational Biosciences Branch

§  Bioinformatics Software Developers
§  Computational Biologists
§  Project Managers & Analysts

http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx

Objectives

When you leave today, I hope you will be able to
1.  Open a terminal and know how to navigate
2.  Do basic file manipulation and create files and directories from the command line
3.  Submit a job to the HPC cluster

To accomplish these goals, we will
1.  Learn the most useful Unix terminal commands
2.  Practice a few of these commands
3.  Practice preparing and submitting some scripts to the NIAID HPC Cluster

Caveat:
1.  You may not be a Unix expert when you leave today (and that's okay).

Page 3: UNIX Basics and Cluster Computing


Anatomy of the Terminal, “Command Line”, or “Shell”

[Annotated screenshot: the prompt (computer_name:current_directory username), the cursor, a command with its argument, the output, and the terminal window]

Mac: Applications -> Utilities -> Terminal
Windows: download open-source software
  PuTTY: http://www.chiark.greenend.org.uk/~sgtatham/putty/
  Other SSH clients: http://en.wikipedia.org/wiki/Comparison_of_SSH_clients

Page 4: UNIX Basics and Cluster Computing


File Manager/Browser by Operating System

OS:  Windows    Mac OS X   Unix
FM:  Explorer   Finder     Shell

Typical UNIX directory structure

/ “root”

/bin essential binaries

/etc system config

/home user directories

/home/USER1 USER1 home

/home/USER2 USER2 home

/mnt network drives

/sbin system binaries

/usr shared, read-only

/usr/bin other binaries

/usr/local installed packages

/usr/local/bin installed binaries

/var variable data

/var/tmp program caches

pwd “print working directory”; tells where you are

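The hierarchy above can be explored directly with pwd and ls; a short session (the top-level listing will vary by system):

```shell
cd /tmp     # a scratch directory present on virtually every Unix system
pwd         # prints the current location, e.g., /tmp
ls /        # list the top-level directories: bin, etc, home, usr, var, ...
```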

Page 5: UNIX Basics and Cluster Computing


How to execute a command

[Screenshot: a command typed with its argument at the prompt, followed by its output]

Some basic Unix commands

§  pwd
§  ls
§  mkdir
§  cd
§  wget
§  curl
§  cp
§  wc
§  head
§  tail
§  less
§  cat

§  **See Pre-lecture worksheet.**

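A short session exercising several of these commands together (the directory and file names are arbitrary examples):

```shell
mkdir cmd_demo                      # make a new directory
cd cmd_demo                         # move into it
pwd                                 # confirm where you are
printf 'a\nb\nc\n' > letters.txt    # create a small three-line file
head -n 2 letters.txt               # show the first two lines: a, b
tail -n 1 letters.txt               # show the last line: c
wc -l letters.txt                   # count its lines: 3
cp letters.txt copy.txt             # copy the file
ls                                  # list both files
cd ..                               # go back up one level
```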

Page 6: UNIX Basics and Cluster Computing


Tips to make life easier!

Tab completion: hit Tab to make the computer guess your filename.
  type: ls unix[Tab]
  result: ls unix_hpc
  If nothing happens on the first tab, press tab again…
Up Arrow: recall the previous command(s)
Ctrl+a  go to beginning of line
Ctrl+e  go to end of line
Ctrl+c  kill the current running process in the terminal
Aliases (put in ~/.bashrc file … see handout)
  alias ls='ls -AFG'
  alias ll='ls -lrhT'
history   show every command issued during the session
!ls       repeat the previous "ls" command
!!        repeat the previous command
man [command]   read the manual for the command
  man ls        read the manual for the ls command

Accessing the NIAID HPC

§  Login to the HPC "submit node," which is the computer from which you submit jobs.
   ssh   secure shell, remote login
   ssh [email protected]   (fill in XXX with your number)

§  Copy files to/from HPC
   scp   secure copy to a remote location
   scp -r ~/data/dir [email protected]:~/data/

§  ssh and scp will prompt you to enter your password

Page 7: UNIX Basics and Cluster Computing


mv (“move file”)

mv file1 temp/           move "file1" to the "temp" directory
mv file1 file2           rename "file1" to "file2"
mv -i file2 temp/file3   move "file2" to the "temp" directory and rename it "file3"; ask to make sure
  *without -i, it will overwrite an existing file!*
mv *.fastq ~             move all ".fastq" files to the home directory

Exercise 1:
mv *.fastq temp/   (move all ".fastq" files to the "temp" directory)
ls temp            (check that the files are there)

Note: the syntax for mv and cp is similar.

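A minimal run-through of the mv forms above, using throwaway names:

```shell
mkdir -p mv_demo             # a scratch directory to play in
touch file1                  # create an empty file
mv file1 mv_demo/            # move it into "mv_demo"
mv mv_demo/file1 mv_demo/file3   # rename it in place
ls mv_demo                   # shows: file3
```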

rm (“remove file”)

rm file1      delete "file1"
rm -i file2   delete "file2", but ask first
rm *.pdb      delete all ".pdb" files
rm -r temp    delete the "temp" directory
rm -rf temp   delete the "temp" directory, no questions asked!

Be careful with: rm -r *

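A safe demonstration of the rm forms above on a throwaway directory (never run rm -r on anything you have not backed up):

```shell
mkdir -p rm_demo
touch rm_demo/a.pdb rm_demo/b.pdb   # two disposable files
rm rm_demo/*.pdb                    # delete all ".pdb" files
rm -r rm_demo                       # then delete the directory itself
ls rm_demo 2>/dev/null || echo "rm_demo is gone"
```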

Page 8: UNIX Basics and Cluster Computing


File and system information

wc file1     "word count"; output is lines, words, characters
wc *.fastq   "word count" of all fastq files, including a summary
du -h temp   "disk usage" (size) of each file in the "temp" directory (outputs a list)
top          report on the processes using the most system resources (memory, CPU, etc.) on the local machine; "q" to exit

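A quick check of wc on a file whose counts are easy to verify by eye:

```shell
printf 'one two\nthree\n' > sample.txt
wc sample.txt      # lines=2, words=3, characters=14
du -h sample.txt   # size the file takes on disk
```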

File compression

gzip temp/*                       compress every file in "temp"; adds .gz extension
gunzip temp/*.gz                  expand every "gzipped" file in "temp"
tar -zcvf myfiles.tar.gz temp/*   create a single archive (a "tarball") of every file in "temp"
tar -xvf test_data.tar.gz         copy every file out of the archive

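The round trip can be tried on a throwaway directory (names are arbitrary examples):

```shell
mkdir -p tar_demo
printf 'hello\n' > tar_demo/a.txt
gzip tar_demo/a.txt                # a.txt becomes a.txt.gz
gunzip tar_demo/a.txt.gz           # and back again
tar -zcvf myfiles.tar.gz tar_demo  # bundle the directory into a "tarball"
tar -ztvf myfiles.tar.gz           # list the archive contents without extracting
```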

Page 9: UNIX Basics and Cluster Computing


File manipulation

cat file1 file2 > file3   write "file3", containing first "file1", then "file2"
cat file1 >> file2        append "file1" onto "file2"
sort file1                alphabetize "file1"
sort -n file1             sort "file1" by number
sort -n -r -k 2 file1     sort "file1" by the second word or column, in reverse numerical order

Careful: ">" will overwrite an existing "file3"!

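A small file makes the sort variants easy to compare:

```shell
printf '3 b\n1 c\n2 a\n' > nums.txt
sort -n nums.txt                   # numeric sort on column 1: "1 c", "2 a", "3 b"
sort -k 2 nums.txt                 # sort on the second column: "2 a", "3 b", "1 c"
cat nums.txt nums.txt > twice.txt  # concatenate two copies into a new file
wc -l twice.txt                    # 6 lines
```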

grep (search within files)

grep key file*      report the file name and line where "key" appears in file*
grep -v key file*   report the lines in file* that do not match "key"
man grep            see other functions of grep (lots! regular expressions!)

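A quick demonstration on two tiny files (grep -l, which prints only matching file names, is a useful companion to the forms above):

```shell
printf 'alpha\nbeta\n' > g1.txt
printf 'gamma\n' > g2.txt
grep alpha g*.txt      # prints "g1.txt:alpha" (file name and matching line)
grep -v alpha g1.txt   # prints the lines NOT matching: beta
grep -l alpha g*.txt   # prints only the names of files that match: g1.txt
```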

Page 10: UNIX Basics and Cluster Computing


Linking files (making "shortcuts")

ln -s ~/myapp/binary ~/bin   make a shortcut ("symbolic link") in "~/bin" that points to "~/myapp/binary"
ln -s /usr/local data        make a shortcut in the current directory pointing to /usr/local
cd data                      takes you to /usr/local

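The same idea with throwaway names, showing that operations through the link act on the target:

```shell
mkdir -p link_target
ln -s "$PWD/link_target" my_shortcut   # symbolic link to an absolute path
ls -l my_shortcut                      # shows: my_shortcut -> .../link_target
touch my_shortcut/inside.txt           # writes through the link
ls link_target                         # shows: inside.txt
```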

Downloading Files

wget   download multiple files from an ftp or http address
curl   download single files from ftp, http, sftp, etc.
Comparison: http://curl.haxx.se/docs/comparison-table.html


Page 11: UNIX Basics and Cluster Computing


Pipelining

ls | wc               count the number of files in a directory
grep | sort > file1   pull out searched-for lines, sort them, and write a new file

Exercise 2:
head -n 2000 lymph1k.fastq | gzip > head2K.txt.gz

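Pipes can be tried without any input files by generating data with printf:

```shell
printf 'b\na\nb\n' | sort | uniq -c   # count duplicate lines: 1 a, 2 b
ls | wc -l                            # count entries in the current directory
printf '%s\n' line1 line2 line3 | head -n 2 | gzip > first2.gz   # keep and compress the first two lines
gunzip -c first2.gz                   # prints: line1, line2
```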

Loops

for       assign a variable to each of a space-separated list of values
;         use to separate commands
do done   mark the start and end of the loop body to repeat

Exercise 3:
for i in 1 13 200; do echo $i; done
  1
  13
  200
ls
for i in file*; do echo $i; mv "$i" "${i}.txt"; done
  file1
  file2
ls

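The two loop patterns from Exercise 3, runnable as-is (the loopfile names are arbitrary examples):

```shell
# Echo each value in a space-separated list
for i in 1 13 200; do echo "$i"; done   # prints 1, 13, 200 on separate lines

# Rename every matching file by appending a suffix
touch loopfile1 loopfile2
for f in loopfile*; do mv "$f" "${f}.txt"; done
ls loopfile*                            # loopfile1.txt loopfile2.txt
```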

Page 12: UNIX Basics and Cluster Computing


Recommended Reading

Linux in a Nutshell, Sixth Edition, by Ellen Siever, Stephen Figgins, Robert Love, and Arnold Robbins

Running Linux, 5th Edition, by Matthias Kalle Dalheimer and Matt Welsh

UNIX® Shells by Example, Fourth Edition, by Ellie Quigley


Take Away

Use mnemonics

Read "man" pages

Work on copies, make backups, and use "rm", "mv", and ">" carefully

Pick a text editor and master it (pico/nano, emacs, vi/vim, etc.)

Be clever!

Questions?

Page 13: UNIX Basics and Cluster Computing


Using NIAID Grid Engine Cluster


High Performance Computing

§  “A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.” http://en.wikipedia.org/wiki/Cluster_%28computing%29


Page 14: UNIX Basics and Cluster Computing


HPC Glossary

§  node   individual workstation within a network or cluster; a collection of processors all accessing the same memory (RAM).
§  CPU    abbreviation for central processing unit. The processor of a node. Also referred to as a socket.
   •  try cat /proc/cpuinfo
   •  Note that "processors" in the output are actually "cores" by the definition below
§  core   separate execution core for calculations. e.g., "dual-core" means the processor has two cores. Sometimes each core is referred to as a separate processor.
§  slot   a single core available for use within a node. e.g., if a node has 16 cores, it will have 16 slots.
§  Hyper-Threading Technology (HTT)   a single execution core is treated as two virtual cores (or two logical processors) by the system. Some of the nodes in the cluster have HTT. E.g., if there are 16 physical cores, there would be 32 logical processors.
§  thread   a single process of a multi-process job. Each thread runs on a separate logical processor. E.g., if you run tophat with -p 10, 10 threads will be created and run in parallel.

These definitions are somewhat flexible…


Accessing the NIAID HPC

§  Request an HPC account (for NIAID members and collaborators only)
   •  https://hpcweb.niaid.nih.gov/#home
   •  "Request Account"

§  Login to the HPC "submit node," which is the computer from which you submit jobs.
   ssh   secure shell, remote login
   ssh [email protected]

§  Copy files to/from HPC
   scp   secure copy to a remote location
   scp -r ~/data/dir [email protected]:~/data/

§  ssh and scp will prompt you to enter your password


Page 15: UNIX Basics and Cluster Computing


Mounting HPC Drives

Mac: (steps shown as a screenshot on the slide)

Windows:
1.  Click "Start" > "Computer"
2.  Click on "Map Network Drive"
3.  Choose an available drive letter
4.  Enter \\ai-hpcfileserver.niaid.nih.gov\bcbb in the "Folder" field, replacing "bcbb" with your group name or your user name.

(For more details, see the FAQ link below.)
https://hpcweb.niaid.nih.gov/#support?type=Links&requestType=HPC%20FAQs&name=41

Cluster Architecture and Access

[Diagram, image modified from http://ainkaboot.co.uk/: ssh [email protected] connects you to the Submit Node; from there, qrsh -q interactive.q opens an interactive session and qsub -q memLong.q submits to a queue (queues: regular.q, interactive.q, memLong.q)]

Page 16: UNIX Basics and Cluster Computing


Cluster Queue System: Sun Grid Engine

§  Computers run the Red Hat Linux operating system
§  Grid Engine is a batch queuing system
§  Other queuing systems (http://en.wikipedia.org/wiki/Job_scheduler):
   •  Portable Batch System (PBS) (e.g., Biowulf)
   •  TORQUE Resource Manager
   •  Maui
   •  Moab
   •  others…
   •  Each requires a slightly different syntax for scripts
§  Comes with a set of commands to communicate with the cluster
§  Monitors available resources and users' workloads to start jobs at the appropriate time

Grid Engine jobs

§ Three types of jobs

•  Batch/Serial (one node, one processor)

•  Parallel (multiple processors or nodes)

•  Interactive

[Diagram: a batch/serial job runs Input -> Process -> Output on one processor; a parallel job splits the Process step across multiple processors]

Page 17: UNIX Basics and Cluster Computing


Grid Engine Jobs: Interactive

§  Login to a node, similar to ssh:
   qrsh -l h_vmem=20G
§  Need to specify parameters
   -l   requested resources, in a space-delimited list
   •  For an interactive job: h_vmem=
§  For Biowulf (PBS) (http://biowulf.nih.gov/user_guide.html#interactive):
   qsub -I -V -l nodes=1

Cluster Architecture and Access

[Diagram, image from http://ainkaboot.co.uk/: ssh [email protected] to the Submit Node, then qrsh -q interactive.q to start an interactive session on the interactive queue (queues: regular.q, interactive.q, memLong.q)]

Page 18: UNIX Basics and Cluster Computing


Test TopHat Job in Interactive Session

§  TopHat is a short-read aligner for RNA-seq data
§  Manual: http://ccb.jhu.edu/software/tophat/manual.shtml

1.  Check dependencies (e.g., PATH)
2.  Check command syntax and options
3.  Run the command with a test dataset

Grid Engine Jobs: Batch / Serial

§ Single processor, one job

§ Submit a script to the cluster from the submit node, “submit-1”


Page 19: UNIX Basics and Cluster Computing


Cluster Architecture and Access

[Diagram, image from http://ainkaboot.co.uk/: ssh [email protected] to the Submit Node, then qsub -q memLong.q script.sh, or simply qsub script.sh (*no queue necessary*)]

Text Editors for Composing Scripts (batch jobs)

§  Not the same as a word processor (e.g., Microsoft Word)!
§  Try some, choose a favorite
§  Popular for Windows:
   •  Notepad++ (nice color-coding)
   •  EditPad Lite (can open large files, > 4 GB)
§  Popular for Mac:
   •  TextWrangler
§  Popular for Terminal:
   •  nano
   •  vi
   •  emacs
§  http://en.wikipedia.org/wiki/Comparison_of_text_editors

Page 20: UNIX Basics and Cluster Computing


Quick Look at a Shell Script

Exercise 4:
cd ~/unix_hpc/test_data
cat test_serial.sh

§  A few things to notice:
   •  #!/bin/bash
      the "shebang" or "hashbang," used to specify the program to run for the script
   •  qsub options (next slide)
   •  export (used to set environment variables)
   •  PATH=/path/to/folder:/path/to/another/folder:$PATH
      lets you simply type the name of the executable instead of the full path, e.g., "tophat" instead of "/usr/local/bio_apps/tophat/bin/tophat"
   •  Comments about when you ran the job
   •  Command for the job

*PBS script for Biowulf as well.

SGE qsub options

qsub [options] script.sh   submit a job to the cluster

-S /bin/bash      shell to use (default is csh)
-N job_name       name for your job
-q queue.q        queue(s) to submit to, e.g., memLong.q,memRegular.q
-M [email protected]   email address to send alerts to
-m abe            when to send email (e.g., beginning, end, aborted)
-l resources      resources to request, e.g., h_vmem=20G,h_cpu=1:00:00,mem_free=10G
-cwd              run from the current working directory; output goes there
-j y              join stderr and stdout into one file
-pe threaded 10   parallel environment and number of processors/threads: "round" means processors could be on separate machines, "threaded" means all processors on the same machine

§  You can put these options on the command line or in your shell script
§  Lines with these options should begin with #$

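Putting a few of these options together, a minimal batch script might look like the sketch below (the job name, memory value, and job command are illustrative placeholders, not the course's test script; #$ lines are read by qsub but are ordinary comments to bash):

```shell
#!/bin/bash
#$ -S /bin/bash     # shell to use
#$ -N example_job   # job name (placeholder)
#$ -cwd             # run from the current working directory
#$ -j y             # merge stderr into stdout
#$ -l h_vmem=2G     # requested memory (per slot)

# The job itself: record where and when it ran
echo "job started on $(hostname) at $(date)" > job.log
```

On the cluster this would be submitted with qsub; run directly with bash, the #$ lines are ignored and only the final command executes.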

Page 21: UNIX Basics and Cluster Computing


Submitting jobs with PBS (Biowulf)

§  PBS options and examples for Biowulf:
   •  http://biowulf.nih.gov/user_guide.html#batchsamp
§  Examples
   •  qsub -I -V -l nodes=1
   •  qsub -l nodes=1 myjob.bat
   •  qsub -l nodes=8:o2800 myparalleljob
   •  qsub -v np=3 -l nodes=2:g24:c24,mem=0 novompi.sh
§  Option lines start with #PBS instead of #$
§  Application-specific usage for Biowulf as well

Grid Engine Jobs: Batch / Serial

§  Submit a script to the cluster from the submit node

Exercise 5:
cd ~/unix_hpc/test_data   (remember to try tab completion :-) )
qsub test_serial.sh
It should say "Your job XXXXXX ("tophat_test") has been submitted", where XXXXXX is the job number.
ls -al
Do you see a file called tophat_test.oXXXXXX, where XXXXXX is your job number?
cat tophat_test.oXXXXXX   (substitute your job number for XXXXXX)

Page 22: UNIX Basics and Cluster Computing


Grid Engine Jobs: Parallel

§  pe commands (threaded, single, etc.)
§  Basic use in script: #$ -pe threaded 8
§  Can also use advanced options, e.g.,
   •  "-pe 12threaded 48" means use 12 cores per node, for a total of 48 cores needed. This will allocate the job to run on 4 nodes with 12 cores each. Your program must be able to support this.
   •  "-pe threaded 5-10" means run the job with 10 cores if available, but down to 5 cores is fine too.
§  Do the math for memory!
   •  h_vmem is not the total; it's per thread. E.g., if you have a job that needs 10G total, running on 5 processors, you'll assign h_vmem=2G, not h_vmem=10G.
   •  Let's edit our script to make it run parallel…

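The per-thread memory rule can be sketched as shell arithmetic (values are the example from the slide):

```shell
TOTAL_GB=10    # total memory the job needs
THREADS=5      # value passed to -pe threaded
PER_THREAD=$(( TOTAL_GB / THREADS ))   # memory to request per slot
echo "use: -pe threaded ${THREADS} and -l h_vmem=${PER_THREAD}G"
# prints: use: -pe threaded 5 and -l h_vmem=2G
```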

Edit Shell Script in the Terminal with nano

Navigation in nano:
§  use arrow keys for up, down, left, right
§  Ctrl+a for beginning of line; Ctrl+e for end of line
§  Other commands at bottom of screen, e.g., Ctrl+o, Ctrl+x

Exercise 6:
cd ~/unix_hpc/test_data
Make a new script for the parallel job and open it in nano:
  cp test_serial.sh test_parallel.sh
  nano test_parallel.sh
Add a line to the script with SGE options:
  #$ -pe threaded 4
Modify the tophat command:
  tophat -p 4 …
Save and close:
  Ctrl+o, [ENTER]
  Ctrl+x
Now submit the jobs:
  qsub test_serial.sh
  qsub test_parallel.sh

Page 23: UNIX Basics and Cluster Computing


Monitoring Jobs

Exercise 7:
qsub test_tenminutes.sh

qstat                     check on submitted jobs
echo $LOGNAME             check your username
qstat -u $LOGNAME         check the status of your jobs
qstat -u $LOGNAME -ext    check resource usage, including memory
qstat -u $LOGNAME -ext -g t   get extended details, including MASTER and SLAVE nodes for parallel jobs
qstat -j job-ID           get detailed information about your job status
qacct -j 999072           see info about a job after it has run
qalter [new qsub options] [job id]   change parameters while a job is in "qw" status
qdel -u username          delete all of your submitted jobs
qdel jobnumber            delete a single job

§  Websites
   •  Cluster status: http://hpcweb.niaid.nih.gov/#about?type=About%20Links&requestType=Cluster%20Status
   •  Current State: http://hpcwiki.niaid.nih.gov/index.php/Current_State
   •  Ganglia toolkit: http://cluster.niaid.nih.gov/ganglia/

Contact Us

[email protected]    

[email protected]    

h5p://bioinforma;cs.niaid.nih.gov  

46

Page 24: UNIX Basics and Cluster Computing


Example Script For SGE

#!/bin/bash
## SGE options (see man qsub for more options)
#$ -S /bin/bash                #type of shell. default is csh
#$ -N tophat_test              #name of job
#$ -q regular.q,memRegular.q   #which queue to submit job to
#$ -M [email protected]   #email address to send email to
#$ -m abe                      #when to send email: aborted, beginning, end
#$ -l h_vmem=5G,h_cpu=1:00:00  #resources (virtual memory, cpu time)
#$ -cwd                        #run the script from current working directory
#$ -j y                        #join stderr and stdout into one job_id.o file

## Script dependencies
#export the path for bowtie (tophat needs this)
export PATH=$PATH:/usr/local/bio_apps/bowtie
export PATH=$PATH:/usr/local/bio_apps/tophat/bin
export PATH=$PATH:/usr/local/bio_apps/samtools/

## Write comments (to make the future you happy)
# Ran tophat on the test dataset - andrew (111013)

#full path to tophat: /usr/local/bio_apps/tophat/bin/tophat
time tophat -r 20 test_ref reads_1.fq reads_2.fq

Callouts: #!/bin/bash is the "hashbang," specifying the program used to run the script; the #$ lines are qsub options; export is the command for setting environment variables; the final line is the command for the job.