![Page 1: Applied Bioinformatics - Vanderbilt Universitybioinfo.vanderbilt.edu/zhanglab/lectures/AB2014Lecture01.pdfThe Cancer Genome Atlas (TCGA) " Mission (Bio) # To accelerate the understanding](https://reader035.vdocuments.net/reader035/viewer/2022071105/5fdf0463aa69ba50967e6b47/html5/thumbnails/1.jpg)
Applied Bioinformatics Course Overview & Introduction to Linux
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
What is bioinformatics
Diagram: Bioinformatics
- Bio: Hypotheses, Questions, Samples, Experiments
- Data: DNA, RNA, Protein, Metabolite, Phenotype; Sequence, Expression, Structure, Interaction
- Informatics: Storage/retrieval, Visualization, Computational methods, Statistical methods
Genomic sequences
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Human genome project (1990-2003)
First bacterial (H. influenzae)
First eukaryote (yeast)
First metazoan (C. elegans)
http://www.genomesonline.org
Completely Sequenced Genomes September 2012
Genome sequencing costs plunge
The Cancer Genome Atlas (TCGA)
- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression
Why now?
Diagram: informatics
- Bio: Hypotheses, Questions, Samples, Experiments
- Data: DNA, RNA, Protein, Metabolite, Phenotype; Sequence, Expression, Structure, Interaction
- Informatics: Storage/retrieval, Visualization, Computational methods, Statistical methods
Roles for different investigators in bioinformatics
- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists
Graph courtesy of http://www.incogen.com/
Comprehensive resource list
01/2012
01/2013
http://bioinformatics.ca/links_directory/
Sequence and structure databases
- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011
Genome browsers
- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html
Gene-centric databases
- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval
Gene expression data
- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/
Pathway and network resources
- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com
Course content and grades
Applied Bioinformatics
IGP300B Bioregulation II, Spring 2014
(M/W/F, 10:00-10:55am, Location TBA)
Module director: Bing Zhang, Ph.D. ([email protected]; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)
Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.
| Date | Room | Subject | Instructor | Homework (HW) / Project |
|------|------|---------|------------|-------------------------|
| 2/14 | 206 PRB | Course overview & Introduction to Linux | Zhang | |
| 2/17 | 407 A-C LH | Pairwise sequence alignment | Zhao | |
| 2/19 | 407 A-C LH | Multiple sequence alignment | Zhao | |
| 2/21 | 206 PRB | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts) |
| 2/24 | 407 A-C LH | Gene prediction | Bush | |
| 2/26 | 407 A-C LH | Gene regulatory elements and conservation | Bush | HW I due |
| 2/28 | 208 LH | In silico and In clinico characterization of genetic variations | Bush | HW II distribution (20 pts) |
| 3/3 | 407 A-C LH | Supervised analysis of gene expression data | Zhang | |
| 3/5 | 407 A-C LH | Unsupervised analysis of gene expression data | Zhang | HW II due |
| 3/7 | 206 PRB | Functional interpretation of gene lists | Zhang | |
| 3/10 | 411 A-C LH | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts) |
| 3/12 | 407 A-C LH | Data Analysis Project | Zhang & Liu | |
| 3/14 | 208 LH | | | HW III due |
| 3/17 | 407 A-C LH | | | |
| 3/19 | 415 A-C LH | Project presentation | | Project presentation (40 pts) |
| 3/21 | 206 PRB | | | |

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the three HW scores and the project presentation score (100 pts in total).
A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
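The grading rule above is simple arithmetic, and it can be sketched as a short shell script. This is only an illustration: the function name `letter_grade` and the example scores are made up, and only the point totals and letter cutoffs come from the slide.

```shell
#!/bin/sh
# Hypothetical helper: map a total score (0-100) to a letter grade
# using the cutoffs listed on the slide.
letter_grade() {
    score=$1
    if [ "$score" -ge 85 ]; then echo A
    elif [ "$score" -ge 75 ]; then echo B
    elif [ "$score" -ge 65 ]; then echo C
    elif [ "$score" -ge 55 ]; then echo D
    else echo F
    fi
}

# Made-up example scores: three homeworks (20 pts each) and the project (40 pts).
hw1=18; hw2=15; hw3=17; project=33
total=$((hw1 + hw2 + hw3 + project))
echo "Total: $total  Grade: $(letter_grade "$total")"   # prints: Total: 83  Grade: B
```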
Course materials and assignments
- Lecture slides available at https://sites.google.com/site/vandyigp/bioregulation-ii/minimester-2/applied-bioinformatics
- Homework assignments available at the same URL on the distribution date (2/21, 2/28, 3/10)
- Homework assignments are due on paper at the beginning of class on the due date (2/26, 3/5, 3/14). There will be a 10% per day deduction for late reports.
- Start thinking about forming project teams (~5 people per team)
- Instructor contact information
  - Dr. Bing Zhang: [email protected]
  - Dr. Zhongming Zhao: [email protected]
  - Dr. William Bush: [email protected]
  - Dr. Qi Liu: [email protected]
ACCRE
- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad- or hex-core processors
- Linux system
  - An operating system (OS) like Windows or Mac
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high-performance computing clusters
Get an ACCRE account
- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA
Logging onto the cluster and changing your password
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Use Spotlight to find the application: Terminal
  - Command: ssh [email protected]
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit
Logging onto the cluster and changing your password (using SSH in Windows)
Logging onto the cluster and changing your password (using Terminal in Mac)
You won’t see any response while typing your password, which is fine.
Hierarchical file system
- /
  - bin: chmod, cp, date, grep, mv, rm, vi
  - usr
    - bin: diff, find, gcc, id, make, perl, ssh
    - lib: libc.so, libgpfs.so, libjpeg.so, libstdc++.so
  - home
    - igptest
      - bin: myprog.sh, dothis.pl, dothat.py
      - docs
      - src: prog1.c, prog2.f77, prog3.cpp
    - annie
    - cody
  - scratch
  - etc
  - tmp

Example paths:
- /home
- /home/igptest
- /home/igptest/src/prog3.cpp
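To make the path notation concrete, the sketch below rebuilds a small piece of this tree inside a throwaway directory and refers to the same file by a full path and by a relative path. The temporary root created with `mktemp -d` is an assumption for safe experimentation; on the cluster the tree starts at `/` itself.

```shell
#!/bin/sh
# Build a small throwaway copy of part of the slide's tree under a temporary
# directory (the real /home/igptest on the cluster is not touched).
root=$(mktemp -d)
mkdir -p "$root/home/igptest/src" "$root/home/igptest/bin" "$root/home/igptest/docs"
touch "$root/home/igptest/src/prog3.cpp"

# Full path: spelled out from the top of the (sandboxed) tree.
ls "$root/home/igptest/src/prog3.cpp"

# Relative path: the same file, named from inside home/igptest.
cd "$root/home/igptest"
ls src/prog3.cpp

# dirname and basename split a path into its directory and file parts.
dirname  /home/igptest/src/prog3.cpp   # prints /home/igptest/src
basename /home/igptest/src/prog3.cpp   # prints prog3.cpp
```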
Working with directories
- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
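A short session using these commands might look like the following. It is a sketch that runs in a scratch directory (created with `mktemp -d`, an assumption so nothing in your real home directory is touched), and the directory name `project` is made up.

```shell
#!/bin/sh
work=$(mktemp -d)     # scratch area for the experiment
cd "$work"
pwd                   # print the present working directory
mkdir project         # make a directory (hypothetical name)
ls                    # prints: project
cd project            # move into it
cd ..                 # back up to the parent directory
cd .                  # "." is the current directory, so this stays put
rmdir project         # succeeds because the directory is empty
# "cd ~" or plain "cd" would return to your home directory
```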
Working with files
- more (displays the contents of a file)
  - space bar to show the next page
  - q to exit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
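The file commands can be tried safely in a scratch directory, as in this sketch. The file and directory names are made up, and `cat` stands in for `more` because `more` waits for key presses and is awkward inside a script.

```shell
#!/bin/sh
work=$(mktemp -d)                       # scratch area for the experiment
cd "$work"
printf 'line 1\nline 2\n' > notes.txt   # create a small example file
cat notes.txt                           # print it all at once ("more notes.txt" pages through it)
cp notes.txt backup.txt                 # copy a file
mv backup.txt archive.txt               # rename a file
mkdir old
mv archive.txt old/                     # move a file into a directory
rm notes.txt                            # remove a file (there is no undo)
ls old                                  # prints: archive.txt
```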
Getting help
- man (displays manual pages for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show the next page
  - q to exit
- Options for ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
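The effect of these options is easiest to see with one hidden and one visible file. The sketch below uses a scratch directory and made-up file names; `man ls` is left commented out because it is interactive.

```shell
#!/bin/sh
work=$(mktemp -d)             # scratch area for the experiment
cd "$work"
touch visible.txt .hidden     # names starting with "." are hidden by default
ls                            # shows visible.txt only
ls -a                         # also shows ., .., and .hidden
ls -l                         # long format: permissions, owner, size, date
ls -al                        # both options combined
# man ls                      # full manual; press space for the next page, q to exit
```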
Editing files with nano
- cd (change to your home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed each time a new interactive bash shell starts)
- Add "setpkgs -a R" to the end of the file, using a plain hyphen before the a (this lets you use the R environment for statistical computing, which is installed on the ACCRE system)
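The same change can also be made without an editor. The sketch below appends the line only when it is missing, so running it twice does not duplicate it. It works on a temporary stand-in file (an assumption for safe testing); on the cluster you would set `rc` to `"$HOME/.bashrc"` instead.

```shell
#!/bin/sh
# Stand-in for ~/.bashrc so the sketch can be tried safely.
rc=$(mktemp)
line='setpkgs -a R'    # plain ASCII hyphen, as the command expects

# Append the line only if an identical line is not already present.
grep -qxF -e "$line" "$rc" || echo "$line" >> "$rc"
grep -qxF -e "$line" "$rc" || echo "$line" >> "$rc"   # second run changes nothing

tail -n 1 "$rc"        # prints: setpkgs -a R
```

The change takes effect in shells started after the edit, so log out and back in (or start a new shell) before trying R.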
Copying files to/from a local computer
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don’t change other items
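For readers who prefer the command line, `scp` is a common alternative to the graphical clients above. The slides do not mention it, so treat this as a sketch that assumes the login host accepts `scp` connections. The commands are printed rather than executed; `results.txt` is a made-up local file, while the host, the username placeholder, and the sample file path are taken from the slides.

```shell
#!/bin/sh
user=your_user_name                       # placeholder username from the slide
host=vmplogin.accre.vanderbilt.edu        # login host from the slide
remote=/home/igptest/sample_file.txt      # file named on the homework slide

# Print the commands instead of running them; remove "echo" to execute.
echo scp "${user}@${host}:${remote}" .          # cluster -> current local directory
echo scp results.txt "${user}@${host}:~/"       # local file (hypothetical) -> cluster home
```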
Copying files to/from a local computer (using SSH in Windows)
Copying files to/from a local computer (using Fugu in Mac)
Homework
- Get an ACCRE account
- Log onto the cluster and change your password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt from the directory /home/igptest to your home directory
- Add "setpkgs -a R" to the end of your .bashrc file