STAT115 STAT215 BIO512 BIST298
Introduction to Computational Biology and Bioinformatics
Spring 2015
Xiaole Shirley Liu
Please Fill Out Student Sign In Sheet
Bioinformatics and Computational Biology
• Interdisciplinary – Statistics, Biology, Computer Science
• Applied
  – From freshmen to postdocs
  – Useful training for many
  – The more you practice, the better you get
• Moves with technology development
The Protein Sequence and Structure Wave
• 1955: Sanger sequenced bovine insulin
• 1970: Needleman-Wunsch algorithm
• 1971: PDB
• 1990: BLAST
• 1994: BLOCKS database
• 1994-: CASP
• 1997-: Proteomics
The Microarray Wave
• A microarray contains hundreds to millions of tiny probes
• Simultaneously detect how much each gene is expressed
ALL vs AML
• Golub et al, Science 1999.
ALL vs AML
“Microarrays” Today
• Infer the expression values of all genes from 1000 probes
• High throughput drug screen
The DNA Sequencing Wave
• 1953: DNA structure
• 1972: Recombinant DNA
• 1977: Sanger sequencing
• 1985: PCR
• 1988: NCBI
• 1990: BLAST
Sequencing in the 1970s
The Human Genome Race
• Human Genome Project: 1990-2003
  – Originally 1990-2005
  – Boosted by technology improvement and automation
  – Competition from Celera
Human Genome Sequencing
• Clone-by-clone and whole-genome shotgun
The Human Genome Race
• Human Genome Project: 1990-2003
  – Originally 1990-2005
  – Boosted by technology improvement and automation
  – Competition from Celera
• Informatics essential for both the public and private sequencing efforts
  – Sequence assembly and gene prediction
  – Working drafts finished simultaneously in spring 2000
Sequencing in 2001
Sequencing in 2007
Sequencing Today
• Personal genome sequencing
• HiSeq X
  – 900GB data / flow cell in < 3 days, 10 * 30X human genomes, at ~$1.5-2K / sample
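The throughput figures on this slide can be checked with simple coverage arithmetic. The sketch below is a rough illustration, not part of the lecture. It assumes that "900GB" means about 900 gigabases of sequence and rounds the haploid human genome to 3 gigabases. Both are assumptions, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the throughput figures quoted on the slide.
# Assumptions (not from the slide): "900GB" is read as ~900 gigabases of
# sequence, and the haploid human genome is rounded to 3 gigabases.

GENOME_SIZE_BASES = 3.0e9      # assumed haploid human genome size
FLOW_CELL_YIELD_BASES = 900e9  # yield per flow cell, read as bases
TARGET_COVERAGE = 30           # "30X": each base read ~30 times on average


def genomes_per_flow_cell(yield_bases, genome_size, coverage):
    """Number of genomes one flow cell can cover at the given depth."""
    bases_per_genome = genome_size * coverage
    return yield_bases / bases_per_genome


def mean_coverage(yield_bases, genome_size, n_samples):
    """Average depth per sample when the yield is split evenly."""
    return yield_bases / (genome_size * n_samples)


n_genomes = genomes_per_flow_cell(
    FLOW_CELL_YIELD_BASES, GENOME_SIZE_BASES, TARGET_COVERAGE
)
depth = mean_coverage(FLOW_CELL_YIELD_BASES, GENOME_SIZE_BASES, 10)
print(f"{n_genomes:.0f} genomes at {TARGET_COVERAGE}X per flow cell")
print(f"{depth:.0f}X mean depth when 10 samples share a flow cell")
```

Under these assumptions, 900 gigabases divided by 3 gigabases times 30 gives ten genomes, which matches the "10 * 30X" figure on the slide.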
Personalized Disease Susceptibility Test and Treatment
Big Data Challenges
All biology is becoming computational, much the same way it has become molecular … Otherwise “low input, high throughput and no output science”
--- Sydney Brenner
2002 Nobel Prize
Class Information
• Course website:
  – http://stat115.org/
  – Video recording / slides online
  – Office hours, auditing
  – Background: CS, Stats, Biology
• Roughly 3 modules (2 HW each)
  – Transcriptome (microarrays and RNA-seq)
  – Gene regulation (transcriptional & epigenetic regulation)
  – Human genetics and disease (GWAS / cancer)
Class Information
• Teaching Fellows
  – Yang Li
  – Stephanie Chan
• Labs:
  – Wed 6-8pm, Science Center B09
  – Tue 6-8pm, HSPH Kresge 209, Boston
  – First Lab: Fri 1/30 3-5pm (Odyssey)!
HW and Grading
• Discussion forum: stat115.slack.com
• Submission email: [email protected]
• HW 6 * 10 or 6 * 12
• Final exams 20
• Class participation: 20
• Algorithm videos: 5
• Lecture notes: extra 5 points
• Late days
Gene Expression Microarrays
Expression Microarrays
• Grow cells under a given condition, collect the mRNA population, and label it
• A microarray has high-density (thousands to millions) sequence-specific probes at known locations for each gene/RNA
• The sample hybridizes to microarray probes by DNA (A-T, G-C) base pairing; non-specific binding is washed away
• Measure sample mRNA abundance by reading the labeled signal at each probe location
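The base-pairing step can be illustrated with a toy model. This is a minimal sketch under strong assumptions: a probe is treated as binding only when it is perfectly complementary to part of a labeled fragment, whereas real hybridization tolerates some mismatches and depends on temperature and sequence composition. The sequences, probe names, and function names are all hypothetical.

```python
# Toy model of probe-target hybridization by Watson-Crick base pairing.
# All sequences and names below are made up for illustration.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe, fragment):
    """True if the probe is perfectly complementary to part of the fragment."""
    return reverse_complement(probe) in fragment


def probe_signals(probes, labeled_fragments):
    """For each probe, count the labeled fragments that would bind it."""
    return {
        name: sum(hybridizes(seq, frag) for frag in labeled_fragments)
        for name, seq in probes.items()
    }


# Hypothetical labeled sample and two hypothetical probes.
fragments = ["ATGGCGTACGTT", "TTGCGTACAA", "CCCCAAAATTTT"]
probes = {"geneA": "GTACGC", "geneB": "TTTTGGGG"}

signals = probe_signals(probes, fragments)
print(signals)
```

In this toy sample, two fragments carry the sequence complementary to the geneA probe and one carries the complement of the geneB probe, so geneA gives the stronger signal.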
Affymetrix GeneChip Arrays
Labeled Samples Hybridize to DNA Probes on GeneChip
Shining Laser Light Causes Tagged Fragments to Glow
Perfect Match (PM) vs MisMatch (MM) (control for cross hybridization)
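The PM/MM idea can be sketched as follows. The slide only states that MM probes control for cross hybridization. The sketch assumes the commonly described GeneChip design, in which the MM probe is the 25-base PM probe with its central (13th) base complemented. It also uses a naive PM minus MM subtraction purely as an illustration, not as a recommended analysis method. The probe sequence, intensity values, and function names are hypothetical.

```python
# Sketch of a PM/MM probe pair, assuming the MM probe differs from the PM
# probe only at the central base (an assumption, not stated on the slide).

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm_probe):
    """Build an MM probe by complementing the central base of the PM probe."""
    mid = len(pm_probe) // 2  # index 12 (the 13th base) for a 25-mer
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


def background_corrected(pm_signal, mm_signal):
    """Naive PM - MM difference, floored at zero (illustration only)."""
    return max(pm_signal - mm_signal, 0.0)


# Hypothetical 25-base PM probe and hypothetical scanner intensities.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
n_differences = sum(a != b for a, b in zip(pm, mm))

print(pm)
print(mm)
print(background_corrected(1200.0, 300.0))
```

Because the MM probe differs from the PM probe at a single base, signal on the MM probe is read as an estimate of non-specific binding at that probe pair.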
NimbleGen Arrays
Agilent Arrays
Microarrays
• Array comparison:
  – # probes / array, # probes / gene, probe length
  – Flexibility vs data reuse
• Why do we bother learning about microarrays now?
  – RNA-seq is probably preferred in new expression experiments
  – The amount of useful public data
  – The data analysis techniques