Genomics and Bioinformatics The "new" biology

Post on 19-Dec-2015


Page 1

Genomics and Bioinformatics

The "new" biology

Page 2

What is genomics?

Genome
  All the DNA contained in the cell of an organism

Genomics
  The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
  A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)

Which organism's genome was sequenced first?

Page 3

Genome sequencing chronology

Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~1,700
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000

http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg
http://www.alsa.org/research/article.cfm?id=822
http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg
http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg

Page 4

Genome sequencing chronology

Year  Organism                Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans  First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22     First human chromosome        49,000,000        673
2000  Arabidopsis thaliana    First plant genome            150,000,000       ~25,000
2001  Human                   First human genome            3,000,000,000     ~30,000

http://www.sih.m.u-tokyo.ac.jp/chem1.gif

http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg

Page 5

Genome sequencing projects (as of 1/26/2007)

Page 6

Sequencing strategies: Hierarchical shotgun sequencing

http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

Page 7

Genome size range

What is in the genomes? Why is there such a big difference?

[Figure: genome size ranges on a log scale from 10^4 to 10^11 for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]

Page 8

Information contents in a genome

Gene
  Protein coding genes
  RNA genes

Regulatory elements
  Gene expression control
  Chromatin remodeling
  Matrix attachment sites

“Non-functional” elements
  Selfish elements
  “Junk” DNA ??

Page 9

The “central dogma” of molecular biology

Central dogma

DNA --(transcription)--> RNA --(translation)--> Protein
DNA --(replication)--> DNA

Page 10

Expanded “central dogma” of molecular biology

A more comprehensive view

[Diagram: the central dogma (DNA, RNA, protein; replication, transcription, translation) extended with metabolite and phenotype]

Page 11

New disciplines due to the advance in genomics

Omics

[Diagram: the expanded central dogma (DNA, RNA, protein, metabolite, phenotype; replication, transcription, translation) annotated with disciplines and data types]

Disciplines
  Structural genomics
  Transcriptomics
  Proteomics
  Metabolomics

Data types
  Genomic DNA sequences
  Transcript sequences
  Microarray data
  Cis-elements
  TF binding sites
  Epigenetic regulation
  Shotgun protein sequencing
  Subcellular location
  Post-translational modification
  Protein interaction
  Protein structure
  Metabolite concentration
  Metabolic flux
  Genetic interactions
  Systematic knockout (KO)
  Disease information

Page 12

Nature omics gateway

http://www.nature.com/omics/subjects/index.html

Page 13

Three perspectives of our biological world

The cellular level, the individual, the tree of life

~10^14 cells per individual
2-100 x 10^6 species
~3 x 10^4 genes

Rosenzweig et al., 2002. Conservation Biol.
Image: http://www.tolweb.org/tree/
Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html

Page 14

Further complications

Cell-cell interactions

Cell types

Environmental conditions

Developmental programming

Interactions at the organismal level

Interactions at the population, ecosystem level

Page 15

Definition of bioinformatics

Bioinformatics
  Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational biology
  The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Q: What kinds of data are we talking about?

http://www.bisti.nih.gov/

Page 16

Example: Sequence assembly

Cut the genome into ~150 kb pieces

Clone the pieces into Bacterial Artificial Chromosomes (BACs)

Map the BAC clones to determine their order (golden/tiling path)

Shear a BAC clone randomly

Sequence the fragments

Assemble the sequence reads

http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

Page 17

Sequence assembly

Challenges
  The presence of gaps
    Due to incomplete coverage
  Sequencing error and quality issues: worse at the end of reactions
    So can’t rely on perfectly identical sequences all the time
  Sequences derived from one strand of DNA
    Need to take orientations of reads into account
  Non-random sequencing of DNA
  Presence of repeats

http://www.cbcb.umd.edu/research/assembly_primer.shtml

[Figure: correct layout vs. mis-assembly]

Page 18

Overlap-layout consensus

The relationships between reads can be represented as a graph
  Nodes (vertices): reads
  Edges (lines): connecting “overlapping reads”

Goal: identify a path through that graph that visits each node exactly once (a Hamiltonian path)

[Figure: reads 1 to 4 laid out along the genome, and the path 1-2-3-4 through the graph]

http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
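The overlap, layout, and consensus steps can be sketched on a toy example. This is a minimal illustration, not a real assembler: the reads are hypothetical and error-free, all come from the same strand, and the layout is found by brute force, which is only feasible for a handful of reads because finding a Hamiltonian path is NP-hard in general. The function names and the `min_len` threshold are assumptions made for this sketch.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap step: directed edge (i, j) -> overlap length for each ordered pair of reads."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k > 0:
                    graph[(i, j)] = k
    return graph


def hamiltonian_layout(reads, graph):
    """Layout step: brute-force search for an order that visits every read exactly once."""
    for order in permutations(range(len(reads))):
        if all((u, v) in graph for u, v in zip(order, order[1:])):
            return list(order)
    return None


def consensus(reads, order, graph):
    """Consensus step: merge reads along the layout, skipping each overlapping prefix."""
    seq = reads[order[0]]
    for u, v in zip(order, order[1:]):
        seq += reads[v][graph[(u, v)]:]
    return seq


# Hypothetical reads sampled from the toy genome ATGGCTAACGT.
reads = ["GGCTA", "ATGGC", "CTAAC", "AACGT"]
graph = build_overlap_graph(reads)
order = hamiltonian_layout(reads, graph)
assembled = consensus(reads, order, graph)
print(order, assembled)
```

Real assemblers also have to handle the challenges listed on the previous slide (sequencing errors, reverse-complement reads, repeats), none of which this sketch attempts.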

Page 19

Example: Gene prediction

How can we identify functional elements in the genomes?

How can we assign functions to these elements?

How can we determine/predict the structures of these elements?

How can we reconstruct networks describing the relationships and dynamics between these elements?

How can we link genotypes to phenotypes?

Page 20

Characteristics of protein coding genes

Similarity to other genes
  Assuming there is some level of conservation.
  Substitutions that change amino acids vs. those that don’t.

http://www.mun.ca/biology/scarr/MGA2_03-20.html

Page 21

Hidden Markov Model and gene finding

Goal: Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish)

How is the probability determined?

p = p(EL-CHI)*p(CHI-MAD) = 0.5*0.4 = 0.2
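The path probability on this slide is a product of step probabilities, which can be sketched as below. Only the EL-CHI (0.5) and CHI-MAD (0.4) values come from the slide; the alternative route through "DET" and its probabilities are hypothetical, added so that there is something to compare. A full hidden Markov model would also carry hidden states and emission probabilities and would typically use the Viterbi algorithm rather than enumerating paths; this sketch covers only the multiplication step.

```python
from math import prod

# Transition probabilities between stops.
# EL->CHI and CHI->MAD are from the slide; the DET entries are hypothetical.
transition = {
    ("EL", "CHI"): 0.5,
    ("CHI", "MAD"): 0.4,
    ("EL", "DET"): 0.5,   # hypothetical
    ("DET", "MAD"): 0.2,  # hypothetical
}


def path_probability(path):
    """Probability of a path = product of the probabilities of its consecutive steps."""
    return prod(transition[(a, b)] for a, b in zip(path, path[1:]))


def best_path(paths):
    """Pick the candidate path with the highest probability."""
    return max(paths, key=path_probability)


candidates = [["EL", "CHI", "MAD"], ["EL", "DET", "MAD"]]
for path in candidates:
    print("-".join(path), path_probability(path))
print("best:", "-".join(best_path(candidates)))
```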

Page 22

Example: Sequence alignment

Align retinol-binding protein (RBP) and β-lactoglobulin

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIV

>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Page 23

Goal of pairwise sequence alignment (PSA)

Find an alignment between two sequences with the maximum score
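One common way to find a maximum-scoring alignment is dynamic programming. The sketch below computes the best global alignment score in the style of Needleman-Wunsch; the slide does not say whether a global or local alignment is meant, and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration. Protein alignments normally use a substitution matrix such as BLOSUM62 in place of the simple match/mismatch scores.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a simple (assumed) scoring scheme."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "ACT"))
```

Recovering the alignment itself, rather than only its score, requires a traceback through the table, which is omitted here for brevity.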

Page 24

Extreme value distribution

Normal vs. extreme value distribution

[Figure: probability (y-axis, 0 to 0.40) against x (-5 to 5) for the normal distribution and the extreme value distribution]
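The two curves can be compared numerically. The sketch below assumes the standard normal density and the standard Gumbel form of the extreme value density, exp(-x - exp(-x)); the slide does not state which parameterization it plots, so the default location and scale values here are assumptions.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def extreme_value_pdf(x: float, mu: float = 0.0, beta: float = 1.0) -> float:
    """Density of the Gumbel (type I extreme value) distribution."""
    z = (x - mu) / beta
    return math.exp(-z - math.exp(-z)) / beta


# Tabulate both densities over the x range shown on the slide.
for x in range(-5, 6):
    print(f"{x:>3} {normal_pdf(x):.4f} {extreme_value_pdf(x):.4f}")
```

Under these assumptions the extreme value density is skewed: it falls off much more slowly than the normal density on the right and much faster on the left, which matters when judging how unusual a high alignment score is.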

Page 25

Example: Microarray

A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion

http://shadygrove.umbi.umd.edu/microarray/Microarray.gif

Page 26

Microarray data analysis

A simplified pipeline

http://www.microarray.lu/images/overview_1.jpg

Page 27

What’s in the CEL files

Intensities of perfect match and mismatch probes

# M is assumed to be an AffyBatch built from the CEL files,
# e.g. with ReadAffy() from the Bioconductor affy package
library(affy)

#### Dimension of the data matrix
nrow(M); ncol(M)

### Perfect match
pm <- pm(M)    # perfect match intensities
dim(pm)        # dimension of the pm matrix
pm[1:5,]       # the first five rows
summary(pm)    # summary stat for the pm matrix

     GSM131151.CEL GSM131152.CEL GSM131153.CEL GSM131160.CEL GSM131161.CEL GSM131162.CEL
[1,]         252.5         267.0         349.0         424.8         213.5         237.8
[2,]         138.0         129.8         147.5         335.5         215.3         142.3
[3,]         172.3         155.5         174.8         411.8         241.0         128.3
[4,]         163.3         142.8         155.5         494.3         225.5         119.5
[5,]         259.5         257.3         245.3         505.5         308.8         217.0

 GSM131151.CEL      GSM131152.CEL      GSM131153.CEL      GSM131160.CEL
 Min.   :   56.3    Min.   :   67.5    Min.   :   69.5    Min.   :   96.0
 1st Qu.:  144.3    1st Qu.:  143.3    1st Qu.:  157.3    1st Qu.:  303.6
 Median :  212.5    Median :  215.0    Median :  234.8    Median :  414.5
 Mean   :  423.1    Mean   :  437.5    Mean   :  458.4    Mean   :  648.2
 3rd Qu.:  383.5    3rd Qu.:  397.8    3rd Qu.:  426.0    3rd Qu.:  637.0
 Max.   :39818.5    Max.   :39268.0    Max.   :28628.0    Max.   :24854.5

Page 28

Probe intensity behaviors between arrays

Distributions vary widely between experiments

### Summarize the intensity
par(mfrow=c(1,2))    # get a plotting region with 1 row, 2 col
hist(M)              # generate log2 histograms
boxplot(M)           # generate log2 boxplots

[Figure: histograms and boxplots of log intensity]

Page 29

Example: Identification of cis-elements

The on-off switches and rheostats of a cell, operating at the gene level.

They control whether, and how vigorously, genes are transcribed into RNAs.

http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml

Page 30

Motif model: Position Frequency Matrix (PFM)

f_{b,i}: frequency of base b at the i-th position

D’haeseleer (2006) Nature Biotech. 24:423

Page 31

Motif model: Position Weight Matrix (PWM)

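Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)

$$W_{b,i} = \ln \frac{(n_{b,i} + p_b)/(N + 1)}{p_b}$$

where n_{b,i} is the count of base b at position i, N is the number of sites (8 here), and p_b is the background frequency of base b.

Position Frequency Matrix

     1     2     3     4     5
A    8     0     4     4     2
T    0     0     0     2     2
G    0     8     4     2     2
C    0     0     0     0     2

Position Weight Matrix

     1     2     3     4     5
A    1.1  -2.2   0.4   0.4  -0.2
T   -2.2  -2.2  -2.2  -0.2  -0.2
G   -2.2   1.6   1.0   0.3   0.3
C   -2.2  -2.2  -2.2  -2.2   0.3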
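The conversion from the frequency matrix to the weight matrix can be sketched as follows. The weight is taken to be ln of ((n + p_b)/(N + 1)) divided by p_b, a form that reproduces the slide's values to one decimal place in almost every cell; the G entry at position 3 comes out near 0.95 rather than the 1.0 shown. The function names and the example site "AGGAC" are hypothetical.

```python
import math

# Background base frequencies from the slide (Arabidopsis thaliana).
BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}

# Position frequency matrix from the slide: counts per base at positions 1 to 5.
PFM = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}


def pfm_to_pwm(pfm, background):
    """Convert counts to log-odds weights, using the background as a pseudocount."""
    # Number of sites N, assuming every column sums to the same total.
    n_sites = sum(counts[0] for counts in pfm.values())
    return {
        base: [
            math.log((n + background[base]) / (n_sites + 1) / background[base])
            for n in counts
        ]
        for base, counts in pfm.items()
    }


def score_site(pwm, site):
    """Score a candidate site by summing the weight of its base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm = pfm_to_pwm(PFM, BACKGROUND)
for base in "ATGC":
    print(base, [round(w, 1) for w in pwm[base]])
print(round(score_site(pwm, "AGGAC"), 2))  # hypothetical candidate site
```

Scanning a promoter with such a matrix amounts to calling `score_site` on every window of the motif's length and keeping windows above a chosen threshold.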

Page 32

Example: Cis-regulatory logic

Based on a high-confidence set of binding sites: 3,353 interactions between 116 regulators and 1,296 promoters

Harbison et al. (2004) Nature 431:99

Page 33

Identification of putative cis elements

Pearson's correlation coefficient as the similarity measure.
k-means clustering to identify co-regulated genes.
Motifs identified only with AlignACE.

Beer and Tavazoie (2004) Cell 117:185
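The similarity measure named on this slide can be sketched directly. The expression profiles below are hypothetical and the gene names are placeholders; the clustering and motif-finding steps are not shown.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}

for other in ("geneB", "geneC"):
    print("geneA vs", other, round(pearson(profiles["geneA"], profiles[other]), 3))
```

Genes whose profiles correlate strongly are candidates for co-regulation. A profile with no variation would make the denominator zero, so constant profiles need to be filtered out first.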

Page 34

Bayesian network

Bayes' theorem

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Bayesian network

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big)$$

Charniak (1991) Bayesian networks without tears
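Both formulas can be checked on a small example. The network below is hypothetical: two regulators, R1 and R2, are the parents of a target gene T, and every probability is invented for illustration. The joint probability is the product of each variable's probability given its parents, and the posterior is obtained by enumerating all assignments.

```python
from itertools import product

# Hypothetical network structure: R1 and R2 are parents of T.
PARENTS = {"R1": (), "R2": (), "T": ("R1", "R2")}

# Hypothetical P(variable = True | parent values).
CPT = {
    "R1": {(): 0.3},
    "R2": {(): 0.6},
    "T": {
        (True, True): 0.9,
        (True, False): 0.5,
        (False, True): 0.4,
        (False, False): 0.05,
    },
}


def joint_probability(assignment):
    """P(X1, ..., Xn) as the product of P(Xi | parents(Xi))."""
    p = 1.0
    for var, parents in PARENTS.items():
        p_true = CPT[var][tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


def posterior(query, evidence):
    """P(query = True | evidence), computed by enumerating every assignment."""
    names = list(PARENTS)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint_probability(assignment)
        denominator += p
        if assignment[query]:
            numerator += p
    return numerator / denominator


# Probability that R1 is active given that the target gene is expressed.
print(round(posterior("R1", {"T": True}), 4))
```

The same posterior follows from Bayes' theorem: P(T | R1) P(R1) / P(T), with P(T | R1) obtained by summing over R2. Enumeration scales exponentially with the number of variables, so it is only practical for very small networks.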

Page 35

Final example: Relationships between sequences

Sanger and colleagues (1950s): 1st sequence

Insulin from various mammals

Page 36

Trees

An acyclic, undirected graph with nodes and edges

[Figure: two trees over nodes A to I with branch lengths (scale bar: one unit) and a time axis; labels mark operational taxonomic units, ancestral taxonomic units, external branches, and internal branches]

Li 1997. Molecular Evolution. p101

Page 37

Enumerating trees

Suppose there are n OTUs (n ≥ 3)

Bifurcating rooted trees:

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Unrooted trees:

$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

For 10 OTUs
  3.4 x 10^7 possible rooted trees
  2.0 x 10^6 possible unrooted trees

http://w3.uniroma1.it/cogfil/philotrees.jpg
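The counts quoted for 10 OTUs can be reproduced with the standard formulas for bifurcating trees, (2n-3)! / (2^(n-2) (n-2)!) for rooted and (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees. The function names are chosen for this sketch.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees for n OTUs."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees for n OTUs."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For 10 OTUs this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the rounded figures above. The rapid growth is why exhaustive search over tree topologies is infeasible for more than a few sequences.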

Page 38

Impacts of genomics and bioinformatics

New ways to ask and answer questions?
  Hypothesis driven vs. data driven
  A matter of scale
  A matter of integration
  Quantitative emphasis
  Multidisciplinary approaches

How is genomics different from genetics?
  Whole genome approach versus a few genes
  Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion.
  Genetics looks at single genes, one at a time, as a snapshot.
  Genomics is trying to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense.

Page 39

The END
