6/12/2015 © bud mishra, 2002 l11-1 lecture #11: evolution & genome structure special topics in...

51
03/27/22 ©Bud Mishra, 2002 L11-1 Special Topics in Computational Biology Lecture #11: Lecture #11: Evolution & Genome Evolution & Genome Structure Structure ¦ Bud Mishra Professor of Computer Science and Mathematics (Courant, NYU) Professor (Watson School, CSHL) 4 ¦ 9 ¦ 2002

Post on 19-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

04/18/23 ©Bud Mishra, 2002

L11-1

Special Topics in Computational Biology

Lecture #11: Lecture #11: Evolution & Genome Evolution & Genome

StructureStructure¦

Bud MishraProfessor of Computer Science and Mathematics (Courant,

NYU)Professor (Watson School, CSHL)

4 ¦ 9 ¦ 2002

04/18/23 ©Bud Mishra, 2002

L11-2

Genomes & Genes

• Genome (GENe + chromosOME): Collection of genes.

• A gene is contained in an open reading frame– it consists of– INTRONS: Intervening sequences Noncoding

regions– EXONS: Protein coding regions

• Introns are abundant in eukaryotes and certain animal viruses.

04/18/23 ©Bud Mishra, 2002

L11-3

Interrupted Genes:

Intron1 Intron2Intron3

Exon1 Exon2

Transcription

Splicing

DNA

RNAPrimary transcript

mRNA

04/18/23 ©Bud Mishra, 2002

L11-4

Some Genes…

Gene Product Organism ExonLength

#Introns

Intron Length

Adenoshine deaminase Human 1500 11 30,000

Apolipoprotein B Human 14,000 28 29,000

Erythropoietin Human 582 4 1562

Thyroglobulin Human 8500 = 40 100,000

-interferon Human 600 0 0

Fibroin Silk Worm 18,000 1 970

Phaseolin French Bean

1263 5 515

04/18/23 ©Bud Mishra, 2002

L11-5

Regulation of Gene Expressions

• Motifs (short DNA sequences) that regulate transcription– Promoter– Terminator

• Motifs that modulate transcription– Repressor– Activator– Antiterminator

Promoter

GeneTranscriptional

Initiation

Transcriptional

Termination

Terminator

10-35bp

04/18/23 ©Bud Mishra, 2002

L11-6

Motifs

• Each motif is a binding site for a specific protein• Transcription Factor:

– Transcription factors (specific to a cell/environmental conditions) bind to regulatory regions and facilitate

• Assembly of RNA polymerase into a transcriptional complex• Activation of a transcriptional complex.

• Termination Factor:– Assembly of proteins for termination and modification of

the end of the RNA• Epigenetic Changes

– Methylation of the cytosine in the 5’ region– Structural changes in cromatin

04/18/23 ©Bud Mishra, 2002

L11-7

Organization of Genetic Information

• Bacterial Genome:– Genes are closely spaced along the DNA.– The sequences of genes may overlap.– Related genes (encoding enzymes whose

functions are part of the same pathway or whose activities are related) are linked as a single transcription unit.

04/18/23 ©Bud Mishra, 2002

L11-8

Organization of Genetic Information

• Eukaryotic Genome:– Genes are separated by long stretches of

noncoding DNA sequences.– Multiple genes in a single transcription unit is

extremely rare.– Multiple chromosomes – Linear– Chloroplasts and mitochondria – Circular– Genes appearing on the same chromosome are

syntenic.

04/18/23 ©Bud Mishra, 2002

L11-9

Location of Some Genes on Human Chromosome.

Genes chromosomes

-globin cluster 16

-globin cluster 11

Immunoglobulin

(light chain) 2

(light chain) 22

Heavy Chain 14

Pseudogenes 9,32,15,18

Growth Hormone gene cluster

17

Thymidine kinase 17

Genes chromosomes

Insulin 11

Galactokinase 11

Viral oncogene homologues

C-sis 22

C-mos 8

C-Ha-Ras-1 11

C-myb 6

Interferons

& cluster 9

12

04/18/23 ©Bud Mishra, 2002

L11-10

Eukaryotic Genome

• Multiple copies of the same gene– Solve “supply problem”– There are several hundred ribosomal RNA

genes I mammals

• Pseudogenes– Nonfunctional copies of genes…(Deletions or

alterations in the DNA sequence)– Number of pseudo genes for a particular gene

varies greatly…Different from one organism to another.

04/18/23 ©Bud Mishra, 2002

L11-11

Genes in Eukaryotes

• A gene may appear exactly once• It may be part of a family of repeated sequence .

Members of a family may be clustered or dispersed.• Members of a gene family may be related and

functional (expressed at different times in development, or in different cells) or may be pseudo genes.

• Chromosomal Morphology:– Nucleolar organizers (genes for ribosomal RNA)– Telomeric and Centromeric regions (Tandemly repeated

sequences)

04/18/23 ©Bud Mishra, 2002

L11-12

The Rearrangement of DNA Sequences

• Reshuffling of genes between homologous chromosomes via reciprocal crossing-over during both meiosis and mitosis.

• Gene synteny and linkages are usually preserved.

• Most rearrangements are random.• Some rearrangements are normal

processes altering gene expressions in an orderly and programmed manner.

04/18/23 ©Bud Mishra, 2002

L11-13

Chromosomal Aberrations

• Breakage• Translocation (Among non-homologous

chromosomes.)• Formation of acentric and dicentric chromosomes.• Gene Conversions• Amplification and deletions• Point mutations• Jumping genes Transposition of DNA segments• Programmed rearrangements E.g., antibody

responses.

04/18/23 ©Bud Mishra, 2002

L11-14

Repeat Structure

• Copy Number: 2 » 106

• Direct Repeats “head-to-tail”– Tandem repeats or separated by other

sequences

• Inverted Repeats “head-to-head”– Stem-and-loop structure– Hairpin structure

• Reverse Palindrome• True Palindrome

04/18/23 ©Bud Mishra, 2002

L11-15

Repeat Structure

• Tandem Direct Repeats

• Inverted Repeats

• Reverse Palindrome• True Palindrome

5’-AAGAG AAGAG AAGAG-3’

5’-GTCCAGN NCTGGAC-3’ CAGGTCN NGACCTG

G C

A T

C G

C G

T A

G C

Stem-and-loop structureAssociated with inverted

repeats

5’-GAATTC-3’ CTTAAG

5’-GTCAATGA AGTAACTG-3’

04/18/23 ©Bud Mishra, 2002

L11-16

Repeats within the Genome

• Gene Family– Genes and its cognate pseudogenes

• Satellite: Repeats made of noncoding units– Minisatellites: Tandem repeats…Mostly in

centromeric regions– Satellite repeat units vary in length freom 2

base pairs to several thousands.

04/18/23 ©Bud Mishra, 2002

L11-17

Interspersed Repeats

• SINES: Short Interspersed Repeats– Each repeat unit is of length 100 – 500 bps– Processed pseudogenes derived from class III

genes– Example: Alu repeats…dimeric head-to-tail

repeats of 130 bp

• LINES: Long Interspersed Repeats– Each unit is of length > 6 Kb.

04/18/23 ©Bud Mishra, 2002

L11-18

A Genome Grammar

• Consists of– A stochastic grammar specifying target DNA

sequence together with – A description of polymorphisms and– A description of the sampling strategy for

experiments

• h specificationi ! h DNA-Seg i h Poly-Seg i*

h Sample-Seg i+

04/18/23 ©Bud Mishra, 2002

L11-19

Stochastic Grammar

• h DNA-Seg i !“.dna” h DNA-Spec i

• h Poly-Seg i !“.poly” h Weight i+ h Poly-Spec i

• h Sample-Seg i !“.sample” h Sample-spec i

04/18/23 ©Bud Mishra, 2002

L11-20

DNA Sequence

• .dnaA = 150 Ã sequence of length 150—

Pr(A) = Pr(T) = Pr(C) = Pr(G) = ¼

B = A A m(.30) Ã A followed by a mutatedcopy of A---Pr(Mutation) = .30

C » 3-7 p(.2, .3, .3) Ã A string of length 3 to 7,Pr(A) =.2, Pr(T) = .3, Pr(C)=.3, Pr(G) = .2---C = Constant String

D = C m(0.03) n(10,30) Ã m = mutation rate,n = copy number

• S = 30,000,000B m(.05, .10) p(.1,.1,.01) n(10)D !(500)

04/18/23 ©Bud Mishra, 2002

L11-21

Polymorphisms

• Modify the ancestral sequence by a series of– S Point mutation (SNPs)– D Deletions– X Translocations

• .poly .8 .8S 0.00012TD 1-1 .00012D 2-2 .00006D 3-3 .00002D 500-1000 .00005X 1000-2000 .0005

.poly .4S .001D 1-2 .0005

Two haplotypes of .8 each andone haplotype of weight .4

04/18/23 ©Bud Mishra, 2002

L11-22

Sampling

• .sample48,000 Ã Number of Samples400 600 .5 Ã Read Lengths.01 .02 Ã Sequence Read Errors.33 .33 Ã Failure of Read.3 1800 2200 .005 Ã Clone size

• .sample12,000400 600 .5.01 .03.33 .33.4 9000 11000 .015

04/18/23 ©Bud Mishra, 2002

L11-23

Experiment

• First sample generate 48,000 end reads from inserts of average length 2 Kbp.– Sample proportions: 40% from haplotype H1,

40% from H2 and 20% from H3

• Second sample generates 12,000 end reads from inserts of average length 10 Kbp.– Sample proportions: 40% from haplotype H1,

40% from H2 and 20% from H3

04/18/23 ©Bud Mishra, 2002

L11-24

¢ Break ¢

10 minutes

04/18/23 ©Bud Mishra, 2002

L11-25

Simple Random Walk Model

• DNA Walk Recipe:– Group the four constituent bases

into two families U and D, characterized by similar chemical structure with approximately the same fraction ½.

– Imagine that time t corresponds to the length covered when counting the bases in one direction along a DNA chain (say 5’ to 3’).

– When one encounters the family U (resp. D), the random walk makes a +1 (resp –1) step along 1D-line.

Binary DNA Walk

04/18/23 ©Bud Mishra, 2002

L11-26

Random Walk

• For example:– The family U could

be pyrimidines: C & T

– The family D could be purines: A & G

Binary DNA Walk

04/18/23 ©Bud Mishra, 2002

L11-27

Long Range Correlations

• Each step is modeled as a Bernoulli random variable, li, Pr[li=1] = Pr[li=0] = ½.

• The random walk can be modeled by the sum of random variables li, resulting in the

• Random variable s(t) = i=0t li,

– (with Binomial distribution » S(t, ½) under very general conditions)

– The correlation function Cij is defined as

Cij = h li lj i - h li i h lj i– If the process is stationary then Cij only depends upon

n = j –i.

04/18/23 ©Bud Mishra, 2002

L11-28

• Thus C(n) can be estimated as C(n) = 1/(N-n) i=1

N-n li li+n – [(1/N) i=1n li]2

• Considering the variance of s(t), we have

h [s(t)]2 i = t [C(0) + 2 k=1t C(k)] – 2 k=1

t k C(k)

• If C(k) ¿ k-1 then the integral s1t dk C(k)

(approximating the discrete sum by the Poisson formula) converges absolutely &

h [s(t)]2 i / t.

Long Range Correlations

04/18/23 ©Bud Mishra, 2002

L11-29

Super-diffusive behavior

• Imagine the opposite situation when C(k) decays slower than k-1…– For instance C(k) / k-y, with 0 5 y < 1.– Then s1

t dk C(k) / t1-y

• Super-diffusive behavior:h [s(t)]2 i / t2-y for y < 1.

• Diffusion is thus enhanced by the long-range correlations and the typical value of s(t) is of the order of t1-y/2 À t1/2

– The exponent (1 – y/2) is often written as H = the Hurst Exponent…

04/18/23 ©Bud Mishra, 2002

L11-30

Interpretation of Hurst Exponent

• The value of resulting H = Hurst Exponent (0<H<1), implies the strength of LRC (long range correlation) in the series. For infinite series:• H=0.5: Brownian motion, no long-range correlation;

• H<0.5: Antipersistent Fractional Brownian motion, with negative feedback- like mechanism

• H>0.5: Persistent Fractional Brownian motion, with positive feedback-like mechanism

04/18/23 ©Bud Mishra, 2002

L11-31

Random Walk Model with Repetitions

• Polya’s Urn Model:

F’s: functions deciding probability distributions

F1 (decide an initial position)

F2 (decide selected length)F3 (decide copy number)

k copies

F4 (decide insertion positions)

04/18/23 ©Bud Mishra, 2002

L11-32

Toy System

• Start sequence: completely random (Brownian motion).

• F1 (initial position): uniformly random ditribution;

• F2 (length selection): normal distribution;

• F3 (copy number): Poisson distribution;

• F4 (insertion position): Power-law distribution.

log(Pr(posinsert=K))=-c £ log(K-posinitial )+b;

04/18/23 ©Bud Mishra, 2002

L11-33

A Simple Polya’s Urn

• Polya’s Urn or Contagion Model:• An Urn containing

– b black balls, and– r red balls.

• The event X is the result of the extraction of a ball from the urn:– X = 1, if red, X=0 if black

• After extracting a ball, replace the ball extracted from the urn and add to the urn c many balls of the same color as that of the ball just extracted.

• If b=r and c=0, then the events X=1 and X=0 occur equi-probably and independently.

04/18/23 ©Bud Mishra, 2002

L11-34

Exchangeable Systems

• Note that Pr[1,1,0,1]

= [r/(r+b)] [(r+c)/(r+b+c)] [b /(r+b+2c)] [(r+2c)/(r+b+3c)]

= [b/(r+b)] [r/(r+b+c)] [(r+c) /(r+b+2c)] [(r+2c)/(r+b+3c)]

= Pr[0,1,1,1] = Pr[1,1,1,0]

• Such a system is called an exchangeable system, as

Pr[X1, X2, …, Xn] = Pr[X(1), X(2), …, X(n)]

04/18/23 ©Bud Mishra, 2002

L11-35

ERV Exchangeable Random

Variables• A set of random variables = {X1, X2, …,

XN} is defined to be “exchangeable” if every condition concerning n of the Xi has the same probability, no matter how the Xi are chosen or labeled.

Pr(X1, …, Xn) = Pr(Xi1, …, Xin)

• The joint distribution is invariant under an arbitrary permutation of the indexes of its arguments …

04/18/23 ©Bud Mishra, 2002

L11-36

De Finetti Representation Theorem

• Provided that the set {X1, …, Xn,…} of exchangeable variable is infinite, the probability to have k successes followed by (n-k) failures is

Pr[X1 = 1, …, Xk = 1, Xk+1 = 0, …, Xn = 0]

= s01 pk (1-p)n-k dF[p]

• where F[p] is a probability distribution, called the mixing distribution of ERV.

04/18/23 ©Bud Mishra, 2002

L11-37

Walks

• If we define sn = i=1n Xi to be the partial

sum of n variables then

Pr[sn = k]

= C(n,k) s01 pk (1-p)n-k dF[p]

= s01 Binp(n,k) dF[p]

• A randomized binomial distribution…

04/18/23 ©Bud Mishra, 2002

L11-38

Beta ERV

• Because of exchangeability, we havePr[Xn = 1] = Pr[X1 = 1] = s0

1 p dF[p]

• Thus, the probability to have a success at the mth step, given that there have been k successes in the previous n steps (m > n ¸ k) isPr[Xm =1 | sn = i=1

n Xi = k] = N s01 pk+1 (1-p)n-k

dF[p].• The normalization factor is given by

N-1 = s01 pk (1-p)n-k dF[p].

04/18/23 ©Bud Mishra, 2002

L11-39

Beta ERV

• One can compute the mixing distribution for ERV’s: It is a beta distribution:dF[p] = [( + )/(()())] (1-p) –1 p-1 dp

´ , (p) dp.• It is the family closed under conditional

expectations. • “xij | sn = k” define a new set of ERV’s with

mixing distribution, F*[p], a beta distribution.

dF*[p] = +n-k, +k (p) dp.

04/18/23 ©Bud Mishra, 2002

L11-40

Auto-Correlation Functions

• Auto-correlation functions:C(t1, t2) = h Xt1 Xt2 i - h Xt1 i h Xt2 i

C(t1, t2) = s01 p2 dF[p] – ( s0

1 p dF[p])2…

• The squared variance of F[p] gives rise to long range correlation…

C(k) = constant & h [s(t)]2] i / t2

Hurst Exponent ¼ 1

04/18/23 ©Bud Mishra, 2002

L11-41

Detrended Fluctuation Analysis: (DFA)

Whole Genomic Sequence

Coding Sequence

Noncoding Sequence

ATTCGACCGTT…1-1-1-1 1 1-1-1 1–1-1…‘Binarize’:

Randomly select n data windows of size l

1 2 3 4 5 6 7 …… n 1 2 3 4 5 6 7 8 …… n

l l l l l l l …… l l l l l l l l l …… l

An analysis to test long-range correlation

04/18/23 ©Bud Mishra, 2002

L11-42

Main Ideas

1 2 3 4 5 6 7 …… n 1 2 3 4 5 6 7 8 …… n

l l l l l l l …… l l l l l l l l l …… l

•In the kth data windowof size l:•uk(i) = +1 (A/G at ith bp position in the kth window)•uk(i) = -1 (C/T at ith bp position in the kth window)

Mkl = (1/l) i=1

l uk(i); Yk

l(j) = i=1j uk(i) –j £ Mk

l ; Fk(l)2 = (1/l) j=1

l Ykl(j)2

•F(l)2 = (1/n) k=1n Fk(l)2

(n = the total number of windows)•F(l) / l ) log F(l) = log l + const

04/18/23 ©Bud Mishra, 2002

L11-43

DFA Analysis Results

E.coli K12

04/18/23 ©Bud Mishra, 2002

L11-44

DFA Analysis Results

Yeast

04/18/23 ©Bud Mishra, 2002

L11-45

DFA Analysis Results

Drosophila

04/18/23 ©Bud Mishra, 2002

L11-46

DFA Analysis Results

H. Sapien

04/18/23 ©Bud Mishra, 2002

L11-47

Preliminary Conclusions

• Long-range correlation in noncoding region > long-range correlation in coding region

• Long-range correlation in higher organisms > long-range correlation in lower organisms

• There is a correlation transition point in coding sequences

04/18/23 ©Bud Mishra, 2002

L11-48

Summary of H Values

Organism

E. coli K12 S. cerevisiae Drosophila

Human

Expected Value in Brownian motion

Region coding

non-coding

coding non-coding

coding coding

H value 0.5811

0.6088

0.6081

0.6650

0.6186 0.6322 0.5709+

Significance*

P>0.05

P>0.05

P>0.05

P<0.001

P<0.001 P<0.001

 

Correction of H values from RS analysis:

04/18/23 ©Bud Mishra, 2002

L11-49

What have we found out about DNA LRC?

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

E.colicoding

E.coli non-coding

Yeastcoding

Yeast non-coding

Non-coding region has higher long-range correlation than coding region

04/18/23 ©Bud Mishra, 2002

L11-50

What have we found out about DNA LRC?

Higher organism has higher long-range correlation than lower organism (The plot shows the H value in the coding regions of different organisms)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

E.coli Yeast Fly Human

Co

rre

cte

d H

va

lue

04/18/23 ©Bud Mishra, 2002

L11-51

Hypotheses

• Possible causes of long-range correlation in DNA sequences:

• Possible reason causing the correlation level differences between coding and noncoding region: -Difference in the force of natural selection-Presence of a DNA repair mechanism that is specific to the coding regions – transcription-coupled DNA repair

•Natural selection

•DNA repair mechanisms

•Replication mistakes (DNA polymerase stuttering)

•Random deletion/insertion (transposons, retroviruses)

Correlation