6/12/2015 © bud mishra, 2002 l11-1 lecture #11: evolution & genome structure special topics in...
Post on 19-Dec-2015
214 views
TRANSCRIPT
04/18/23 ©Bud Mishra, 2002
L11-1
Special Topics in Computational Biology
Lecture #11: Lecture #11: Evolution & Genome Evolution & Genome
StructureStructure¦
Bud MishraProfessor of Computer Science and Mathematics (Courant,
NYU)Professor (Watson School, CSHL)
4 ¦ 9 ¦ 2002
04/18/23 ©Bud Mishra, 2002
L11-2
Genomes & Genes
• Genome (GENe + chromosOME): Collection of genes.
• A gene is contained in an open reading frame– it consists of– INTRONS: Intervening sequences Noncoding
regions– EXONS: Protein coding regions
• Introns are abundant in eukaryotes and certain animal viruses.
04/18/23 ©Bud Mishra, 2002
L11-3
Interrupted Genes:
Intron1 Intron2Intron3
Exon1 Exon2
Transcription
Splicing
DNA
RNAPrimary transcript
mRNA
04/18/23 ©Bud Mishra, 2002
L11-4
Some Genes…
Gene Product Organism ExonLength
#Introns
Intron Length
Adenoshine deaminase Human 1500 11 30,000
Apolipoprotein B Human 14,000 28 29,000
Erythropoietin Human 582 4 1562
Thyroglobulin Human 8500 = 40 100,000
-interferon Human 600 0 0
Fibroin Silk Worm 18,000 1 970
Phaseolin French Bean
1263 5 515
04/18/23 ©Bud Mishra, 2002
L11-5
Regulation of Gene Expressions
• Motifs (short DNA sequences) that regulate transcription– Promoter– Terminator
• Motifs that modulate transcription– Repressor– Activator– Antiterminator
Promoter
GeneTranscriptional
Initiation
Transcriptional
Termination
Terminator
10-35bp
04/18/23 ©Bud Mishra, 2002
L11-6
Motifs
• Each motif is a binding site for a specific protein• Transcription Factor:
– Transcription factors (specific to a cell/environmental conditions) bind to regulatory regions and facilitate
• Assembly of RNA polymerase into a transcriptional complex• Activation of a transcriptional complex.
• Termination Factor:– Assembly of proteins for termination and modification of
the end of the RNA• Epigenetic Changes
– Methylation of the cytosine in the 5’ region– Structural changes in cromatin
04/18/23 ©Bud Mishra, 2002
L11-7
Organization of Genetic Information
• Bacterial Genome:– Genes are closely spaced along the DNA.– The sequences of genes may overlap.– Related genes (encoding enzymes whose
functions are part of the same pathway or whose activities are related) are linked as a single transcription unit.
04/18/23 ©Bud Mishra, 2002
L11-8
Organization of Genetic Information
• Eukaryotic Genome:– Genes are separated by long stretches of
noncoding DNA sequences.– Multiple genes in a single transcription unit is
extremely rare.– Multiple chromosomes – Linear– Chloroplasts and mitochondria – Circular– Genes appearing on the same chromosome are
syntenic.
04/18/23 ©Bud Mishra, 2002
L11-9
Location of Some Genes on Human Chromosome.
Genes chromosomes
-globin cluster 16
-globin cluster 11
Immunoglobulin
(light chain) 2
(light chain) 22
Heavy Chain 14
Pseudogenes 9,32,15,18
Growth Hormone gene cluster
17
Thymidine kinase 17
Genes chromosomes
Insulin 11
Galactokinase 11
Viral oncogene homologues
C-sis 22
C-mos 8
C-Ha-Ras-1 11
C-myb 6
Interferons
& cluster 9
12
04/18/23 ©Bud Mishra, 2002
L11-10
Eukaryotic Genome
• Multiple copies of the same gene– Solve “supply problem”– There are several hundred ribosomal RNA
genes I mammals
• Pseudogenes– Nonfunctional copies of genes…(Deletions or
alterations in the DNA sequence)– Number of pseudo genes for a particular gene
varies greatly…Different from one organism to another.
04/18/23 ©Bud Mishra, 2002
L11-11
Genes in Eukaryotes
• A gene may appear exactly once• It may be part of a family of repeated sequence .
Members of a family may be clustered or dispersed.• Members of a gene family may be related and
functional (expressed at different times in development, or in different cells) or may be pseudo genes.
• Chromosomal Morphology:– Nucleolar organizers (genes for ribosomal RNA)– Telomeric and Centromeric regions (Tandemly repeated
sequences)
04/18/23 ©Bud Mishra, 2002
L11-12
The Rearrangement of DNA Sequences
• Reshuffling of genes between homologous chromosomes via reciprocal crossing-over during both meiosis and mitosis.
• Gene synteny and linkages are usually preserved.
• Most rearrangements are random.• Some rearrangements are normal
processes altering gene expressions in an orderly and programmed manner.
04/18/23 ©Bud Mishra, 2002
L11-13
Chromosomal Aberrations
• Breakage• Translocation (Among non-homologous
chromosomes.)• Formation of acentric and dicentric chromosomes.• Gene Conversions• Amplification and deletions• Point mutations• Jumping genes Transposition of DNA segments• Programmed rearrangements E.g., antibody
responses.
04/18/23 ©Bud Mishra, 2002
L11-14
Repeat Structure
• Copy Number: 2 » 106
• Direct Repeats “head-to-tail”– Tandem repeats or separated by other
sequences
• Inverted Repeats “head-to-head”– Stem-and-loop structure– Hairpin structure
• Reverse Palindrome• True Palindrome
04/18/23 ©Bud Mishra, 2002
L11-15
Repeat Structure
• Tandem Direct Repeats
• Inverted Repeats
• Reverse Palindrome• True Palindrome
5’-AAGAG AAGAG AAGAG-3’
5’-GTCCAGN NCTGGAC-3’ CAGGTCN NGACCTG
G C
A T
C G
C G
T A
G C
Stem-and-loop structureAssociated with inverted
repeats
5’-GAATTC-3’ CTTAAG
5’-GTCAATGA AGTAACTG-3’
04/18/23 ©Bud Mishra, 2002
L11-16
Repeats within the Genome
• Gene Family– Genes and its cognate pseudogenes
• Satellite: Repeats made of noncoding units– Minisatellites: Tandem repeats…Mostly in
centromeric regions– Satellite repeat units vary in length freom 2
base pairs to several thousands.
04/18/23 ©Bud Mishra, 2002
L11-17
Interspersed Repeats
• SINES: Short Interspersed Repeats– Each repeat unit is of length 100 – 500 bps– Processed pseudogenes derived from class III
genes– Example: Alu repeats…dimeric head-to-tail
repeats of 130 bp
• LINES: Long Interspersed Repeats– Each unit is of length > 6 Kb.
04/18/23 ©Bud Mishra, 2002
L11-18
A Genome Grammar
• Consists of– A stochastic grammar specifying target DNA
sequence together with – A description of polymorphisms and– A description of the sampling strategy for
experiments
• h specificationi ! h DNA-Seg i h Poly-Seg i*
h Sample-Seg i+
04/18/23 ©Bud Mishra, 2002
L11-19
Stochastic Grammar
• h DNA-Seg i !“.dna” h DNA-Spec i
• h Poly-Seg i !“.poly” h Weight i+ h Poly-Spec i
• h Sample-Seg i !“.sample” h Sample-spec i
04/18/23 ©Bud Mishra, 2002
L11-20
DNA Sequence
• .dnaA = 150 Ã sequence of length 150—
Pr(A) = Pr(T) = Pr(C) = Pr(G) = ¼
B = A A m(.30) Ã A followed by a mutatedcopy of A---Pr(Mutation) = .30
C » 3-7 p(.2, .3, .3) Ã A string of length 3 to 7,Pr(A) =.2, Pr(T) = .3, Pr(C)=.3, Pr(G) = .2---C = Constant String
D = C m(0.03) n(10,30) Ã m = mutation rate,n = copy number
• S = 30,000,000B m(.05, .10) p(.1,.1,.01) n(10)D !(500)
04/18/23 ©Bud Mishra, 2002
L11-21
Polymorphisms
• Modify the ancestral sequence by a series of– S Point mutation (SNPs)– D Deletions– X Translocations
• .poly .8 .8S 0.00012TD 1-1 .00012D 2-2 .00006D 3-3 .00002D 500-1000 .00005X 1000-2000 .0005
.poly .4S .001D 1-2 .0005
Two haplotypes of .8 each andone haplotype of weight .4
04/18/23 ©Bud Mishra, 2002
L11-22
Sampling
• .sample48,000 Ã Number of Samples400 600 .5 Ã Read Lengths.01 .02 Ã Sequence Read Errors.33 .33 Ã Failure of Read.3 1800 2200 .005 Ã Clone size
• .sample12,000400 600 .5.01 .03.33 .33.4 9000 11000 .015
04/18/23 ©Bud Mishra, 2002
L11-23
Experiment
• First sample generate 48,000 end reads from inserts of average length 2 Kbp.– Sample proportions: 40% from haplotype H1,
40% from H2 and 20% from H3
• Second sample generates 12,000 end reads from inserts of average length 10 Kbp.– Sample proportions: 40% from haplotype H1,
40% from H2 and 20% from H3
04/18/23 ©Bud Mishra, 2002
L11-25
Simple Random Walk Model
• DNA Walk Recipe:– Group the four constituent bases
into two families U and D, characterized by similar chemical structure with approximately the same fraction ½.
– Imagine that time t corresponds to the length covered when counting the bases in one direction along a DNA chain (say 5’ to 3’).
– When one encounters the family U (resp. D), the random walk makes a +1 (resp –1) step along 1D-line.
Binary DNA Walk
04/18/23 ©Bud Mishra, 2002
L11-26
Random Walk
• For example:– The family U could
be pyrimidines: C & T
– The family D could be purines: A & G
Binary DNA Walk
04/18/23 ©Bud Mishra, 2002
L11-27
Long Range Correlations
• Each step is modeled as a Bernoulli random variable, li, Pr[li=1] = Pr[li=0] = ½.
• The random walk can be modeled by the sum of random variables li, resulting in the
• Random variable s(t) = i=0t li,
– (with Binomial distribution » S(t, ½) under very general conditions)
– The correlation function Cij is defined as
Cij = h li lj i - h li i h lj i– If the process is stationary then Cij only depends upon
n = j –i.
04/18/23 ©Bud Mishra, 2002
L11-28
• Thus C(n) can be estimated as C(n) = 1/(N-n) i=1
N-n li li+n – [(1/N) i=1n li]2
• Considering the variance of s(t), we have
h [s(t)]2 i = t [C(0) + 2 k=1t C(k)] – 2 k=1
t k C(k)
• If C(k) ¿ k-1 then the integral s1t dk C(k)
(approximating the discrete sum by the Poisson formula) converges absolutely &
h [s(t)]2 i / t.
Long Range Correlations
04/18/23 ©Bud Mishra, 2002
L11-29
Super-diffusive behavior
• Imagine the opposite situation when C(k) decays slower than k-1…– For instance C(k) / k-y, with 0 5 y < 1.– Then s1
t dk C(k) / t1-y
• Super-diffusive behavior:h [s(t)]2 i / t2-y for y < 1.
• Diffusion is thus enhanced by the long-range correlations and the typical value of s(t) is of the order of t1-y/2 À t1/2
– The exponent (1 – y/2) is often written as H = the Hurst Exponent…
04/18/23 ©Bud Mishra, 2002
L11-30
Interpretation of Hurst Exponent
• The value of resulting H = Hurst Exponent (0<H<1), implies the strength of LRC (long range correlation) in the series. For infinite series:• H=0.5: Brownian motion, no long-range correlation;
• H<0.5: Antipersistent Fractional Brownian motion, with negative feedback- like mechanism
• H>0.5: Persistent Fractional Brownian motion, with positive feedback-like mechanism
04/18/23 ©Bud Mishra, 2002
L11-31
Random Walk Model with Repetitions
• Polya’s Urn Model:
F’s: functions deciding probability distributions
F1 (decide an initial position)
F2 (decide selected length)F3 (decide copy number)
k copies
F4 (decide insertion positions)
04/18/23 ©Bud Mishra, 2002
L11-32
Toy System
• Start sequence: completely random (Brownian motion).
• F1 (initial position): uniformly random ditribution;
• F2 (length selection): normal distribution;
• F3 (copy number): Poisson distribution;
• F4 (insertion position): Power-law distribution.
log(Pr(posinsert=K))=-c £ log(K-posinitial )+b;
04/18/23 ©Bud Mishra, 2002
L11-33
A Simple Polya’s Urn
• Polya’s Urn or Contagion Model:• An Urn containing
– b black balls, and– r red balls.
• The event X is the result of the extraction of a ball from the urn:– X = 1, if red, X=0 if black
• After extracting a ball, replace the ball extracted from the urn and add to the urn c many balls of the same color as that of the ball just extracted.
• If b=r and c=0, then the events X=1 and X=0 occur equi-probably and independently.
04/18/23 ©Bud Mishra, 2002
L11-34
Exchangeable Systems
• Note that Pr[1,1,0,1]
= [r/(r+b)] [(r+c)/(r+b+c)] [b /(r+b+2c)] [(r+2c)/(r+b+3c)]
= [b/(r+b)] [r/(r+b+c)] [(r+c) /(r+b+2c)] [(r+2c)/(r+b+3c)]
= Pr[0,1,1,1] = Pr[1,1,1,0]
• Such a system is called an exchangeable system, as
Pr[X1, X2, …, Xn] = Pr[X(1), X(2), …, X(n)]
04/18/23 ©Bud Mishra, 2002
L11-35
ERV Exchangeable Random
Variables• A set of random variables = {X1, X2, …,
XN} is defined to be “exchangeable” if every condition concerning n of the Xi has the same probability, no matter how the Xi are chosen or labeled.
Pr(X1, …, Xn) = Pr(Xi1, …, Xin)
• The joint distribution is invariant under an arbitrary permutation of the indexes of its arguments …
04/18/23 ©Bud Mishra, 2002
L11-36
De Finetti Representation Theorem
• Provided that the set {X1, …, Xn,…} of exchangeable variable is infinite, the probability to have k successes followed by (n-k) failures is
Pr[X1 = 1, …, Xk = 1, Xk+1 = 0, …, Xn = 0]
= s01 pk (1-p)n-k dF[p]
• where F[p] is a probability distribution, called the mixing distribution of ERV.
04/18/23 ©Bud Mishra, 2002
L11-37
Walks
• If we define sn = i=1n Xi to be the partial
sum of n variables then
Pr[sn = k]
= C(n,k) s01 pk (1-p)n-k dF[p]
= s01 Binp(n,k) dF[p]
• A randomized binomial distribution…
04/18/23 ©Bud Mishra, 2002
L11-38
Beta ERV
• Because of exchangeability, we havePr[Xn = 1] = Pr[X1 = 1] = s0
1 p dF[p]
• Thus, the probability to have a success at the mth step, given that there have been k successes in the previous n steps (m > n ¸ k) isPr[Xm =1 | sn = i=1
n Xi = k] = N s01 pk+1 (1-p)n-k
dF[p].• The normalization factor is given by
N-1 = s01 pk (1-p)n-k dF[p].
04/18/23 ©Bud Mishra, 2002
L11-39
Beta ERV
• One can compute the mixing distribution for ERV’s: It is a beta distribution:dF[p] = [( + )/(()())] (1-p) –1 p-1 dp
´ , (p) dp.• It is the family closed under conditional
expectations. • “xij | sn = k” define a new set of ERV’s with
mixing distribution, F*[p], a beta distribution.
dF*[p] = +n-k, +k (p) dp.
04/18/23 ©Bud Mishra, 2002
L11-40
Auto-Correlation Functions
• Auto-correlation functions:C(t1, t2) = h Xt1 Xt2 i - h Xt1 i h Xt2 i
C(t1, t2) = s01 p2 dF[p] – ( s0
1 p dF[p])2…
• The squared variance of F[p] gives rise to long range correlation…
C(k) = constant & h [s(t)]2] i / t2
Hurst Exponent ¼ 1
04/18/23 ©Bud Mishra, 2002
L11-41
Detrended Fluctuation Analysis: (DFA)
Whole Genomic Sequence
Coding Sequence
Noncoding Sequence
ATTCGACCGTT…1-1-1-1 1 1-1-1 1–1-1…‘Binarize’:
Randomly select n data windows of size l
1 2 3 4 5 6 7 …… n 1 2 3 4 5 6 7 8 …… n
l l l l l l l …… l l l l l l l l l …… l
An analysis to test long-range correlation
04/18/23 ©Bud Mishra, 2002
L11-42
Main Ideas
1 2 3 4 5 6 7 …… n 1 2 3 4 5 6 7 8 …… n
l l l l l l l …… l l l l l l l l l …… l
•In the kth data windowof size l:•uk(i) = +1 (A/G at ith bp position in the kth window)•uk(i) = -1 (C/T at ith bp position in the kth window)
Mkl = (1/l) i=1
l uk(i); Yk
l(j) = i=1j uk(i) –j £ Mk
l ; Fk(l)2 = (1/l) j=1
l Ykl(j)2
•F(l)2 = (1/n) k=1n Fk(l)2
(n = the total number of windows)•F(l) / l ) log F(l) = log l + const
04/18/23 ©Bud Mishra, 2002
L11-47
Preliminary Conclusions
• Long-range correlation in noncoding region > long-range correlation in coding region
• Long-range correlation in higher organisms > long-range correlation in lower organisms
• There is a correlation transition point in coding sequences
04/18/23 ©Bud Mishra, 2002
L11-48
Summary of H Values
Organism
E. coli K12 S. cerevisiae Drosophila
Human
Expected Value in Brownian motion
Region coding
non-coding
coding non-coding
coding coding
H value 0.5811
0.6088
0.6081
0.6650
0.6186 0.6322 0.5709+
Significance*
P>0.05
P>0.05
P>0.05
P<0.001
P<0.001 P<0.001
Correction of H values from RS analysis:
04/18/23 ©Bud Mishra, 2002
L11-49
What have we found out about DNA LRC?
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
E.colicoding
E.coli non-coding
Yeastcoding
Yeast non-coding
Non-coding region has higher long-range correlation than coding region
04/18/23 ©Bud Mishra, 2002
L11-50
What have we found out about DNA LRC?
Higher organism has higher long-range correlation than lower organism (The plot shows the H value in the coding regions of different organisms)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
E.coli Yeast Fly Human
Co
rre
cte
d H
va
lue
04/18/23 ©Bud Mishra, 2002
L11-51
Hypotheses
• Possible causes of long-range correlation in DNA sequences:
• Possible reason causing the correlation level differences between coding and noncoding region: -Difference in the force of natural selection-Presence of a DNA repair mechanism that is specific to the coding regions – transcription-coupled DNA repair
•Natural selection
•DNA repair mechanisms
•Replication mistakes (DNA polymerase stuttering)
•Random deletion/insertion (transposons, retroviruses)
Correlation