evolutionsssykim/teaching/f11/slides/lecture11.pdf · 2011. 10. 4. · correcting for ascertainment...
Evolution
02-715 Advanced Topics in Computational Genomics
Ascertainment Bias
• SNP discovery phase
– Assume SNPs have been ascertained in an alignment of different sequences of fixed depth d
– The final sample size is n
– Ascertainment condition: the locus was variable in the ascertainment sample
Correcting for Ascertainment Bias
• Likelihood for allele frequencies after conditioning on ascertainment (i.e., unobserved true allele frequencies)
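A minimal sketch of the conditioning step, assuming the ascertainment sample is a random subsample of d chromosomes drawn from the n typed chromosomes (so d ≥ 2 and d ≤ n). The function names are hypothetical and the weighting is in the spirit of the correction discussed on this slide, not a transcription of it: each observed frequency class is divided by the probability that a locus in that class would have been variable in the ascertainment sample, then the spectrum is renormalized.

```python
from math import comb


def prob_ascertained(i: int, n: int, d: int) -> float:
    """P(locus is variable in a random subsample of d chromosomes),
    given that i of the n typed chromosomes carry the derived allele.

    Assumption: the d ascertainment chromosomes are a subset of the n.
    """
    # The subsample is monomorphic if all d draws are derived or all are ancestral.
    monomorphic = (comb(i, d) + comb(n - i, d)) / comb(n, d)
    return 1.0 - monomorphic


def correct_spectrum(observed_counts, n, d):
    """Ascertainment-corrected frequency spectrum.

    observed_counts[i - 1] is the number of SNPs with i derived alleles,
    for i = 1 .. n - 1. Returns estimated proportions for the same classes.
    """
    weights = [
        count / prob_ascertained(i, n, d)
        for i, count in enumerate(observed_counts, start=1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Toy numbers: n = 4 typed chromosomes, discovery depth d = 2.
    print(correct_spectrum([30, 40, 30], n=4, d=2))
```

Rare and very common alleles are the least likely to be variable in a small discovery panel, so they receive the largest upward weights.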
Various Extensions
• Variation in d
– d is not known, but we know the distribution of d among loci
• Allele frequencies in the ascertainment sample are unknown
– The ascertainment sample may not have been included in the final typed sample
– Ascertainment condition: the locus is variable in the ascertainment sample and variable in the typed sample
Correcting for Ascertainment Bias (Nielsen et al., 2004)
• Illustration through simulation study (20 genes, 10,000 SNPs, 5 genes for ascertainment)
Ascertainment Bias from HapMap Analysis
Cross-species Sequence Analysis
• Functional regions of genomes are conserved across species
• Cross-species sequence conservation is believed to occur because of negative (purifying) selection
• About 5% or more of bases in mammalian genomes are under purifying selection
• Protein-coding sequence accounts for only about 1.5% of the genome, so most bases under purifying selection lie outside coding regions
Phylo-HMM
• Parse aligned sequences into two classes
– Conserved vs. nonconserved
• Maximum likelihood estimation of the parameters of the Phylo-HMM
Phylo-HMM
• μ, ν: transition probabilities
• States
– c: conserved region
– n: non-conserved region
• ψn, ψc: emission probabilities as a tree
Phylo-HMM
• ψn, ψc: emission probabilities as a phylogenetic model
– Identical phylogenetic model structure for the two states
– ρ: scaling factor for branch length, 0 ≤ ρ ≤ 1
  • Average substitution rate
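A toy sketch of the two-state model, under loudly simplified assumptions that are not taken from the slides: a star-shaped tree instead of a real phylogeny, the Jukes-Cantor substitution model, μ read as the c → n transition probability and ν as n → c, and a stationary starting distribution. The conserved state reuses the same tree with every branch length multiplied by ρ. Forward-backward then gives the posterior probability that each alignment column is conserved. No rescaling is done, so this only suits short alignments.

```python
import math

BASES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability that base a is replaced by b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def column_likelihood(column, branch_lengths, scale):
    """Likelihood of one alignment column on a star tree (toy stand-in for psi)."""
    total = 0.0
    for root in BASES:
        p = 0.25  # uniform root distribution under Jukes-Cantor
        for base, t in zip(column, branch_lengths):
            p *= jc69(root, base, scale * t)
        total += p
    return total


def posterior_conserved(columns, branch_lengths, rho, mu, nu):
    """Forward-backward posterior P(state = c | alignment) for every column."""
    # State 0 = conserved (c), state 1 = non-conserved (n).
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]
    init = [nu / (mu + nu), mu / (mu + nu)]  # stationary distribution
    emit = [
        [column_likelihood(col, branch_lengths, rho),   # psi_c: branches scaled by rho
         column_likelihood(col, branch_lengths, 1.0)]   # psi_n: unscaled branches
        for col in columns
    ]
    length = len(columns)
    fwd = [[0.0, 0.0] for _ in range(length)]
    bwd = [[1.0, 1.0] for _ in range(length)]
    for s in range(2):
        fwd[0][s] = init[s] * emit[0][s]
    for i in range(1, length):
        for s in range(2):
            fwd[i][s] = emit[i][s] * sum(fwd[i - 1][r] * trans[r][s] for r in range(2))
    for i in range(length - 2, -1, -1):
        for s in range(2):
            bwd[i][s] = sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in range(2))
    posteriors = []
    for i in range(length):
        norm = fwd[i][0] * bwd[i][0] + fwd[i][1] * bwd[i][1]
        posteriors.append(fwd[i][0] * bwd[i][0] / norm)
    return posteriors


if __name__ == "__main__":
    cols = ["AAA", "CCC", "GGG", "TTT", "AAA", "CCC", "GGG", "TTT", "AAA", "CCC",
            "ACG", "TGA", "CAT", "GTC", "ACT"]
    for col, p in zip(cols, posterior_conserved(cols, [0.3, 0.3, 0.3], 0.3, 0.1, 0.1)):
        print(col, round(p, 3))
```

A real Phylo-HMM would replace `column_likelihood` with Felsenstein's pruning algorithm on the assumed topology and estimate ρ, μ, ν and the branch lengths by maximum likelihood.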
Datasets
• Vertebrate species
– Human, mouse, rat, chicken, Fugu rubripes
– Alignment with human as reference sequence
• Insect species
– Three species of Drosophila and Anopheles gambiae
– Alignment with D. melanogaster as reference sequence
• Two species of Caenorhabditis
– Alignment with C. elegans as reference sequence
• Seven species of Saccharomyces
– Alignment with S. cerevisiae as reference sequence
Phylogenetic Models: Assumed Topologies and Estimated Branch Lengths
Estimated Conserved Elements
• More complex organisms have more conserved regions outside of coding regions
Figure panels: Vertebrate, Insect, Worm, Yeast
Conservation Around GRIA2 in Human
Extreme Conservation
• Extreme conservation at the 3’ end of the ELAVL4 gene
Key Observations
• Conserved regions
– 3%-8% of the human genome conserved in vertebrates and other mammals
– 37-53% in D. melanogaster
– 18-37% in C. elegans
– 47-68% in S. cerevisiae
• Highly conserved elements (HCEs)
– 42% of HCEs overlap with exons in vertebrate genomes
– >93% for insects, worms, yeasts
• Extreme conservation in 3’ UTRs
– Post-transcriptional regulation?
• HCEs in intron regions
– Enriched for RNA secondary structure: encoding functional RNAs?
Phylogenetics vs. Population Genetics
• Phylogenetics
– Assumes a single correct species phylogeny that holds across genomes
– Ignores variation among individuals of the same species or assumes negligible variability within species
– Reduces the entire population of a species to a single individual
• Population genetics
– Usually concerned with within-species variation in genomes
– Individuals within a species are related by genealogies
Siepel, A. Genome Res. 19(11):1929-41. 2009. Phylogenomics of primates and their ancestral populations.
Population-aware Phylogenetics
• Primate species
– Divergence time is short relative to ancestral population sizes
– Phylogenetics assumptions do not hold
– Non-negligible population genetic effects
• Interspecies comparison, taking into account selective forces within species, ancestral populations, modes of speciation
Phylogeny of Primates
Darwin’s Phylogeny
Genealogies in Wright-Fisher Model
Population Genetic Interpretation of Speciation
• T: coalescent time
• τ: speciation time
Population Genetic Interpretation of Speciation
• τ >> Ne:
– Divergence between individual chromosomes serves as an estimate of speciation time
– The phylogenetics assumption holds
Population Genetic Interpretation of Speciation
• τ << Ne:
– Coalescent time dominates
– Equivalent to the coalescent in population genetics
Population Genetic Interpretation of Speciation
• τ ~ Ne:
– Both ancestral population dynamics and interspecies divergence must be considered
– Population-aware phylogenetics
Three-Species Phylogeny
• Three species X, Y, and Z with speciation time and coalescent time
– X: human
– Y: chimpanzee
– Z: gorilla
• Black phylogeny: discordant with the phylogeny among the three species
• Gray phylogeny: concordant with the phylogeny among the three species
• ILS: incomplete lineage sorting with deep coalescence
Three-Species Phylogeny
• When Nxy and Nxyz are small, τxy and τxyz approximate the divergence times well
• Otherwise, the coalescent times Txy and Txyz need to be taken into account
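The two bullets above can be made concrete with a small sketch. It assumes a standard neutral coalescent that the slides do not spell out: a diploid ancestral population of constant effective size N, times measured in generations, and a per-site mutation rate per generation (called `mu` here, unrelated to the μ of the Phylo-HMM slides). Under those assumptions two lineages that enter the ancestral population at time τ take 2N further generations on average to coalesce, so E[T] = τ + 2N. The function names are hypothetical.

```python
def expected_divergence(tau, n_anc, mu):
    """Expected substitutions per site between two present-day sequences.

    tau:   speciation time in generations
    n_anc: diploid effective size of the ancestral population (assumed constant)
    mu:    per-site mutation rate per generation
    """
    coalescent_time = tau + 2.0 * n_anc  # E[T] = tau + 2N generations
    return 2.0 * mu * coalescent_time    # both lineages accumulate mutations


def naive_speciation_time(divergence, mu):
    """Speciation time obtained by ignoring the ancestral population (N = 0)."""
    return divergence / (2.0 * mu)


if __name__ == "__main__":
    tau, n_anc, mu = 200_000, 50_000, 1e-8  # illustrative values only
    d = expected_divergence(tau, n_anc, mu)
    print("divergence per site:", d)
    print("naive speciation time:", naive_speciation_time(d, mu))
```

With these illustrative numbers the naive estimate is τ + 2N = 300,000 generations, an overshoot of 2N. The overshoot vanishes as N becomes small relative to τ.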
Ancestral Recombination Graph for Three Individuals
Ancestral Recombination Graph
Phylogenetic Ancestral Recombination Graph
Coal-HMM (Hobolth et al., 2009)
• Four states corresponding to different phylogenies with ILS
• Transitions to other states correspond to recombinations
Coal-HMM
• HC1 state (with no ILS) explains only ~50% of sites
• Remaining states explain the other 50%, proportioned roughly equally
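A back-of-the-envelope sketch of why a roughly 50 / 17 / 17 / 17 split is what a simple coalescent predicts. The assumptions are not from the slides: a neutral coalescent with a constant diploid effective size in the human-chimpanzee ancestor, and the state labels HC2, HG and CG for the three deep-coalescence genealogies (only HC1 is named above). If the two speciation events are separated by Δτ generations, the human and chimpanzee lineages fail to coalesce in between with probability exp(-Δτ / 2N). In that case all three lineages enter the common ancestor and each of the three possible pairings is equally likely.

```python
import math


def coalhmm_state_proportions(delta_tau, n_hc):
    """Expected genome fractions for the four genealogy states.

    delta_tau: generations between the two speciation events
    n_hc:      diploid effective size of the human-chimpanzee ancestor
    State names other than HC1 are labels assumed for this sketch.
    """
    t = delta_tau / (2.0 * n_hc)  # internal branch length in coalescent units
    p_deep = math.exp(-t)         # no coalescence between the two speciations
    return {
        "HC1": 1.0 - p_deep,      # human-chimp coalesce early (no ILS)
        "HC2": p_deep / 3.0,      # deep coalescence, species-tree topology
        "HG": p_deep / 3.0,       # deep coalescence, human-gorilla pair first
        "CG": p_deep / 3.0,       # deep coalescence, chimp-gorilla pair first
    }


if __name__ == "__main__":
    # An internal branch of ln(2) coalescent units gives HC1 = 50%.
    n_hc = 50_000
    print(coalhmm_state_proportions(2.0 * n_hc * math.log(2.0), n_hc))
```

Read the other way, an observed HC1 fraction near 50% suggests an internal branch of about ln 2 ≈ 0.69 coalescent units under these assumptions.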
What if We Ignore Incomplete Lineage Sorting
• Aligned human (Hom), chimpanzee (Pan), gorilla (Gor), orangutan (Pon) sequences
• Two different estimated lineages
• Without consideration of ILS, substitution rates are overestimated