evolutionsssykim/teaching/f11/slides/lecture11.pdf · 2011. 10. 4. · correcting for ascertainment...

31
Evolution 02715 Advanced Topics in Computa8onal Genomics

Upload: others

Post on 25-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Evolution

02-­‐715  Advanced  Topics  in  Computa8onal  Genomics  

Page 2: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Ascertainment Bias

•  SNP  discovery  phase  –  Assume  SNPs  have  been  ascertained  in  an  alignment  of  different  

sequences  of  fixed  depth  d  

–  The  final  sample  size  n  –  Ascertainment  condi8on:  the  locus  was  variable  in  the  ascertainment  

sample  

Page 3: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Correcting for Ascertainment Bias

•  Likelihood  for  allele  frequencies  aMer  condi8oning  on  ascertainment  (i.e.,  unobserved  true  allele  frequencies)  

Page 4: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Various Extensions

•  Varia8on  in  d  –  Informa8on  about  d  is  not  known,  but  we  know  the  distribu8on  of  d  

among  loci  

•  Allele  frequencies  in  the  ascertainment  sample  is  unknown  –  The  ascertainment  sample  may  not  have  been  included  in  the  final  

typed  sample.  

–  Ascertainment  condi8on:  the  variability  in  the  ascertainment  sample  and  variability  in  the  typed  sample.  

Page 5: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Correcting for Ascertainment Bias (Nielson et al., 2004)

•  Illustra8on  through  simula8on  study  (20  genes,  10,000  SNPs,  5  genes  for  ascertainment)  

Page 6: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Ascertainment Bias from HapMap Analysis

Page 7: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Cross-species Sequence Analysis

•  Func8onal  regions  of  genomes  are  conserved  across  species  

•  Cross-­‐species  sequence  conserva8on  is  believed  to  occur  because  of  nega8ve  (purifying)  selec8on  

•  About  5%  or  more  of  bases  in  mammalian  genomes  are  under  purifying  selec8on  

•  Protein  coding  genes  account  for  1.5%  of  the  regions  under  purifying  selec8on  

Page 8: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylo-HMM

•  Parse  aligned  sequences  into  two  classes  –  Conserved  vs.  nonconserved  

•  Maximum  likelihood  es8ma8on  of  parameters  of  Phylo-­‐HMM  

Page 9: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylo-HMM

•  μ,  ν:  transi8on  probabili8es  

•  States  –  c:  conserved  region  –  n:  non-­‐conserved  region  

•  ψn,  ψc:  emission  probabili8es  as  a  tree      

Page 10: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylo-HMM

•  ψn,  ψc:  emission  probabili8es  as  a  phylogene8c  model  –  Iden8cal  phylogene8c  model  structure  for  two  states  

–   ρ:  scaling  factor  for  branch  length  0≤ρ≤1  •  Average  subs8tu8on  rate  

Page 11: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Datasets

•  Vertebrate  species  –  Human,  mouse,  rat,  chicken,  fugu  rubripes  

–  Alignment  with  human  as  reference  sequence  

•  Insect  species  –  Three  species  of  Drosophila  and  Anopheles  gambiae  

–  Alignment  with  D.  melanogaster  as  reference  sequence  

•  Two  species  of  Caenorhabdi8s    –  Alignment  with  C.  elegans  as  reference  sequence  

•  Seven  species  of  saccharomyces  –  Alignment  with  S.  cerevisiae  as  reference  sequence  

Page 12: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylogenetic Models: Assumed Topologies and Estimated Branch Lengths

Page 13: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Estimated Conserved Elements

•  More  complex  organisms  have  more  conserved  regions  outside  of  coding  regions  

Vertebrate  

Insect  

Worm  

Yeast  

Page 14: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Conservation Around GRIA2 in Human

Page 15: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Extreme Conservation

•  Extreme  conserva8on  at  the  3’  end  of  the  ELAVL4  gene  

Page 16: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Key Observations

•  Conserved  regions  –  3%-­‐8%  of  the  human  genome  conserved  in  vertebrates  and  other  

mammals  –  37-­‐53%  in  D.  melanogaster  –  18-­‐37%  in  C.  elegans  –  47-­‐68%  in  S.  cerevisiae  

•  Highly  conserved  regions  (HCE)  –  42%  of  HCEs  overlap  with  exons  in  vertebrate  genomes  –  >93%  for  insects,  worms,  yeasts  

•  Extreme  conserva8ons  in  3’  UTRs  –  Post-­‐transrip8onal  regula8on?  

•  HCEs  in  intron  regions  –  Enriched  for  RNA  secondary  structure:  encoding  func8onal  RNAs?  

Page 17: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylogenetics vs. Population Genetics

•  Phylogene8cs  –  Assumes  a  single  correct  species  phylogeny  that  holds  across  genomes  

–  Ignores  varia8ons  among  individuals  of  the  same  species  or  assumes  a  negligible  variability  within  species  

–  Reduces  the  en8re  popula8on  of  a  species  into  a  single  individual  

•  Popula8on  gene8cs  –  Usually  concerned  with  within-­‐species  varia8on  in  genomes  –  Individuals  within  a  species  are  related  by  genealogies  

Siepel,  A.  Genome  Res.  19(11):1929-­‐41.  2009.  Phylogenomics  of  primates  and  their  ancestral  popula8ons.  

Page 18: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Population-aware Phylogenetics

•  Primate  species  –  Divergence  8me  is  short  rela8ve  to  ancestral  popula8on  sizes  

–  Phylogene8cs  assump8ons  do  not  hold  –  Non-­‐negligible  popula8on  gene8c  effects  

•  Interspecies  comparison,  taking  into  account  selec8ve  forces  within  species,  ancestral  popula8ons,  modes  of  specia8on  

Page 19: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Phylogeny of Primates

Siepel,  A.  Genome  Res.  19(11):1929-­‐41.  2009.  Phylogenomics  of  primates  and  their  ancestral  popula8ons.  

Page 20: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Darwin’s Phylogeny

Page 21: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Genealogies in Wright-Fisher Model

Page 22: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Population Genetic Interpretation of Speciation

•  T:  coalescent  8me  

•  τ:  specia8on  8me    

Page 23: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Population Genetic Interpretation of Speciation

•  τ>>Ne:    –  Divergence  between  individual  chromosomes  as  an  es8mate  of  specia8on  8me  

–  the  phylogene8cs  assump8on  holds  

Page 24: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Population Genetic Interpretation of Speciation

•  τ<<Ne:    –  Coalescent  8me  dominates  

–  Equivalent  to  the  coalescent  in  popula8on  gene8cs  

Page 25: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Population Genetic Interpretation of Speciation

•  τ~Ne:    –  Both  ancestral  popula8on  dynamics  and  interspecies  divergence  must  be  considered  

–  Popula8on-­‐aware  phylogene8cs  

Page 26: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Three-Species Phylogeny

•  Three  species  X,  Y,  and  Z  with  specia8on  8me  and  coalescent  8me  –  X:  human  –  Y:  chimpanzee  –  Z:  gorilla  

•  Black  phylogeny:  discordance  with  the  phylogeny  among  the  three  species  

•  Gray  phylogeny:  concordant  with  the  phylogeny  among  the  three  species  

•  ILS:  incomplete  lineage  sor8ng  with  deep  coalescent  

Page 27: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Three-Species Phylogeny

•  When  Nxy,  Nxyz  are  small,  τxy  and  τxyz  approximate  the  divergence  8me  well  

•  Otherwise,  the  coalescent  8me  Txy,  Txyz  need  to  be  taken  into  account  

Page 28: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Ancestral Recombination Graph for Three Individuals

Ancestral  Recombina8on  Graph  

Phylogene8c  Ancestral  Recombina8on  Graph  

Page 29: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Coal-HMM (Hobolth et al., 2009)

•  Four  states  corresponding  to  different  phylogenies  with  ILS  

•  Transi8ons  to  other  states  correspond  to  recombina8ons  

Page 30: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

Coal-HMM

•  HC1  state  (with  no  ILS)  explains  only  ~50%  of  sites  

•  Remaining  states  explain  the  other  50%  propor8oned  roughly  equally  

Page 31: Evolutionsssykim/teaching/f11/slides/Lecture11.pdf · 2011. 10. 4. · Correcting for Ascertainment Bias • Likelihood’for’allele’frequencies’aer’condi8oning’on’ ascertainment(i.e.,’unobserved’true’allele

What if We Ignore Incomplete Lineage Sorting

•  Aligned  human  (Hom),  chimpanzee  (Pan),  gorilla  (Gor),  orangutan  (Pon)  sequences  

•  Two  different  es8mated  lineages  •  Without  considera8on  of  ILS,  subs8tu8on  rates  are  

overes8mated