cs 581 / bioe 540: algorithmic computaonal genomicstandy.cs.illinois.edu/581-intro-to-course.pdf ·...

70
CS 581 / BIOE 540: Algorithmic Computa<onal Genomics Tandy Warnow Departments of Bioengineering and Computer Science hHp://tandy.cs.illinois.edu

Upload: others

Post on 09-Sep-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

CS581/BIOE540:AlgorithmicComputa<onalGenomics

TandyWarnowDepartmentsofBioengineeringandComputerSciencehHp://tandy.cs.illinois.edu

Page 2: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

CourseDetails

Officehours:Tuesdays12:30-1:30(Siebel3235)Coursewebpage: hHp://tandy.cs.illinois.edu/581-2017.htmlTextbook:Computa<onalPhylogene<cs,availablefordownloadathHp://tandy.cs.illinois.edu/textbook.pdfTA:PranjalVachaspa<(tobeconfirmed)

Page 3: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Today

•  Describesomeimportantproblemsincomputa<onalbiology,forwhichstudentsinthiscoursecoulddevelopimprovedmethods

•  Explainhowthecoursewillberun•  Answerques<ons

Page 4: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

ThisCourseTopics:computa<onalandsta<s<calproblemsinsequenceanalysis(e.g.,mul<plesequencealignment,phylogenyes<ma<on,metagenomics,etc.).Focus:understandingthemathema<calfounda<ons,anddesigningalgorithmswithoutstandingaccuracyandspeedonlarge,complexdatasets.Thisisnotacourseabouthowtousethetools.

Page 5: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Prerequisites

Nobackgroundinbiologyisneeded.However,thecoursehasthefollowingprerequisites:

•  CS374:computa<onalcomplexity,algorithmdesigntechniques,andprovingtheoremsaboutalgorithms

•  CS361:probabilityandsta<s<cs

•  Byrecursion,CS225:programming

Page 6: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Ifyouhaven’tsa<sfiedthepre-reqs:

Youneedpermissiontostayinthecourse.

•  Thefirsthomeworkisdue(byemail)onSaturdayat1PM.SeethehomeworkwebpagehHp://tandy.cs.illinois.edu/cs581-2017-hw.html

•  Thenmakeanappointmenttoseemetoreviewthehomework.

Page 7: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Thiscourse

•  Phylogenyes<ma<onbasedonstochas<cmodelsofsequenceevolu<onandgenomeevolu<on

•  Mul<plesequencealignment•  Applica<onstometagenomics,protein

structurepredic<on,andotherbiologicalproblems

Page 8: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Species Tree

Page 9: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Evolu<oninformsabouteverythinginbiology

•  Biggenomesequencingprojectsjustproducedata------sowhat?

•  Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict–  interac<onsbetweengenes(gene<cnetworks)–  drugdesign–  predic<ngfunc<onsofgenes–  influenzavaccinedevelopment–  originsandspreadofdisease–  originsandmigra<onsofhumans

Page 10: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Constructing the Tree of Life: Hard Computational Problems

NP-hardproblemsLargedatasets

100,000+sequencesthousandsofgenes

“Bigdata”complexity:

modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata

Page 11: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phylogenomicpipeline

SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:

Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments

Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology

Page 12: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phylogenomicpipeline

SelecttaxonsetandmarkersGatherandscreensequencedata,possiblyiden<fyorthologsComputemul<plesequencealignmentsforeachlocusComputespeciestreeornetwork:

Computegenetreesonthealignmentsandcombinethees<matedgenetrees,OREs<mateatreefromaconcatena<onofthemul<plesequencealignments

Getsta<s<calsupportoneachbranch(e.g.,bootstrapping)Es<matedatesonthenodesofthephylogenyUsespeciestreewithbranchsupportanddatestounderstandbiology

Page 13: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Research Strategies

Improvedalgorithmsthrough:•  Divide-and-conquer•  “Bin-and-conquer”•  Iteration•  Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory•  Combinatorialop<miza<on

Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng

Page 14: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

AvianPhylogenomicsProject

GZhang,BGI

• Approx.50species,wholegenomes• 14,000loci

MTPGilbert,Copenhagen

S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n

T.WarnowUT-Aus<n

Plusmanymanyotherpeople…

ErichJarvis,HHMI

Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)

Page 15: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)l  Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin

Plus many many other people…

UpcomingChallenges(~1200species,~400loci)

Page 16: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

DNA Sequence Evolution

AAGACTT

TGGACTT AAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT

AGGGCAT TAGCCCT AGCACTT

AAGACTT

TGGACTT AAGGCCT

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

AGCGCTT AGCACAA TAGACTT TAGCCCA AGGGCAT

Page 17: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT

U V W X Y

U

V W

X

Y

Page 18: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Performance criteria •  Running time •  Space •  Statistical performance issues (e.g., statistical

consistency) with respect to a Markov model of evolution

•  “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation

•  Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Page 19: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Quantifying Error

FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate

FN

FP

Page 20: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Page 21: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

1  Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

Local optimum

Cost

Global optimum

Phylogenetic trees 2  Polynomial time distance-based methods: Neighbor

Joining, FastME, etc. 3. Bayesian methods

Phylogenetic reconstruction methods

Page 22: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Solving maximum likelihood (and other hard optimization problems) is… unlikely

# of Taxa

# of Unrooted Trees

4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020

100 4.5 x 10190

1000 2.7 x 102900

Page 23: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

FP

50% error rate

FN

Page 24: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!

NJ

0 400 800 No. Taxa

1600 1200

0.4 0.2

0

0.6

0.8

Erro

r Rat

e

Page 25: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

Page 26: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT

U V W X Y

U

V W

X

Y

Page 27: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

AGAT TAGACTT TGCACAA TGCGCTT AGGGCATGA

U V W X Y

X U

Y

V W

TheRealProblem!

Page 28: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

…ACGGTGCAGTTACC-A…

…AC----CAGTCACCTA…

The true multiple alignment –  Reflects historical substitution, insertion, and deletion events –  Defined using transitive closure of pairwise alignments computed on

edges of the true tree

…ACGGTGCAGTTACCA… Insertion

Substitution Deletion

…ACCAGTCACCTA…

Page 29: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Page 30: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Page 31: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

S1

S4

S2

S3

Page 32: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Two-phase estimation Alignment methods •  Clustal •  POY (and POY*) •  Probcons (and Probtree) •  Probalign •  MAFFT •  Muscle •  Di-align •  T-Coffee •  Prank (PNAS 2005, Science 2008) •  Opal (ISMB and Bioinf. 2007) •  FSA (PLoS Comp. Bio. 2009) •  Infernal (Bioinf. 2009) •  Etc.

Phylogeny methods •  Bayesian MCMC•  Maximum

parsimony•  Maximum likelihood•  Neighbor joining•  FastME•  UPGMA•  Quartet puzzling•  Etc.

Page 33: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Simulation Studies

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

S1 S2

S4 S3

True tree and alignment

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Compare

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA

S1 S4

S2 S3

Estimated tree and alignment

Unaligned Sequences

Page 34: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

1000taxonmodels,orderedbydifficulty(Liuetal.,2009)

Page 35: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Multiple Sequence Alignment (MSA): another grand challenge1

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation

1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Page 36: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Major Challenges

•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

•  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

Page 37: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

(Phylogenetic estimation from whole genomes)

Page 38: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Species Tree Estimation requires multiple genes!

Page 39: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Two basic approaches for species tree estimation

•  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods

•  Compute trees on individual genes and

combine gene trees

Page 40: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Usingmul<plegenes

gene 1S1

S2

S3

S4

S7

S8

TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC

TATAACGGAA

gene 3TATTGATACA

TCTTGATACC

TAGTGATGCA

CATTCATACC

TAGTGATGCA

S1

S3

S4

S7

S8

gene 2GGTAACCCTCGCTAAACCTC

GGTGACCATC

GCTAAACCTC

S4

S5

S6

S7

Page 41: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Concatena<on gene 1

S1

S2

S3

S4

S5

S6

S7

S8

gene 2 gene 3 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA

TCTAATGGAC

TATAACGGAA

GGTAACCCTCGCTAAACCTC

GGTGACCATC

GCTAAACCTC

TATTGATACA

TCTTGATACC

TAGTGATGCA

CATTCATACC

TAGTGATGCA ? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

? ? ? ? ? ? ? ? ? ?

Page 42: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Redgenetree≠speciestree(greengenetreeokay)

Page 43: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

GeneTreeIncongruence

Genetreescandifferfromthespeciestreedueto:

Duplica<onandlossHorizontalgenetransferIncompletelineagesor<ng(ILS)

Page 44: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

IncompleteLineageSor<ng(ILS)

Confoundsphylogene<canalysisformanygroups:

HominidsBirdsYeastAnimalsToadsFishFungi

Thereissubstan<aldebateabouthowtoanalyzephylogenomicdatasetsinthepresenceofILS.

Page 45: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

LineageSor<ng

Popula<on-levelprocess,alsocalledthe “Mul<-speciescoalescent”(Kingman,1982)Genetreescandifferfromspeciestreesduetoshort<mesbetweenspecia<oneventsorlargepopula<onsize;thisiscalled“IncompleteLineageSor<ng”or“DeepCoalescence”.

Page 46: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

TheCoalescent

Present

Past

CourtesyJamesDegnan

Page 47: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

GenetreeinaspeciestreeCourtesyJamesDegnan

Page 48: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Keyobserva<on:Underthemul<-speciescoalescentmodel,thespeciestreedefinesaprobabilitydistribu4ononthegenetrees,andisiden4fiablefromthedistribu4onongenetrees

CourtesyJamesDegnan

Page 49: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

. . .

Analyzeseparately

Summary Method

Twocompe<ngapproaches

gene 1 gene 2 . . . gene k

. . . Concatenation

Species

Page 50: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website, University of Arizona

Speciestreees<ma<on:difficult,evenforsmalldatasets!

Page 51: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also

impacted by fragmentary data.

Page 52: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Major Challenges: large datasets, fragmentary sequences

•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.

•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).

•  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging.

•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution

Both phylogenetic estimation and multiple sequence alignment are also

impacted by fragmentary data.

Page 53: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

AvianPhylogenomicsProject

GZhang,BGI

• Approx.50species,wholegenomes• 14,000loci

MTPGilbert,Copenhagen

S.MirarabMd.S.Bayzid,UT-Aus<nUT-Aus<n

T.WarnowUT-Aus<n

Plusmanymanyotherpeople…

ErichJarvis,HHMI

Challenges:•  Speciestreees<ma<onunderthemul<-speciescoalescentmodel,from14,000poores<matedgenetrees,allwithdifferenttopologies(weused“sta<s<calbinning”)•  Maximumlikelihoodes<ma<ononamillion-sitegenome-scalealignment–250CPUyears

Science,December2014(Jarvis,Mirarab,etal.,andMirarabetal.)

Page 54: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

1kp:ThousandTranscriptomeProject

l  PlantTreeofLifebasedontranscriptomesof~1200speciesl  Morethan13,000genefamilies(mostnotsinglecopy)l  Firstpaper:PNAS2014(~100speciesand~800loci)GeneTreeIncongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin

Plus many many other people…

UpcomingChallenges(~1200species,~400loci):•  Speciestreees<ma<onunderthemul<-speciescoalescentfromhundredsofconflic<nggenetreeson>1000species(wewilluseASTRAL–Mirarabetal.2014,Mirarab&Warnow

2015)•  Mul<plesequencealignmentof>100,000sequences(withlotsoffragments!)–wewilluseUPP(Nguyenetal.,2015)

Page 55: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Constructing the Tree of Life: Hard Computational Problems

NP-hardproblemsLargedatasets

100,000+sequencesthousandsofgenes

“Bigdata”complexity:

modelmisspecifica<onfragmentarysequenceserrorsininputdatastreamingdata

Page 56: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Research Strategies

Improvedalgorithmsthrough:•  Divide-and-conquer•  “Bin-and-conquer”•  Iteration•  Bayesiansta<s<cs• HiddenMarkovModels• Graphtheory•  Combinatorialop<miza<on

Sta<s<calmodellingMassiveSimula<onsHighPerformanceCompu<ng

Page 57: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Evolu<oninformsabouteverythinginbiology

•  Biggenomesequencingprojectsjustproducedata------sowhat?

•  Evolu<onaryhistoryrelatesallorganismsandgenes,andhelpsusunderstandandpredict–  interac<onsbetweengenes(gene<cnetworks)–  drugdesign–  predic<ngfunc<onsofgenes–  influenzavaccinedevelopment–  originsandspreadofdisease–  originsandmigra<onsofhumans

Page 58: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Metagenomics: Venter et al., Exploring the Sargasso Sea:

Scientists Discover One Million New Genes in Ocean Microbes

Page 59: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown

species Taxon identification: given short sequences,

identify the species for each fragment Applications: Human Microbiome Issues: accuracy and speed

MihaiPop,UnivMaryland

Page 60: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Metagenomictaxoniden6fica6on

Objec<ve:classifyshortreadsinametagenomicsample

Page 61: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

PossibleIndo-Europeantree(Ringe,WarnowandTaylor2000)

Anatolian

Albanian

Tocharian

Greek

Germanic

Armenian

Italic

Celtic

Baltic Slavic

Vedic Iranian

Page 62: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

“PerfectPhylogene<cNetwork” forIENakhlehetal.,Language2005

Anatolian

Albanian

Tocharian

Greek

Germanic

Armenian

Baltic Slavic

Vedic Iranian

Italic

Celtic

Page 63: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

Grading

•  Homework:25%(onehwdropped)•  Midterm:40%(March30)•  FinalProject:25%(dueMay6)•  CoursePar<cipa<on:10%

Nofinalexam.

Page 64: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

HomeworkAssignments

•  HomeworkassignmentsarelistedathHp://tandy.cs.illinois.edu/cs581-2017-hw.htmlandaredueat1PM(inpersonorviaemail)–latehomeworkshavereducedcreditandwillnotbeacceptedayer48hourspastthedeadline.

•  Youareencouragedtoworkwithothersonyourhomework,butyoumustwriteupsolu<onsbyyourselfandindicatewhoyouworkedwithoneachhomework.

Page 65: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

CourseSchedule

AdetailedcoursescheduleisathHp://tandy.cs.illinois.edu/cs581-2017-detailed-syllabus.htmlThisscheduleincludesmaterialyouareexpectedtohavelookedatbeforecomingtoclass:•  assignedreading(fromtextbookand/or

scien<ficliterature)•  PPTand/orPDFofmylecture

Page 66: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

FinalProjectandClassPresenta<on

Eitherresearchproject(canbewithanotherstudent)orsurveypaper(donebyyourself).Manyinteres<ngandpublishableproblemstoaddress:seehHp://tandy.cs.illinois.edu/topics.htmlYourclasspresenta<onshouldberelatedtoyourfinalproject.

Page 67: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

AcademicIntegrity

PleaseseecoursewebsiteathHp://tandy.cs.illinois.edu/581-2017.htmlandalsohHp://tandy.cs.illinois.edu/ethics.pdfForthiscourse:•  Examinethepolicyaboutcollabora<on•  Learnandunderstandwhatplagiarismis(and

thendon’tdoit).Thisappliestohomework,allwri<ngassignments,andthefinalproject.

Page 68: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

CourseResearchProjects

•  Evalua<ngexis<ngmethodsonsimulatedandreal(biologicalorlinguis<c)datasets

•  Designinganewmethod,andestablishingitsperformance(usingtheoryanddata)

•  Analyzingabiologicaldatasetusingseveraldifferentmethods,toaddressbiology

Page 69: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

ExamplesofpublishedcourseprojectsMdS.Bayzid,T.Hunt,andT.Warnow."DiskCoveringMethodsImprovePhylogenomicAnalyses".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S7.T.Zimmermann,S.MirarabandT.Warnow."BBCA:Improvingthescalabilityof*BEASTusingrandombinning".ProceedingsofRECOMB-CG(Compara<veGenomics),2014,andBMCGenomics2014,15(Suppl6):S11.J.Chou,A.Gupta,S.Yaduvanshi,R.Davidson,M.Nute,S.MirarabandT.Warnow.“Acompara<vestudyofSVDquartetsandothercoalescent-basedspeciestreees<ma<onmethods”.RECOMB-Compara<veGenomicsandBMCGenomics,2015.,2015,16(Suppl10):S2.P.Vachaspa<andT.Warnow(2016).FastRFS:FastandAccurateRobinson-FouldsSupertreesusingConstrainedExactOp<miza<onBioinforma<cs2016;doi:10.1093/bioinforma<cs/btw600.(SpecialissueforpapersfromRECOMB-CG)

Page 70: CS 581 / BIOE 540: Algorithmic Computaonal Genomicstandy.cs.illinois.edu/581-intro-to-course.pdf · CS 581 / BIOE 540: Algorithmic Computaonal Genomics Tandy Warnow Departments of

ResearchProjectsyoucouldjoin•  Phylogenomicsprojects(Avianandthe1KP)

•  Speciestreeandnetworkes<ma<onfromconflic<nggenes•  Large-scalemul<plesequencealignment•  Large-scalemaximumlikelihoodtreees<ma<on•  Improvinggenetreees<ma<onusingwholegenomes

•  Metagenomics(withMihaiPop,UniversityofMaryland,andBillGropp)•  Iden<fyinggenesandtaxafromshortsequences•  Metagenomicassembly•  Applica<onstoclinicaldiagnos<cs

•  ProteinSequenceAnalysis(withJianPeng)•  Whatfunc<onandstructuredoesthisproteinhave?•  Howdidstructureandfunc<onevolve?

•  HistoricalLinguis<cs(withDonaldRinge,UPenn)•  HowdidIndo-Europeanevolve?•  Designingandimplemen<ngsta<s<cales<ma<onmethodsfor

languagephylogenies