![Page 1: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/1.jpg)
Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics
Tandy Warnow, Founder Professor of Engineering
The University of Illinois at Urbana-Champaign
http://tandy.cs.illinois.edu
![Page 2: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/2.jpg)
From the Tree of Life website, University of Arizona
Phylogenomics
![Page 3: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/3.jpg)
1KP: Thousand Transcriptome Project
• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)
• Gene Tree Incongruence
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UCSD), N. Nguyen (UCSD), plus many, many other people…
Challenge: Alignments and trees on > 100,000 sequences
![Page 4: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/4.jpg)
1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)
![Page 5: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/5.jpg)
Re-aligning on a tree
[Diagram: a tree over subsets A, B, C, and D, with a cycle of four steps]
1. Decompose dataset (into subsets A, B, C, D)
2. Align subsets
3. Merge sub-alignments (into ABCD)
4. Estimate ML tree on merged alignment
![Page 6: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/6.jpg)
SATé and PASTA Algorithms
1. Obtain initial alignment and estimated ML tree
2. Use tree to compute new alignment
3. Estimate ML tree on new alignment
Repeat until termination condition, and return the alignment/tree pair with the best ML score
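The iteration on this slide can be sketched as a small loop. This is only an illustrative skeleton, not the SATé or PASTA code: `realign` and `estimate_ml_tree` are hypothetical callables standing in for the re-alignment step and the ML tree search, and the fixed iteration count is an assumed termination condition.

```python
def co_estimate(initial, realign, estimate_ml_tree, max_iterations=3):
    """Iterate between alignment and tree estimation.

    initial          -- (alignment, tree, ml_score) for the starting pair
    realign          -- hypothetical callable: tree -> new alignment
    estimate_ml_tree -- hypothetical callable: alignment -> (tree, ml_score)
    Returns the (alignment, tree, ml_score) triple with the best ML score seen.
    """
    best = initial
    _, tree, _ = initial
    for _ in range(max_iterations):
        alignment = realign(tree)                  # use tree to compute new alignment
        tree, score = estimate_ml_tree(alignment)  # estimate ML tree on new alignment
        if score > best[2]:                        # keep the best-scoring pair so far
            best = (alignment, tree, score)
    return best
```

Note that the loop always continues from the most recent tree, while the returned pair is the best-scoring one, which need not be the last.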
![Page 7: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/7.jpg)
1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)
24-hour SATé-I analysis, on desktop machines
(Similar improvements for biological datasets)
![Page 8: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/8.jpg)
1000-taxon models ranked by difficulty
SATé-2 better than SATé-1
SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences
SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences
![Page 9: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/9.jpg)
[Figure: RNASim. Tree error (FN rate) versus number of sequences (10,000; 50,000; 100,000; 200,000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]
PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015
– Simulated RNASim datasets from 10K to 200K taxa
– Limited to 24 hours using 12 CPUs
– Not all methods could run (missing bars could not finish)
PASTA: even better than SATé-2
![Page 10: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/10.jpg)
Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences
Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy
[Figure: three scatter plots of PASTA versus PASTA+BAli-Phy, one each for Total Column Score, Recall (SP-Score), and Tree Error: Delta RF (RAxML); datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1; each panel marks the regions "PASTA+BAli-Phy Better" and "PASTA Better"]
![Page 12: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/12.jpg)
[Figure: histogram of sequence lengths (x-axis: Length, 0 to 2000; y-axis: Counts, 0 to 12,000); mean 317, median 266]
1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary
![Page 13: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/13.jpg)
All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.
![Page 14: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/14.jpg)
Profile Hidden Markov Models
• Probabilistic model to represent a family of sequences, represented by a multiple sequence alignment
• Introduced for sequence analysis in Krogh et al. 1994. Popularized in Eddy 1996 and textbook Durbin et al. 1998
• Fundamental part of HMMER and other protein databases
• Used for: homology detection, protein family assignment, multiple sequence alignment, phylogenetic placement, protein structure prediction, alignment segmentation, etc.
![Page 15: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/15.jpg)
Profile HMMs
• Generative model for representing a MSA
• Consists of:
  – Set of states (match, insertion, and deletion)
  – Transition probabilities
  – Emission probabilities
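To make the components above concrete, here is a minimal sketch of how match states and their emission probabilities might be estimated from an MSA. It rests on assumptions that are not from the slides: a column becomes a match state when fewer than half of its entries are gaps, and emissions use add-one pseudocounts over a DNA alphabet. Transition probabilities and insert-state emissions are left out to keep it short.

```python
from collections import Counter

def match_columns(msa, max_gap_fraction=0.5):
    """Indices of columns treated as match states (assumed rule: < 50% gaps)."""
    n_rows = len(msa)
    return [j for j in range(len(msa[0]))
            if sum(row[j] == "-" for row in msa) / n_rows < max_gap_fraction]

def match_emissions(msa, alphabet="ACGT", pseudocount=1.0):
    """One emission distribution per match column, smoothed with pseudocounts."""
    emissions = []
    for j in match_columns(msa):
        counts = Counter(row[j] for row in msa if row[j] != "-")
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions
```

For the toy alignment `["AC-T", "AG-T", "A-CT", "ACGT"]`, the third column is half gaps, so under the assumed rule it is modeled by an insert state rather than a match state.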
![Page 16: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/16.jpg)
Profile Hidden Markov Model for DNA sequence alignment
![Page 17: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/17.jpg)
HMMs for MSA
• Given seed alignment (e.g., in PFAM) and a collection of sequences for the protein family:
  – Represent seed alignment using HMM
  – Align each additional sequence to the HMM
  – Use transitivity to obtain MSA
• Can we do something like this without a seed alignment?
![Page 19: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/19.jpg)
UPP
UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”
Nguyen, Mirarab, and Warnow. Genome Biology, 2015.
Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.
![Page 20: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/20.jpg)
UPP uses an ensemble of HMMs
![Page 21: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/21.jpg)
Simple idea (not UPP)
• Select random subset of sequences, and build “backbone alignment”
• Construct a Hidden Markov Model (HMM) on the backbone alignment
• Add all remaining sequences to the backbone alignment using the HMM
![Page 22: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/22.jpg)
[Figure: RNASim. Tree error (FN rate) versus number of sequences (10,000; 50,000; 100,000; 200,000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]
PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015
PASTA: even better than SATé-2
Starting tree is based on UPP (simple): one profile HMM for backbone alignment
![Page 23: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/23.jpg)
• Select random subset of sequences, and build “backbone alignment”
• Construct a Hidden Markov Model (HMM) on the backbone alignment
• Add all remaining sequences to the backbone alignment using the HMM
This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise.
![Page 24: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/24.jpg)
One Hidden Markov Model for the entire alignment?
![Page 25: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/25.jpg)
One Hidden Markov Model for the backbone alignment?
[Diagram: HMM 1]
![Page 26: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/26.jpg)
Or 2 HMMs?
[Diagram: HMM 1, HMM 2]
![Page 27: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/27.jpg)
Or 4 HMMs?
[Diagram: HMM 1, HMM 2, HMM 3, HMM 4]
![Page 28: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/28.jpg)
Or all 7 HMMs?
[Diagram: HMM 1 through HMM 7]
![Page 29: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/29.jpg)
UPP Algorithmic Approach
1. Select random subset of full-length sequences, and build “backbone alignment”
2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment
3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
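Steps 2 and 3 can be illustrated with a toy sketch. The assumptions here are mine, not the slides': the backbone sequences are given as a list in an order consistent with the backbone tree, and repeatedly halving that list stands in for a tree-based decomposition; `score` is a hypothetical function standing in for the fit of a query to the profile HMM built on a subset. One model is kept for every subset in the hierarchy, and each query is handled by the best-scoring one.

```python
def nested_subsets(leaves, max_size):
    """All subsets in the hierarchy: the full set, then recursive halves
    until a subset has at most max_size sequences."""
    subsets = [list(leaves)]
    if len(leaves) > max_size:
        mid = len(leaves) // 2
        subsets += nested_subsets(leaves[:mid], max_size)
        subsets += nested_subsets(leaves[mid:], max_size)
    return subsets

def best_model(query, models, score):
    """Pick the model that scores the query highest (score is hypothetical)."""
    return max(models, key=lambda model: score(model, query))
```

With eight backbone sequences and `max_size=2`, this toy decomposition gives 1 + 2 + 4 = 7 nested subsets, one model per subset. A fragmentary or divergent query can then be aligned with a model built from its closest relatives rather than with a single model for the whole backbone.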
![Page 30: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/30.jpg)
Evaluation
• Simulated datasets (some have fragmentary sequences):
  – 10K to 1,000,000 sequences in RNASim – complex RNA sequence evolution simulation
  – 1000-sequence nucleotide datasets from SATé papers
  – 5000-sequence AA datasets (from FastTree paper)
  – 10,000-sequence Indelible nucleotide simulation
• Biological datasets:
  – Proteins: largest BaliBASE and HomFam
  – RNA: 3 CRW datasets up to 28,000 sequences
![Page 31: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/31.jpg)
RNASim Million Sequences: tree error
Using 12 processors:
• UPP(Fast,NoDecomp) took 2.2 days,
• UPP(Fast) took 11.9 days, and
• PASTA took 10.3 days
![Page 32: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/32.jpg)
Performance on fragmentary datasets of the 1000M2 model condition
[Figure, two panels: (a) average alignment error (mean alignment error) and (b) average tree error (Delta FN tree error), each plotted against % fragmentary (0, 12.5, 25, 50) for PASTA and UPP(Default)]
Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets.
UPP is very robust to fragmentary sequences
Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.
![Page 33: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/33.jpg)
UPP Running Time
[Figure: wall-clock alignment time (hr, 0 to 15) versus number of sequences (50,000 to 200,000) for UPP(Fast)]
Wall-clock time used (in hours) given 12 processors
![Page 34: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/34.jpg)
Other Applications of the Ensemble of HMMs
SEPP (phylogenetic placement, Mirarab, Nguyen, and Warnow, PSB 2012)
TIPP (metagenomic taxon identification, Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014)
HIPPI (protein classification and remote homology detection), RECOMB-CG 2016 and BMC Genomics 2016 (Nguyen, Nute, Mirarab, and Warnow)
![Page 35: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/35.jpg)
![Page 36: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/36.jpg)
Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species level is:
50% species A
20% species B
15% species C
14% species D
1% species E
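As a small illustration of the objective, an abundance profile can be computed from per-read taxon labels by counting and normalizing. This sketch assumes every read has already been assigned to exactly one taxon at the chosen rank, which glosses over unclassified reads and is not a description of any particular tool.

```python
from collections import Counter

def abundance_profile(assignments):
    """Map each taxon to its percentage of the classified reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}
```

Applied to 100 reads labeled 50 A, 20 B, 15 C, 14 D, and 1 E, this reproduces the species-level distribution in the example.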
![Page 37: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/37.jpg)
TIPP pipeline
Input: set of reads from a shotgun sequencing experiment of a metagenomic sample
1. Assign reads to marker genes using BLAST
2. For reads assigned to marker genes, perform taxonomic analysis
3. Combine analyses from Step 2
![Page 38: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/38.jpg)
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.
![Page 39: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/39.jpg)
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
![Page 40: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/40.jpg)
Protein Family Assignment
• Input: new AA sequence (might be fragmentary) and database of protein families (e.g., PFAM)
• Output: assignment (if justified) of the sequence to an existing family in the database
![Page 41: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/41.jpg)
HIPPI
• HIerarchical Profile HMMs for Protein family Identification
• Nguyen, Nute, Mirarab, and Warnow, RECOMB-CG 2016 and BMC Genomics 2016
• Uses an ensemble of HMMs to classify protein sequences
• Tested on HMMER
![Page 42: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/42.jpg)
[Figure: precision versus recall in three panels (Seq Length: Full, 50%, 25%) for HIPPI, HMMER, BLAST, HHsearch: 1 Iteration, and HHsearch: 2 Iterations]
![Page 43: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/43.jpg)
Four Problems
• Phylogenetic Placement (SEPP, PSB 2012)
• Multiple sequence alignment (UPP, RECOMB 2014 and Genome Biology 2015)
• Metagenomic taxon identification (TIPP, Bioinformatics 2014)
• Gene family assignment and homology detection (HIPPI, RECOMB-CG 2016 and BMC Genomics 2016)
A unifying technique is the “Ensemble of Hidden Markov Models” (introduced by Mirarab et al., 2012)
![Page 44: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/44.jpg)
Summary
• Using an ensemble of HMMs tends to improve accuracy, at a cost in running time. Applications so far to taxonomic placement (SEPP), multiple sequence alignment (UPP), protein family classification (HIPPI). Improvements are mostly noticeable for large diverse datasets.
• Phylogenetically-based construction of the ensemble helps accuracy (note: the decompositions we produce are not clade-based), but the design and use of these ensembles is still in its infancy. (Many relatively similar approaches have been used by others, including Sci-Phy and FlowerPower by Sjolander.)
• The basic idea can be used with any kind of probabilistic model; it doesn’t have to be restricted to profile HMMs.
![Page 45: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/45.jpg)
Basic question: why does it help?
![Page 46: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/46.jpg)
The Tree of Life: Multiple Challenges
Scientific challenges:
• Ultra-large multiple-sequence alignment
• Alignment-free phylogeny estimation
• Supertree estimation
• Estimating species trees from many gene trees
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
• Theoretical guarantees under Markov models of evolution
Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics
![Page 47: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/47.jpg)
Acknowledgments
PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin)
PASTA+BAli-Phy: Mike Nute (PhD student at UIUC)
Current NSF grants: ABI-1458652 (multiple sequence alignment)
Grainger Foundation (at UIUC), and UIUC
TACC, UTCS, Blue Waters, and UIUC campus cluster
PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta
![Page 48: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/48.jpg)
Nguyen et al. Genome Biology (2015) 16:124

Table 2 Average alignment SP-error, tree error, and TC score across most full-length datasets

Average alignment SP-error

| Method | ROSE NT | RNASim 10K | Indelible 10K | ROSE AA | CRW | 10 AA | HomFam (17) | HomFam (2) |
|---|---|---|---|---|---|---|---|---|
| UPP | 7.8 (1) | 9.5 (1) | 1.7 (2) | 2.9 (1) | 12.5 (1) | 24.2 (1) | 23.3 (1) | 20.8 (2) |
| PASTA | 7.8 (1) | 15.0 (2) | 0.4 (1) | 3.1 (1) | 12.8 (1) | 24.0 (1) | 22.5 (1) | 17.3 (1) |
| MAFFT | 20.6 (2) | 25.5 (3) | 41.4 (3) | 4.9 (2) | 28.3 (2) | 23.5 (1) | 25.3 (2) | 20.7 (2) |
| Muscle | 20.6 (2) | 64.7 (5) | 62.4 (4) | 5.5 (3) | 30.7 (3) | 30.2 (2) | 48.1 (4) | X |
| Clustal | 49.2 (3) | 35.3 (4) | X | 6.5 (4) | 43.3 (4) | 24.3 (1) | 27.7 (3) | 29.4 (3) |

Average ΔFN error

| Method | ROSE NT | RNASim 10K | Indelible 10K | ROSE AA | CRW | 10 AA | HomFam (17) | HomFam (2) |
|---|---|---|---|---|---|---|---|---|
| UPP | 1.3 (1) | 0.8 (1) | 0.3 (1) | 1.8 (1) | 7.8 (2) | 3.4 (2) | NA | NA |
| PASTA | 1.3 (1) | 0.4 (1) | <0.1 (1) | 1.3 (1) | 5.1 (1) | 3.3 (1) | NA | NA |
| MAFFT | 5.8 (2) | 3.5 (2) | 24.8 (3) | 4.5 (3) | 10.1 (3) | 2.3 (1) | NA | NA |
| Muscle | 8.4 (3) | 7.3 (3) | 32.5 (4) | 3.1 (2) | 5.5 (1) | 12.6 (3) | NA | NA |
| Clustal | 24.3 (4) | 10.4 (4) | X | 4.2 (3) | 34.1 (4) | 3.5 (2) | NA | NA |

Average TC score

| Method | ROSE NT | RNASim 10K | Indelible 10K | ROSE AA | CRW | 10 AA | HomFam (17) | HomFam (2) |
|---|---|---|---|---|---|---|---|---|
| UPP | 37.8 (1) | 0.5 (2) | 11.0 (3) | 2.6 (2) | 1.4 (1) | 11.4 (1) | 47.3 (1) | 40.3 (3) |
| PASTA | 37.8 (1) | 2.3 (1) | 48.0 (1) | 5.4 (1) | 2.3 (1) | 12.1 (1) | 46.1 (2) | 50.0 (1) |
| MAFFT | 31.4 (2) | 0.4 (2) | 7.8 (4) | 0.6 (3) | 0.7 (2) | 12.1 (1) | 45.5 (2) | 46.9 (2) |
| Muscle | 9.8 (3) | <0.0 (2) | 18.3 (2) | 2.7 (2) | 0.7 (2) | 10.5 (2) | 27.7 (4) | X |
| Clustal | 5.7 (4) | 0.2 (2) | X | 3.1 (2) | 0.1 (2) | 11.8 (1) | 38.6 (3) | 31.0 (4) |

We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average ΔFN error (middle), and average TC score (bottom), for the collection of full-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate within the time limit (24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and the two HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the ΔFN error is indicated by “NA”). The tier ranking for each method is shown parenthetically.
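The SP-error in the caption, the average of SPFN and SPFP, can be sketched as follows. The definitions used are the standard ones and are assumed here rather than quoted from the paper: SPFN is the fraction of residue pairs aligned in the reference that are missing from the estimated alignment, and SPFP is the fraction of pairs aligned in the estimate that are absent from the reference. Alignments are represented as dictionaries from sequence name to gapped row, and both are assumed to contain at least one aligned pair. The function returns a fraction; multiply by 100 to compare with the table.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Set of residue pairs ((name, index), (name, index)) sharing a column."""
    names = sorted(alignment)
    next_residue = {name: 0 for name in names}
    pairs = set()
    for col in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, next_residue[name]))
                next_residue[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_error(reference, estimated):
    """Average of SPFN and SPFP, as a fraction between 0 and 1."""
    ref, est = homologous_pairs(reference), homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref)  # reference pairs the estimate misses
    spfp = len(est - ref) / len(est)  # estimated pairs not in the reference
    return (spfn + spfp) / 2
```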
memory error message were marked as failures. For experiments on the million-sequence RNASim dataset, we ran the methods on a dedicated machine with 256 GB of main memory and 12 cores until an alignment was generated or the method failed. We also performed a limited number of experiments on TACC with UPP’s internal checkpointing mechanism, to explore performance when time is not limited. All methods other than Muscle had parallel implementations and were able to take advantage of the 12 available cores.
On full-length datasets (Table 2) where nearly all methods were able to complete, PASTA was nearly always in
Table 3 Average alignment SP-error and tree error across fragmentary datasets

Average alignment SP-error

| Method | ROSE NT | RNASim 10K | Indelible 10K | CRW (16S.3 and 16S.T) |
|---|---|---|---|---|
| UPP | 8.3 (1) | 11.8 (1) | 2.7 (1) | 16.1 (1) |
| PASTA | 25.2 (2) | 47.7 (4) | 8.8 (2) | 23.3 (2) |
| MAFFT | 32.5 (3) | 25.5 (2) | 51.3 (3) | 24.5 (3) |
| Muscle | 35.3 (4) | 82.2 (5) | 77.6 (4) | 70.6 (5) |
| Clustal | 62.0 (5) | 35.0 (3) | X | 46.7 (4) |

Average ΔFN error

| Method | ROSE NT | RNASim 10K | Indelible 10K | CRW (16S.3 and 16S.T) |
|---|---|---|---|---|
| UPP | 1.9 (1) | 3.1 (1) | 2.5 (1) | 7.4 (2) |
| PASTA | 25.2 (3) | 21.9 (3) | 9.0 (2) | 8.2 (2) |
| MAFFT | 18.0 (2) | 6.2 (2) | 35.6 (3) | 2.5 (1) |
| Muscle | 27.5 (4) | 43.6 (5) | 45.2 (4) | 30.1 (3) |
| Clustal | 47.8 (5) | 26.3 (4) | X | 37.4 (4) |

We report the average alignment error (top) and average ΔFN error (bottom) on the collection of fragmentary datasets. Clustal-Omega failed to align any of the Indelible 10000M2 fragmentary datasets and thus we mark the results with an X. The tier ranking for each method is shown in parentheses.
![Page 49: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/49.jpg)
[Figure: grid of precision versus recall plots for HIPPI, HMMER, BLAST, HHsearch: 1 Iteration, and HHsearch: 2 Iterations; panels are split by average pairwise sequence identity (> 30%, 20−30%, < 20%), family size (0−100, > 100), and sequence length (Full, 50%, 25%)]
![Page 50: Using Ensembles of Hidden Markov Models for Grand Challenges …tandy.cs.illinois.edu/warnow-ucla-ensembles.pdf · Profile Hidden Markov Models • Probabilistic model to represent](https://reader036.vdocuments.net/reader036/viewer/2022081611/5fc3adaeca011222f06d1c32/html5/thumbnails/50.jpg)
1. Pre-processing seed alignments (Family A, Family B, …, Family N)
  i) Compute an ML tree on each seed alignment
  ii) Build ensemble of HMMs for each seed alignment, using its ML tree (in the diagram: HMM A1 to A3, HMM B1, …, HMM N1 to N5)
  iii) Collect HMMs into database
2. Classification of query sequences
  i) Score query sequence against all HMMs in database, keeping only scores above inclusion threshold (in the diagram: HMM A1 = 8.9, HMM A2 = 7.3, HMM A3 = 9.4, HMM B1 = 5.6, …, HMM N1 = 4.4, HMM N2 = 5.6, HMM N3 = 4.3, HMM N4 = 4.7, HMM N5 = 6.6)
  ii) Rank families by best scoring HMM within family (in the diagram: 1) Family A = 9.4; 2) Family N = 6.6; …; 8) Family B = 5.6)
  iii) Assign query sequence to top ranking family (in the diagram: Family A)
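The classification step can be sketched in a few lines. This is a minimal illustration, not the HIPPI implementation: the scores are the values shown in the diagram for families A, B, and N only (the elided families are omitted), and the threshold values used below are hypothetical.

```python
def rank_families(scores, threshold):
    """Rank families by their best-scoring HMM, ignoring scores at or below
    the inclusion threshold. scores maps (family, hmm_name) -> score."""
    best = {}
    for (family, _hmm), score in scores.items():
        if score > threshold and score > best.get(family, float("-inf")):
            best[family] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

def assign(scores, threshold):
    """Top-ranking family, or None when no score passes the threshold."""
    ranked = rank_families(scores, threshold)
    return ranked[0][0] if ranked else None

# Scores copied from the diagram (families A, B, and N only).
SLIDE_SCORES = {
    ("A", "A1"): 8.9, ("A", "A2"): 7.3, ("A", "A3"): 9.4,
    ("B", "B1"): 5.6,
    ("N", "N1"): 4.4, ("N", "N2"): 5.6, ("N", "N3"): 4.3,
    ("N", "N4"): 4.7, ("N", "N5"): 6.6,
}
```

With a hypothetical threshold of 5.0, `assign(SLIDE_SCORES, 5.0)` picks Family A because HMM A3 has the highest score. Returning `None` when nothing passes the threshold mirrors the "if justified" condition on the Protein Family Assignment slide.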