![Page 1: Why Aren't We Benchmarking Bioinformatics? · Why Aren't We Benchmarking Bioinformatics?! Joe Parker" Early Career Research Fellow (Phylogenomics)" Department of Biodiversity Informatics](https://reader030.vdocuments.net/reader030/viewer/2022041105/5f0763ad7e708231d41cbdf6/html5/thumbnails/1.jpg)
Why Aren't We Benchmarking Bioinformatics?
Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew
Richmond TW9 3AB
[email protected]
Outline
• Introduction
• Brief history of Bioinformatics
• Benchmarking in Bioinformatics
• Case study 1: Typical benchmarking across environments
• Case study 2: Mean-variance relationship for repeated measures
• Conclusions: implication for statistical genomics
A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
A (very) brief history of bioinformatics
Stewart et al. (1987) Nature 330:401-404
A (very) brief history of bioinformatics
ENCODE Consortium (2012) Nature 489:57-74
A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
Benchmarking to biologists
• Benchmarking as a comparative process
• i.e. ‘which software’s best?’ / ‘which platform’
• Benchmarking application logic / profiling unknown
• Environments / runtimes generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
Case Study 1
aka
‘Which program’s the best?’
Bioinformatics environments are very heterogeneous
• Laptop:
– Portable
– Very costly form-factor
– Maté? Beer?
• Raspi:
– Low: cost, energy (& power)
– Highly portable
– Hackable form-factor
• Clusters:
– Not portable, setup costs
• The cloud:
– Power closely linked to budget (as limited as)
– Almost infinitely scalable
– Have to have a connection to get data up there (and down!)
– Fiddly setup
Benchmarking to biologists
Comparison

| System | Arch | CPU type, clock GHz | cores | RAM Gb / MHz / type | HDD Gb |
|---|---|---|---|---|---|
| Haemodorum | i686 | Xeon E5620 @ 2.4 | 8 | 33 | 1000 @ SATA |
| Raspberry Pi 2 B+ | ARM | ARMv7 @ 1.0 | 1 | 1 | 8 @ flash card |
| Macbook Pro (2011) | x64 | Core i7 @ 2.2 | 4 | 8 | 250 @ SSD |
| EC2 m4.10xlarge | x64 | Xeon E5 @ 2.4 | 40 | 160 | 320 @ SSD |
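A comparison like this depends on knowing what each machine actually is, so it may help to record that alongside every timing. The following is a minimal sketch using only Python's standard library; the function name `describe_environment` and the dictionary keys are illustrative assumptions, not part of the original workflow.

```python
import os
import platform


def describe_environment():
    """Collect basic facts about the machine a benchmark runs on."""
    return {
        "system": platform.system(),        # operating system name
        "machine": platform.machine(),      # architecture string reported by the OS
        "processor": platform.processor(),  # may be an empty string on some platforms
        "logical_cores": os.cpu_count(),    # may be None if it cannot be determined
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Storing this dictionary with each result file is one way to avoid treating runtimes as identical when they are not.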
Reviewing / comparing new methods
• Biological problems often scale horribly unpredictably
• Algorithm analyses
• So empirical measurements on different problem sets to predict how problems will scale…
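One simple way to turn empirical measurements into a scaling prediction is to fit a power law, t ≈ c·n^k, by least squares on log-transformed sizes and times. This is a sketch under that assumption only; real biological problems may not follow a power law, and the timings below are made up for illustration. The name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(c) + k*log(n); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope on the log-log scale = exponent
    c = math.exp(mean_y - k * mean_x)  # intercept back-transformed to a constant
    return c, k


# Made-up timings (seconds) that follow t = 0.5 * n**2 exactly, for illustration only.
sizes = [10, 20, 40, 80]
times = [0.5 * n ** 2 for n in sizes]
c, k = fit_power_law(sizes, times)
predicted = c * 160 ** k  # extrapolated runtime for an unmeasured problem size
print(f"exponent k = {k:.2f}, predicted time at n=160: {predicted:.0f} s")
```

The fitted exponent is only as trustworthy as the range of problem sizes measured, so extrapolating far beyond that range is risky.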
Workflow
• Setup: Set up workflow, binaries, and reference / alignment data. Deploy to machines.
• BLAST 2.2.30 (inputs: CEGMA genes, short reads): Protein-protein BLAST of reads (from MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA. Keep only top hits. Use max. num_threads available.
• Concatenate hits to CEGMA alignments: Append top hit sequences to CEGMA alignments.
• For each:
– Muscle 3.8.31: Align in MUSCLE using default parameters.
– RAxML 7.2.8+: Infer de novo phylogeny in RAxML under Dayhoff, random starting tree and max. PTHREADS.
• Output and parse times.
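The "output and parse times" step could be approached by wrapping each stage in a wall-clock timer. The sketch below is not the harness used for these results: the helper `time_step` is hypothetical, and the commands are placeholders that only start a Python interpreter, standing in for the real BLAST, MUSCLE and RAxML invocations.

```python
import subprocess
import sys
import time


def time_step(name, command):
    """Run one workflow step and return (name, wall-clock seconds)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)  # raises if the step exits non-zero
    return name, time.perf_counter() - start


# Placeholder commands: each just starts a Python interpreter that does nothing.
# In a real run these would be the actual BLAST, MUSCLE and RAxML command lines.
placeholder = [sys.executable, "-c", "pass"]
steps = [
    ("blast", placeholder),
    ("muscle", placeholder),
    ("raxml", placeholder),
]

timings = [time_step(name, command) for name, command in steps]
for name, seconds in timings:
    print(f"{name}\t{seconds:.3f}")
```

Writing one tab-separated line per step keeps the timings easy to parse and compare across machines.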
Results - BLASTP
Results - RAxML
Case Study 2
aka
‘What the hell’s a random seed?’
[Figure: Mean-variance plot for sitewise lnL estimates in PAML, n=10. x-axis: mean log-likelihood (-50 to -10); y-axis: variance (0.000 to 0.014).]
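A mean-variance relationship of this kind can be computed by taking, for each site, the mean and sample variance across repeated runs. The sketch below uses synthetic numbers only (not PAML output), with noise drawn from a fixed random seed so the example is repeatable; the function name `mean_variance` and the data layout are assumptions.

```python
import random
import statistics


def mean_variance(replicates):
    """Per-site (mean, sample variance) across replicate runs.

    `replicates` is a list of runs; each run is a list of sitewise values.
    """
    per_site = zip(*replicates)  # regroup values so each tuple is one site
    return [(statistics.mean(site), statistics.variance(site)) for site in per_site]


# Synthetic data only: 5 sites, n=10 replicate runs, noise added with a fixed seed.
rng = random.Random(42)
true_lnl = [-45.0, -30.0, -22.5, -15.0, -11.0]
runs = [[value + rng.gauss(0.0, 0.05) for value in true_lnl] for _ in range(10)]

for mean, var in mean_variance(runs):
    print(f"{mean:8.3f}\t{var:.5f}")
```

Changing the seed changes the noise, which is the point of the case study: repeated measures of the same quantity need not agree exactly.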
Properly benchmarking workflows
• Ignoring limiting-steps analyses
– in many workflows might actually be data cleaning / parsing / transformation
• Or (most common error) inefficiently iterating
• Or even disk I/O!
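As an illustration of how inefficient iteration can become the limiting step, the sketch below filters read identifiers against a list of wanted hits in two ways and times each. The identifiers, sizes and function names are invented for the example; only the contrast between a list lookup and a set lookup matters.

```python
import time


def filter_hits_slow(read_ids, wanted):
    """Membership test against a list: each lookup scans the list."""
    return [r for r in read_ids if r in wanted]


def filter_hits_fast(read_ids, wanted):
    """Same result, but a set makes each lookup roughly constant time."""
    wanted_set = set(wanted)
    return [r for r in read_ids if r in wanted_set]


read_ids = [f"read{i}" for i in range(5000)]
wanted = [f"read{i}" for i in range(0, 5000, 10)]  # 500 hypothetical top hits

for func in (filter_hits_slow, filter_hits_fast):
    start = time.perf_counter()
    kept = func(read_ids, wanted)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: kept {len(kept)} reads in {elapsed:.4f} s")
```

Both functions return the same reads, so a difference in elapsed time comes from the parsing logic alone, not from the analysis software being benchmarked.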
Workflow benchmarking very rare
• Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc
• http://beast.bio.ed.ac.uk/benchmarks
• Many e.g. bioinformatics papers
• More harm than good?
Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges too…
Thanks: RBG Kew, BI&SA, Mike Chester