LIFE SCIENCES AND ALGORITHMIC DESIGN: THE NEED FOR SPEED, THE JOY OF SPEED
Raffaele Giancarlo
Dipartimento di Matematica ed Informatica, Università di Palermo
CINI InfoLife Lab
Summary- Short Version
• Impact of Algorithmic Theory and Practice on Modern Biology:
• Gene Myers:
  • ACM Paris Kanellakis Theory and Practice Award 2001
  • ISMB Career Award 2014
Summary- Long Version
• Some History
• The data deluge: is algorithmic theory and practice catering to biology?
The Thirties
Algorithms: formalization… Turing, Church, Kleene, Gödel, Post… An algorithm is an unambiguous, ordered, finite sequence of steps, each effectively executable in finite time, producing a result in finite time
The forties again
Claude Shannon: the birth of Information Theory… and data compression
The guy is quite a character, please visit: https://www.youtube.com/watch?v=G5rJJgt_5mg
The Fab Sixties
Not only algorithms but fast algorithms, and formal methodologies for their design and analysis: Knuth, Tarjan, Hopcroft
The seventies
NP-Completeness: not all problems seem to admit an efficient algorithmic solution… and Computational Biology has plenty of examples
Edmonds, Cook, Karp
Discrete Algorithms
• Discrete mathematical objects: good models to represent computational problems
Example: Graphs
Discrete Algorithms
•Discrete mathematical objects: Efficient organization of information
• Example : Trees
Discrete Algorithms
•How to establish the performance of an algorithm: Models of computers, Hardware, etc.
• Here: The “real thing”: The Turing Machine or equivalent
Discrete Algorithms and Bioinformatics
- Do we need more algorithms?
• PubMed search: 21471 papers [1991, 2014]
• Scopus: 101178 papers (Biochemistry, Genetics, Molecular Biology)
We need GOOD ALGORITHMS
Discrete Algorithms and Bioinformatics
• Good Algorithms
  • Fast and memory efficient, i.e., able to process growing amounts of data in “reasonable” time and little space
  • Accurate, i.e., able to identify useful biological information in terms of function and/or structure
Discrete Algorithms and Bioinformatics
• Good Algorithms: Accurate (evaluation)
  • THE BIOLOGIST: a physical person
  • A Statistician, i.e., statistical analysis
    Surprising or unexpected “events” are related to “biologically useful” information
    Example: BLAST, transcription factor binding sites
  • Benchmark data sets
    They offer solutions, validated by experts, that one can compare against
    Examples: CASP, DREAM, MSA
    NOT AVAILABLE IN MANY CRUCIAL DOMAINS
Discrete Algorithms and Bioinformatics
• Good Algorithms
  The statistician: care must be exercised… (ahi ahi ahi, no Alpitour)
  Towards sound epistemological foundations of statistical methods for high-dimensional biology, Mehta et al., Nature Genetics 2004
  • Exponential growth of statistical methods for microarray analysis
  • For many of them, it is unclear what they do and why they are needed: they are defined as Questionable
Discrete Algorithms and Bioinformatics
- Good Algorithms
• Time
• Space
Let’s Take a Global Look:
• Processor Power (MIPS)
• External Disk Capacity (MB)
• Sequencing Capacity (kb per day)
• Transmission costs are not counted
Discrete Algorithms and Bioinformatics
- On the future of genomic data [Kahn11] and good algorithms (time, space)
- A “Meteorological map on the Data Flood”
[Figure: map divided into the periods [96-02], [02,06], [06,08], [08, -]]
Discrete Algorithms and Bioinformatics
• Questions:
1. How long does it take for a “foundational advance” in algorithmic theory to be perceived as such in bioinformatics and be applied, as a proof of principle or as the base for a tool?
2. Is such a delay related to the “meteorological map” outlined earlier?
Algorithmic Theory and Bio Impact
- Four small case studies:
  - Suffix trees in Computational Biology
  - Data Compression of biological sequences
  - Genome scale sequence alignment
  - Compressive Genomics
Suffix Trees and Comp. Bio.
• A brief history:
  • Weiner 75
  • McCreight 76
  • Manber and Myers 93: suffix arrays
  • Ukkonen 95
  • Gusfield 97: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press
  • Gusfield and Stoye 98
  • Etc., etc.
Suffix Trees and Comp. Bio.
• Compressed suffix arrays and Self-Indexes
  • Ferragina and Manzini 2000
  • Grossi and Vitter 2000
Proof of Principle in Comp. Biology: index with a 2G footprint for the Human Genome
  • Sadakane and Shibuya 2001
  • Lippert 2002
Suffix Trees in Comp. Bio.
• Compressed suffix arrays and Self-Indexes
  • Ferragina and Navarro 2005
    • The Pizza&Chili corpus: highly tuned collections of implementations ready for download and use
  • Välimäki et al. 2007
    • Experimental study of CSA as a genome-scale sequence analysis tool
Suffix Trees and Comp. Bio.
• Compressed suffix arrays and Self-Indexes
  • Vyverman et al. 2012: prospects and limitations of full-text indexes in genome analysis
• Essential for:
  • Read mapping, e.g. Bowtie
  • Short read error correction
  • Genome assembly
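To make the indexing idea concrete, here is a minimal sketch of a plain (uncompressed) suffix array with binary-search pattern lookup. It is not a compressed index and not what Bowtie implements (Bowtie relies on a BWT-based FM-index); the function names and the toy sequence are assumptions made for illustration, and the naive construction by sorting suffixes is only sensible for short strings.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order.

    Naive construction: sorts the suffixes directly (illustration only).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text, via two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


genome = "ACGTACGTGACG"  # hypothetical toy sequence
sa = build_suffix_array(genome)
print(find_occurrences(genome, sa, "ACG"))  # [0, 4, 9]
```

All occurrences of a pattern form a contiguous block of the suffix array, which is why two binary searches suffice; compressed suffix arrays and self-indexes keep this query style while shrinking the space.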
Genome scale alignments
- MUMmer 1 and 2: Delcher et al. 1999, 2002
- LAGAN and Multi-LAGAN: Brudno et al. 2003
- Suffix trees: Weiner 75, McCreight 76, Manber and Myers 93, Ukkonen 95
- Sparse Dynamic Programming: H77, HS77, AG87, EGGI92
Data Compression
• Data Compression in Computational Biology, Giancarlo, Scaturro, Utro, 2009
• Compressive Sequence Analysis, Giancarlo, Rombo, Utro, 2014
• General compression: rich history, 1948…
Data Compression
• Compression of biological sequences, Grumbach and Tahi 1993
• Period 1993-2007: “only” 17 new methods specialized to biological sequences
• Period 2008-2013: 36… and counting new methods specialized to NGS data and large genomic sequence collections. A couple of fundamentally new ideas are present: a problem to be studied
Compressive genomics
•In a nutshell:
• Algorithm A solves problem P on input x = AAAAAAAAAAACCCCCCCCGGGGGG
• Algorithm A’ solves problem P on input x’ = (A,11); (C,8); (G,6)
OUTPUT IS THE SAME
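The idea can be sketched on a deliberately simple problem P, counting how often each symbol occurs. In the sketch below, run-length encoding stands in for the compressor, and the function names are assumptions chosen for illustration; the compressed-domain algorithm never expands x’ back into x.

```python
from collections import Counter
from itertools import groupby


def run_length_encode(seq):
    """Compress seq into a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(seq)]


def count_symbols_plain(seq):
    """Algorithm A: solve P (symbol counts) by scanning every character of x."""
    return Counter(seq)


def count_symbols_compressed(runs):
    """Algorithm A': solve P directly on x', touching one pair per run."""
    counts = Counter()
    for symbol, length in runs:
        counts[symbol] += length
    return counts


x = "A" * 11 + "C" * 8 + "G" * 6
x_prime = run_length_encode(x)
print(x_prime)  # [('A', 11), ('C', 8), ('G', 6)]
print(count_symbols_plain(x) == count_symbols_compressed(x_prime))  # True
```

Here A scans 25 characters while A’ processes 3 pairs; the gain of a compressive algorithm grows with the redundancy of the input.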
Compressive Genomics
• Protein Database BLAST Searches on a compressed Database, Berger et al. 2012, 2013
• Compressed Indexing and DNA Local Alignment, Lam et al., 2008
• String Matching over compressed text, Amir et al. 1994
• A sub-quadratic sequence alignment algorithm over compressed text, Crochemore et al. 2003
Discrete Algorithms and Bioinformatics
-A data deluge…ehm, universal
-Remedies Part 1: Historia Magistra Vitae
A. Apostolico and M. Crochemore, String pattern matching for a deluge survival kit, 2002
B. Berger, J. Peng, M. Singh, Computational Solutions for omic data, 2013
Discrete Algorithms and Bioinformatics
-A data deluge…ehm, universal
- Remedies Part 2: algorithmic foundational work to break the Big Data Wall!!!
Discrete Algorithms
- New algorithmic design paradigms
• External Memory algorithms: input data reside on disk and are too big to fit in memory
  • Aggarwal and Vitter 1988
• An area that has reached full maturity; Comp. Bio. may be reasonably happy with it.
  • Recoil, Yanovsky 2011: Compression of embarrassingly large DNA sequence collections
  • Bauer et al. 2012: Lightweight LCP construction for Next Generation Sequencing Datasets
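As a rough illustration of the external-memory style, the sketch below sorts a collection of reads while holding only a bounded number of them in memory: sorted runs are written to temporary files and then merged. It is a textbook external merge sort, not the method of any paper cited above; the function name and the `chunk_size` parameter are assumptions, and reads are assumed to be printable strings without newlines.

```python
import heapq
import os
import tempfile


def external_sort(lines, chunk_size):
    """Yield the input strings in sorted order, buffering at most chunk_size at a time."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to disk as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as out:
            out.writelines(line + "\n" for line in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Merge phase: only one line per run is held in memory at any time.
    files = [open(path) for path in run_paths]
    try:
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for handle in files:
            handle.close()
        for path in run_paths:
            os.remove(path)


reads = ["GATTACA", "ACGT", "TTAGG", "CCGA", "ACGA", "GGC", "TAC"]  # toy data
print(list(external_sort(reads, chunk_size=3)))
```

In the I/O model of Aggarwal and Vitter, the cost of such an algorithm is measured in block transfers between disk and memory rather than in CPU steps.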
Discrete Algorithms and Bioinformatics
-New algorithmic design paradigms
- Algorithms on Data Streams: the volume of data is so large that one cannot even store it
  • Data is produced “in a stream” and cannot be stored in memory
  • M. Henzinger, P. Raghavan, S. Rajagopalan 1999
  • Probably not very good for Comp. Bio.
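For readers unfamiliar with the streaming model, a classic one-pass example is reservoir sampling: it keeps a uniform random sample of k items from a stream of unknown length using memory proportional to k only. This is a generic illustration, not taken from the paper cited above; the function name and the `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from stream, in a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n, evicting a random slot.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of read identifiers, consumed once and never stored in full.
stream = (f"read_{i}" for i in range(100000))
sample = reservoir_sample(stream, k=5, rng=random.Random(0))
print(len(sample))  # 5
```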
Discrete Algorithms
- New algorithmic design paradigms
- Succinct data structures: storing data in small space
  - G.J. Jacobson, 1988
  - Promising for Comp. Bio.
  • Full Text Self-Indexes
  • Bloom Filters: Pell et al., 2012: 40-fold reduction in memory requirement for metagenome assembly
  • Bloom Filters were invented in 1970
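A minimal sketch of a Bloom filter storing the k-mers of a read may help here. It is not the data structure of Pell et al.; the class, the sizes, and the SHA-256-based hashing scheme are assumptions chosen for readability rather than speed. Membership queries can return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Approximate set membership in num_bits bits, using num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


read = "ACGTACGTGACG"  # hypothetical toy read
bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in kmers(read, 4):
    bf.add(kmer)
print("ACGT" in bf)  # True: inserted items are always reported as present
```

The filter above occupies 128 bytes regardless of how long the k-mers are, which is the kind of trade (a small false-positive rate for a large memory saving) that the slide refers to.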
Discrete Algorithms
-New algorithmic design paradigms
- Synopsis Data Structures: only a “relevant summary” of the data is kept
  - Gibbons and Matias, 1998
  - No use yet in Comp. Bio., but very promising because of its success in Database System design: Iceberg Queries
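To give a flavour of a summary that can answer an iceberg-style query (which items occur more often than a threshold?), here is a sketch of the Misra-Gries frequent-items summary. It is a stand-in chosen for brevity, not the specific structures of Gibbons and Matias; the function name and the toy k-mer stream are assumptions.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters (k >= 2).

    Guarantee: every item occurring more than n / k times in a stream of
    length n is among the returned keys. Counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream of 100 k-mers: one dominant k-mer, one frequent, many singletons.
kmer_stream = ["ACG"] * 60 + ["TTT"] * 25 + [f"x{i}" for i in range(15)]
summary = misra_gries(kmer_stream, k=3)
print("ACG" in summary)  # True: ACG occurs more than 100 / 3 times
```

Surviving keys are candidates only; a second pass over the data, when available, can confirm their exact frequencies.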
Discrete Algorithms
-New algorithmic design paradigms
- Approximation algorithms: well known for hard problems, e.g. TSP, genome assembly
  - New: use them for “resource bounded” problems in order to obtain approximations with guaranteed performance
  - Already in use in Comp. Bio. WITHOUT the performance guarantee part…
Conclusions
- Since the late 80’s, a solid bridge has been built between Algorithmic Research and Bioinformatics and Comp. Bio.
• Algorithmic Research seems to be asking the right questions in foundational terms for “BIG DATA”
- Biology is a privileged testbed, with a turning point in attention around 1997
• The claim that algorithmic research “does not listen to Comp. Bio. needs” is a false urban legend
• Having fun learning about algorithmic theory? We do, learning about biology!!!