
The Tree of Life: Challenges for Discrete Mathematics and Theoretical Computer Science

Fred S. Roberts
DIMACS
Rutgers University


The tree of life problem raises new challenges for mathematics and computer science just as it does for biological science.

For math and CS to become more effectively utilized, we need to:

• develop new tools;

• establish working partnerships between mathematical scientists and biological scientists;

• introduce the two communities to each other’s problems, language, and tools;

• introduce outstanding junior researchers from both sides to the issues, problems, and challenges arising from the tree of life;

• involve biological and mathematical scientists together to define the agenda and develop the tools of this field.


These are some of the motivations for this meeting.

I will lay out some of the challenges for math and CS, with emphasis on discrete math and theoretical CS.

What are DM and TCS?

• DM deals with:
– arrangements
– designs
– codes
– patterns
– schedules
– assignments

• TCS deals with the theory of computer algorithms.

• During the first 30-40 years of the computer age, TCS, aided by powerful mathematical methods, had a direct impact on technology, by developing models, data structures, algorithms, and lower bounds that are now at the core of computing.

DM and TCS have found extensive use in many areas of science and public policy, for example in molecular biology.

These tools seem especially relevant to problems of the tree of life.

DM and TCS Continued

• These tools are made especially relevant to the tree of life problem because of:
– Geographic Information Systems
– Availability of large and disparate computerized databases on subjects relating to species and the relevance of modern methods of data mining.

Outline

• Phylogenetic Tree Reconstruction
• Database Issues
• Nomenclature
• Setting up a Species Bank
• Digitization of Natural History Collections
• Interoperability
• The Many Applications of Research on the Tree of Life


Phylogenetic Tree Reconstruction

Phylogeny (continued)

• New methods of phylogenetic tree reconstruction owe a significant amount to modern methods of DM/TCS.

• Trees, supertrees, and consensus trees will all be discussed at length in this meeting.

• I will only make a few brief remarks about them.


Phylogenetic Challenges for DM/TCS

•Tailoring phylogenetic methods to describe the idiosyncrasies of viral evolution -- going beyond a binary tree with a small number of contemporaneous species appearing as leaves.

•Dealing with trees of thousands of vertices, many of high degree.

•Making use of data about species at internal vertices (e.g., when data comes from serial sampling of patients).
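As a rough illustration of the data structure these challenges imply, the sketch below stores a rooted tree in which a vertex may have any number of children and in which internal vertices can carry sampled data. All names here (`Vertex`, `sample`, the example labels and sequences) are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vertex:
    """A vertex of a rooted tree; any number of children is allowed."""
    label: str
    sample: Optional[str] = None   # observed data, if this vertex was sampled
    children: List["Vertex"] = field(default_factory=list)


def max_out_degree(root: Vertex) -> int:
    """Largest number of children of any vertex (iterative, so deep trees are safe)."""
    best, stack = 0, [root]
    while stack:
        v = stack.pop()
        best = max(best, len(v.children))
        stack.extend(v.children)
    return best


def sampled_internal(root: Vertex) -> List[str]:
    """Labels of internal vertices that carry sampled data."""
    found, stack = [], [root]
    while stack:
        v = stack.pop()
        if v.children and v.sample is not None:
            found.append(v.label)
        stack.extend(v.children)
    return sorted(found)


# Hypothetical serial samples from one patient: the week-0 virus is an
# internal vertex with three descendants, so the tree is not binary.
week0 = Vertex("week0", sample="ACGT", children=[
    Vertex("week4a", sample="ACGA"),
    Vertex("week4b", sample="ACTT"),
    Vertex("week4c", sample="GCGT"),
])
root = Vertex("root", children=[week0, Vertex("outgroup", sample="TTGT")])
```

Both traversals use an explicit stack rather than recursion, which matters once trees reach thousands of vertices.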

Phylogenetic Challenges for DM/TCS: Continued

• Network representations of evolutionary history - if recombination has taken place.

• Modeling viral evolution by a collection of trees -- to recognize the “quasispecies” nature of viruses.

• Devising fast methods to average the quantities of interest over all likely trees.

Thanks to Eddie Holmes and Mike Steel for ideas.

DIMACS Working Group on Phylogenetic Trees and Rapidly Evolving Diseases, Sept. 3-6, 2003


Database Issues

• Assembling the tree of life requires collecting massive amounts of data about the world’s scientific species.

• Making it a collaborative project requires making such data universally available.

• There are great challenges for Math and CS, specifically DM and TCS.

Thanks to the Global Biodiversity Information Facility (GBIF) for many of the following ideas.


Complexity of Data

• In many ways, data about the world’s species are far more complex than genetic or protein sequence data. (GBIF)


Complexity of Data (cont’d)

• There are databases of images, databases in numerous forms, etc.

• Data is heterogeneous.

• Data has errors and inconsistencies.

Nomenclature

• There are some 1.75M named species.

• By some estimates, there are up to 10M actual species.


Nomenclature (cont’d)

• The same species is often named more than once.

• On average, each species has two additional names (synonyms) besides its own name. (GBIF)

Nomenclature (cont’d)

• Thus, there is a need to assemble names in an electronic catalogue, with synonyms and common misspellings.

• This would be of fundamental importance in aiding research on biodiversity.
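A minimal sketch of what such a catalogue might look like, assuming a simple in-memory mapping from every known variant to one accepted name. The structure, the function names (`normalize`, `resolve`), and the sample entries are illustrative assumptions, not a description of any existing catalogue.

```python
from typing import Dict, Optional

# Hypothetical catalogue: every known variant (synonym or common
# misspelling) maps to one accepted name. The entries are illustrative only.
ACCEPTED = {"Puma concolor"}
VARIANTS: Dict[str, str] = {
    "felis concolor": "Puma concolor",   # older synonym
    "puma concolour": "Puma concolor",   # common misspelling
}


def normalize(name: str) -> str:
    """Collapse whitespace and case so lookups ignore trivial differences."""
    return " ".join(name.split()).lower()


def resolve(name: str) -> Optional[str]:
    """Return the accepted name for `name`, or None if it is not catalogued."""
    key = normalize(name)
    for accepted in ACCEPTED:
        if normalize(accepted) == key:
            return accepted
    return VARIANTS.get(key)
```

For example, `resolve("Felis  concolor")` returns `"Puma concolor"`, while a name that is not in the catalogue returns `None` and can be routed to a curator or to a similarity search.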


Nomenclature (cont’d)

• Because of errors, one major challenge for TCS is data cleaning.


Nomenclature (cont’d)

• Another challenge is to search a database to see if two entries are similar.

• This is a standard problem in database theory.

• TCS algorithms involving k-nearest neighbor and other methods are very helpful here.
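One way to make the similarity search concrete is a k-nearest-neighbor lookup under edit distance. The sketch below is a brute-force version that compares the query against every name; it is an assumption about how such a search could be set up, and a database of millions of names would need an index rather than a linear scan. The function names are hypothetical.

```python
import heapq
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def k_nearest(query: str, names: List[str], k: int = 3) -> List[Tuple[int, str]]:
    """The k catalogue names closest to `query`, as (distance, name) pairs."""
    scored = ((edit_distance(query.lower(), n.lower()), n) for n in names)
    return heapq.nsmallest(k, scored)
```

Calling `k_nearest("Puma concolour", names, k=1)` on a list that contains `"Puma concolor"` returns that name at distance 1, which flags the two entries as probable variants of each other.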


Setting up a Species Bank

Setting up a Species Bank (cont’d)

• A species bank would provide not only names, but also data about a species:
– Type
– Distribution
– Ecological role
– Phylogenetic history
– Physiology
– Genomics

• This involves issues about huge datasets.

Setting up a Species Bank (cont’d)

• NASA earth science satellites alone beam home image data at the rate of 1.2 terabytes a day.

• By 2010, this is expected to grow to 10 petabytes a day. (Kathleen Bergen, U. Michigan)

Name        Equal to           Size in bytes
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
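The byte counts in the table are successive powers of 1,024, so they can be regenerated rather than memorized. The short sketch below does that and also works out the growth factor implied by the satellite data rates quoted earlier (1.2 terabytes a day growing to 10 petabytes a day). The variable names are illustrative.

```python
# Binary unit names in increasing order; each is 1,024 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

SIZE_IN_BYTES = {name: 1024 ** power for power, name in enumerate(UNITS)}

for name, size in SIZE_IN_BYTES.items():
    print(f"{name:<10} {size:>40,}")

# Growth factor implied by the quoted rates: 1.2 terabytes/day now,
# 10 petabytes/day projected.
now = 1.2 * SIZE_IN_BYTES["terabyte"]
projected = 10 * SIZE_IN_BYTES["petabyte"]
growth = projected / now   # roughly 8,500-fold
```

Under these binary definitions the projected rate is roughly 8,500 times the current one, which is the scale of growth any species bank drawing on such imagery would have to absorb.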

Setting up a Species Bank (cont’d)

• The problem is even worse: We need to combine information from many databases.

• There is no known way to catalogue all species of plants in one place given current database systems techniques. (Jessie Kennedy, Napier University, Edinburgh)


Setting up a Species Bank (cont’d)

• One possible approach: Tree and graph methods to support overlapping classifications as directed acyclic graphs or with complex objects (taxa or specimens) as nodes. (Jessie Kennedy)
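To show why a directed acyclic graph fits overlapping classifications, here is a minimal sketch in which a taxon placed differently by two classifications simply has two parents. The taxon names and the `PARENTS` mapping are invented for illustration; nothing here reflects a specific system described in the talk.

```python
from typing import Dict, List, Set

# Hypothetical data: each taxon lists its parents. A taxon placed
# differently by two classifications simply has two parents, which makes
# the structure a directed acyclic graph rather than a tree.
PARENTS: Dict[str, List[str]] = {
    "species_x": ["genus_A", "genus_B"],   # two competing placements
    "genus_A": ["family_F"],
    "genus_B": ["family_G"],
    "family_F": ["order_O"],
    "family_G": ["order_O"],
    "order_O": [],
}


def ancestors(taxon: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All taxa reachable by following parent links from `taxon`."""
    seen: Set[str] = set()
    stack = list(parents.get(taxon, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen
```

Here `ancestors("species_x", PARENTS)` returns both genera, both families, and the shared order, so a query can see every placement at once instead of being forced to pick one classification.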

Digitizing Natural History Collections

• It has been estimated that there are between 1.5 and 3 billion specimens in the world’s natural history collections, including herbaria, living microorganism stock centers, and other repositories (GBIF).

Digitizing Natural History Collections (cont’d)

• If we could digitize information about these specimens and make it available, we would “have a treasure trove of information about the world’s biota.” (GBIF)

• Pilot projects have shown that utilizing digitized data from several institutions’ databases can be a powerful tool. (GBIF)


Digitizing Natural History Collections (cont’d)

• Challenge: digitization and reference of non-standard data (photos, sonograms, field notes)


Digitizing Natural History Collections (cont’d)

• Challenge: Develop methods for visualizing the data (e.g., species’ distributions)


Digitizing Natural History Collections (cont’d)

• Challenge: Develop search engines for real-time searching of such extremely large data sets.
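One standard building block for fast text search is an inverted index, which maps each word to the records that contain it so that a query touches only the relevant entries. The sketch below is a toy in-memory version under that assumption; the record texts and function names are hypothetical, and real-time search over billions of specimens would need a distributed, disk-backed index.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(records: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the ids of the records that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for rec_id, text in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Ids of records containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical specimen records, for illustration only.
RECORDS = {
    1: "spider collected on banana plant Ecuador",
    2: "beetle collected on banana plant Costa Rica",
    3: "spider collected in cave Brazil",
}
INDEX = build_index(RECORDS)
```

With this index, `search(INDEX, "spider banana")` returns only record 1, without scanning the text of the other records.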


Digitizing Natural History Collections (cont’d)

• Challenge: Make information access on the web more knowledge-based so humans and intelligent software can work together. (Susan Gauch, U. Kansas)

Digitizing Natural History Collections (cont’d)

• Challenge: Use “intelligent agents” to organize and present relevant information on the web. (Susan Gauch)


Digitizing Natural History Collections (cont’d)

• Challenge: Use partial information as “training data” for classification algorithms (Susan Gauch)

• One approach: Use training data and classification algorithms with learning capabilities.

(See: DIMACS project on Monitoring Message Streams)

Digitizing Natural History Collections (cont’d)

• Another approach to problems posed by digitization: Use tools of “knowledge inferencing” (Yannis Ioannidis, University of Wisconsin)

• Still another approach: Use methods of spatio-temporal data mining (Ioannidis; see work of Muthukrishnan at Rutgers)

Interoperability

• Goal: Devise standards for datasets so as to allow researchers to collaborate across datasets – develop standards leading to database interoperability. (GBIF)

Interoperability

• Challenge: How do we develop ways to more accurately represent observational or experimental data so that others may use them? (Jessie Kennedy)

• Challenge: Deal with issues of inconsistency and scalability.

• Challenge: Formalize issues of policy with regard to others’ databases.

• Challenge: Interoperability over a diversity of users and types of equipment.


Interoperability

• One approach: “Semantic Web” – the idea used to express the growing desire to make information access on the Web more knowledge-based so humans and intelligent software can work together. (Susan Gauch)


Interoperability

• Another approach: Make use of languages such as XML developed to aid interoperability in business and military collaborations.
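As a small illustration of how XML can carry a record between systems, the sketch below writes one specimen record to XML and reads it back using Python's standard `xml.etree.ElementTree` module. The element names (`specimen`, `scientificName`, `collector`, `country`) and the sample values are invented for this example and are not taken from any published standard.

```python
import xml.etree.ElementTree as ET


def specimen_to_xml(name: str, collector: str, country: str) -> str:
    """Serialize one specimen record as an XML string."""
    root = ET.Element("specimen")
    ET.SubElement(root, "scientificName").text = name
    ET.SubElement(root, "collector").text = collector
    ET.SubElement(root, "country").text = country
    return ET.tostring(root, encoding="unicode")


def xml_to_specimen(document: str) -> dict:
    """Parse the XML produced above back into a plain dictionary."""
    root = ET.fromstring(document)
    return {child.tag: child.text for child in root}


# Hypothetical record, round-tripped through XML.
doc = specimen_to_xml("Puma concolor", "A. Collector", "Ecuador")
record = xml_to_specimen(doc)
```

The round trip only works because both sides agree on the element names, which is exactly the kind of shared standard the interoperability goal calls for.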

The Many Applications of Research on the Tree of Life

• Side benefits in many fields:
– Agriculture
– Biomedicine
– Biotechnology
– Natural resource management
– Pest control
– Control of emergent diseases
– Sustainable use of biodiversity resources
– Global climate change

The Many Applications of Research on the Tree of Life

• Let’s say you’re importing bananas from South America.


The Many Applications of Research on the Tree of Life

• A camera in the hold of the ship sees a spider.

• What kind of spider is it?

• Is it safe to unload your cargo of bananas?


The Many Applications of Research on the Tree of Life

• Luckily, you have a digitized natural history database.

• With an efficient search feature.

(Thanks to Diana Lipscomb for this example)


The Many Applications of Research on the Tree of Life