Leveraging semantic metadata for ecological data discovery and integration for analysis and modeling

Matthew B. Jones
Mark P. Schildhauer

with contributions from Shawn Bowers, Chad Berkley, Dan Higgins, Rich Williams, Deana Pennington, Jing Tao, and others

National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara

http://seek.ecoinformatics.org

SWDB Aug 29, 2004

August, 2005

Topical Outline

• Case study: Predicting species distributions under climate change with ecological niche modeling

• Challenges presented by Ecological Niche Modeling

• Scientific Workflows and the Kepler system

• Improving scientific workflows with semantics

• Future work

Ecological Niche Modeling

• Correlate current bioclimatic and topographic data with observed species distributions to develop a prediction algorithm

• Use the prediction algorithm to project species distributions under 5 IPCC climate change scenarios

• Can use the Genetic Algorithm for Rule-set Production (GARP) and other algorithms

• Example: oyamel fir prediction (Oberhauser and Peterson 2005)
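
The two modeling steps (fit against current conditions, then project onto a changed climate) can be sketched with a deliberately simple stand-in for GARP: a rectangular bioclimatic envelope. This is not GARP and not the authors' workflow. The function names, the toy values, and the uniform +2 °C temperature shift are all assumptions made for illustration.

```python
def fit_envelope(env_at_presences):
    """Return per-variable (min, max) bounds observed at presence sites."""
    columns = list(zip(*env_at_presences))
    return [(min(col), max(col)) for col in columns]


def predict_presence(bounds, env_grid):
    """Mark a cell suitable only if every variable lies inside the envelope."""
    return [
        all(lo <= value <= hi for value, (lo, hi) in zip(cell, bounds))
        for cell in env_grid
    ]


# Toy data (assumed): each tuple is (mean temperature in deg C, annual precipitation in mm).
presences = [(10.0, 900.0), (12.0, 1100.0), (11.0, 1000.0)]
current_grid = [(9.0, 950.0), (11.0, 1000.0), (12.0, 1050.0), (14.0, 1200.0)]

# Step 1: correlate observed presences with current environmental conditions.
bounds = fit_envelope(presences)
current_prediction = predict_presence(bounds, current_grid)

# Step 2: project the fitted model onto a hypothetical scenario (+2 deg C everywhere).
scenario_grid = [(temp + 2.0, precip) for temp, precip in current_grid]
scenario_prediction = predict_presence(bounds, scenario_grid)

print(current_prediction)
print(scenario_prediction)
```

In this toy grid the set of suitable cells changes once temperatures rise, which is the kind of range change the projection step is meant to reveal.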

Informatics Challenges

• Data discovery, access, and archiving

• Data integration

• Compute cycles

• Alternative model testing

• Model complexity

Data Discovery, Access, and Archiving

• Niche modeling requires many different data sources
  – Species observation data
  – Environmental data
    • Precipitation, land use, LAI, topography, etc.
  – Climate change data
    • IPCC Climate Change scenarios

• Currently, these types of data are either completely inaccessible or accessible only with significant manual effort to locate and retrieve them from multiple independent providers

• Need to archive selected model results and data
  – Typically handled on an ad hoc basis, without documentation or archiving plans

• For niche modeling, outputs are diverse:
  – GA rule sets
  – Predicted distributions under current and altered conditions
  – Maps of these distributions

Data Integration

• To be usable, data must be normalized and integrated into a common frame of reference

• For niche modeling, that includes finding an optimal extent, resolution, and projection for all data types

• Currently, custom scripts or applications are used for such transformations
  – Extremely time consuming
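
A minimal sketch of what such a transformation script has to do, covering only two of the three concerns named above (resolution and extent). It assumes both layers already share a projection and have aligned cell edges. The `Raster` class and all function names are hypothetical; a real workflow would rely on a GIS library rather than this toy code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Raster:
    x_min: float      # western edge
    y_min: float      # southern edge
    cell_size: float  # cell width and height, in map units
    rows: list        # rows[0] is the southernmost row (an assumption of this sketch)

    @property
    def extent(self):
        """(x_min, y_min, x_max, y_max) covered by the grid."""
        n_rows, n_cols = len(self.rows), len(self.rows[0])
        return (self.x_min, self.y_min,
                self.x_min + n_cols * self.cell_size,
                self.y_min + n_rows * self.cell_size)


def aggregate(raster, factor):
    """Coarsen resolution by averaging factor x factor blocks of cells."""
    n_rows, n_cols = len(raster.rows), len(raster.rows[0])
    coarse = [
        [mean(raster.rows[r + i][c + j] for i in range(factor) for j in range(factor))
         for c in range(0, n_cols - factor + 1, factor)]
        for r in range(0, n_rows - factor + 1, factor)
    ]
    return Raster(raster.x_min, raster.y_min, raster.cell_size * factor, coarse)


def common_extent(a, b):
    """Intersection of the two extents."""
    ax0, ay0, ax1, ay1 = a.extent
    bx0, by0, bx1, by1 = b.extent
    return (max(ax0, bx0), max(ay0, by0), min(ax1, bx1), min(ay1, by1))


def crop(raster, extent):
    """Keep only the cells inside the extent (assumes aligned cell edges)."""
    x0, y0, x1, y1 = extent
    c0 = round((x0 - raster.x_min) / raster.cell_size)
    c1 = round((x1 - raster.x_min) / raster.cell_size)
    r0 = round((y0 - raster.y_min) / raster.cell_size)
    r1 = round((y1 - raster.y_min) / raster.cell_size)
    return Raster(x0, y0, raster.cell_size, [row[c0:c1] for row in raster.rows[r0:r1]])


# Toy layers: a fine 4x4 elevation grid and a coarser, offset 2x2 precipitation grid.
elevation = Raster(0.0, 0.0, 1.0, [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]])
precipitation = Raster(2.0, 0.0, 2.0, [[100, 200], [300, 400]])

elevation_coarse = aggregate(elevation, 2)  # now 2x2 cells of size 2
extent = common_extent(elevation_coarse, precipitation)
elevation_aligned = crop(elevation_coarse, extent)
precipitation_aligned = crop(precipitation, extent)
```

After these steps both layers cover the same area at the same cell size, so they can be compared cell by cell. Even this toy version needs several functions per layer pair, which hints at why hand-written integration scripts are so time consuming.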

Compute cycles

• Ecological modeling problems are typically computation-limited

• For niche modeling, researchers want to examine, for example, the predicted distribution of all mammals of the Western Hemisphere under current conditions and 5 IPCC climate change scenarios

  (200 to 500 runs per species) × (~2000 mammal species) × (3 minutes/run) = 833 to 2083 days
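
The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the runs execute one after another on a single processor; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def serial_days(runs_per_species, n_species=2000, minutes_per_run=3):
    """Total compute time in days if every run executes one after another."""
    return runs_per_species * n_species * minutes_per_run / MINUTES_PER_DAY


low_estimate = serial_days(200)   # 1,200,000 minutes
high_estimate = serial_days(500)  # 3,000,000 minutes
print(f"{low_estimate:.0f} to {high_estimate:.0f} days")
```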

Testing alternative models

• Researchers want to ‘tweak’ models
  – Explore alternative algorithms
  – Modify parameterization

• Iterate over many combinations of these

• Final results and intermediate versions of models need to be saved and versioned

• For niche modeling, the following algorithms are commonly used by researchers:

• DA, discriminant analysis

• BM, Bayesian Model

• BP, bioclimatic profiles

• CART, classification and regression trees

• GAM, generalized additive models

• GLM, generalized linear models

• GARP, genetic algorithm for rule-set production

• MD, Mahalanobis distance method

• NNETW, neural networks

• SI, spatial interpolation

From: Segurado and Araujo. 2004. An evaluation of methods for modelling species distributions. Journal of Biogeography 31, 1555–1568.
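
Iterating over combinations of algorithms and parameter settings, while keeping each variant identifiable, might look like the sketch below. The algorithm labels come from the list above. The parameter names, their values, and the hashing scheme are assumptions chosen for illustration, not settings of any real niche modeling package.

```python
import hashlib
import itertools
import json

# Hypothetical settings for illustration only.
algorithms = ["GARP", "GLM", "GAM"]
parameter_grid = {
    "training_fraction": [0.5, 0.7],
    "n_replicates": [10, 100],
}


def run_id(config):
    """Stable identifier derived from the configuration, usable as a version key."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def enumerate_runs(algorithms, parameter_grid):
    """Yield one configuration per algorithm and parameter combination."""
    names = sorted(parameter_grid)
    for algorithm in algorithms:
        for values in itertools.product(*(parameter_grid[n] for n in names)):
            yield {"algorithm": algorithm, "parameters": dict(zip(names, values))}


# Store every configuration under its identifier so that any saved result can be
# traced back to the exact settings that produced it.
registry = {run_id(config): config for config in enumerate_runs(algorithms, parameter_grid)}
print(len(registry), "model variants")
```

Because the identifier depends only on the configuration, rerunning the same variant yields the same key, which makes intermediate and final versions easy to match up.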

Model complexity

• Models and analyses typically consist of hundreds of analytical processing steps
  – Understanding the model becomes very difficult
    • “Spaghetti code” is common
  – Only experts can modify or review the model
  – Complexity increases the chance of undetected errors

Current approaches to ENM

• Evolving, but typically these are custom simulation models
  – e.g., GARP

• Tend towards monolithic applications that handle everything in one place (data ingestion, transformation, model execution, output management, statistical analysis)
  – These models are typically difficult to extend, modify, or understand, and require specialized expertise to use
  – This is typical of many models in ecology, and is largely due to the difficulty of managing complexity in modern programming languages

Alternative approach: scientific workflows

• What are scientific workflows?
  – Graphical model of data flow among processing steps
  – Inputs and outputs of components are precisely defined
  – Components are modular and reusable
  – Flow of data controlled by a separate execution model
  – Support for hierarchical models

[Figure: workflow graph of connected components (A, A’, B, C, D, E, F), with a Source (e.g., data), a Processor (e.g., regression), and a Sink (e.g., display)]
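
The source, processor, sink pattern can be illustrated in a few lines of plain Python. This is not Kepler's API. The `Component` class, `run_workflow`, and the toy data are hypothetical, and the sketch only shows two of the listed ideas: modular components, and an execution model kept separate from them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Component:
    """A reusable processing step with declared upstream inputs and one output."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)  # names of the upstream components


def run_workflow(components):
    """Execution model, separate from the components: run them in dependency order."""
    by_name = {c.name: c for c in components}
    graph = {c.name: set(c.inputs) for c in components}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        component = by_name[name]
        args = [results[upstream] for upstream in component.inputs]
        results[name] = component.func(*args)
    return results


def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


displayed = []  # stands in for a display

workflow = [
    Component("source", lambda: [(0, 1), (1, 3), (2, 5)]),           # e.g., data
    Component("processor", least_squares_slope, inputs=["source"]),  # e.g., regression
    Component("sink", displayed.append, inputs=["processor"]),       # e.g., display
]
results = run_workflow(workflow)
print(displayed)
```

Swapping the processor for a different analysis only requires replacing one entry in the list; the source, the sink, and the executor stay untouched.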

Kepler Scientific Workflow System

• Software to design and execute scientific workflows
  – Variety of analytical components (including spatial data transformations)
  – Support for R scripts and Matlab scripts
  – Real-time data access to sensor networks
  – Cross-project collaboration
    • SEEK, SciDAC, GEON, Ptolemy, RoadNet, EOL, Resurgence

• EcoGrid access to heterogeneous environmental data
  – EML Data support
    • Experimental data, survey data, spatial raster and vector data, etc.
  – DarwinCore Data support
    • Museum collections
  – GeoSciences Network (GEON) Data Support

• Demonstration workflows from many domains
  – Ecology: Ecological Niche Modeling
  – Genomics: Promoter Identification Workflow
  – Geology: Geologic Map Information Integration
  – Oceanography: Real-time Revelle example of data access

A simple Kepler workflow

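Data source from EcoGrid (metadata-driven ingestion)

R processing script:

    res <- lm(BARO ~ T_AIR)   # fit a linear regression of BARO on T_AIR
    res                       # print the fitted model
    plot(T_AIR, BARO)         # scatter plot of the two variables
    abline(res)               # overlay the fitted regression line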

Ecological Niche Model in Kepler

Have scientific workflows improved ENM?

• Data discovery, access, and archiving
  – Direct access to data archives from natural history collections, ecology, and geology
  – Ability to archive outputs back into data storage systems

• Data integration
  – Directly handled by specialized components in the workflow

• Compute cycles
  – Current and growing grid computing support in Kepler increases runtime efficiency

• Alternative model testing
  – Decouples components and allows simple modifications of the model
  – Workflows act as a full description of an executed process
    • Can be saved to the same repositories as data, allowing for complete replication of the model results

• Model complexity
  – Visual display that documents and elucidates the model
  – Hierarchical modeling allows abstraction at higher levels

Semantics in scientific workflows

• Components and their ports typically have:
  – Explicit ‘structural type’
    • e.g., int, float, string, {double}
  – Implicit semantic type
    • It is unclear whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values

[Figure: components A and B connected through ports, each labeled with a structural type (int, string) and a semantic type (rainfall, bodysize)]
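
The gap between the two kinds of type can be shown with a small sketch in which two ports agree structurally but not semantically. The `Port` class and both check functions are hypothetical; the type labels are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str  # e.g., "int", "float", "string"
    semantic_type: str    # e.g., "rainfall", "bodysize"


def structurally_compatible(output_port, input_port):
    """The check a conventional workflow system can already make."""
    return output_port.structural_type == input_port.structural_type


def semantically_compatible(output_port, input_port):
    """The check that becomes possible once semantic types are made explicit."""
    return output_port.semantic_type == input_port.semantic_type


rain_out = Port("A.out", "int", "rainfall")
size_in = Port("B.in", "int", "bodysize")

print(structurally_compatible(rain_out, size_in))
print(semantically_compatible(rain_out, size_in))
```

A purely structural check accepts this connection because both ports carry integers, even though feeding rainfall values into a body-size input is meaningless.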

Ecological ontologies

• Model of knowledge in a domain like ecology or biodiversity
  – What was measured (e.g., biomass)
  – Type of measurement (e.g., Energy)
  – Context of measurement (e.g., Psychotria limonensis)
  – How it was measured (e.g., dry weight)

Knowledge Representation

Current SEEK Ontologies
  – Ecological Concepts, Models, Networks
  – Measurements
  – Properties
  – Statistical Analyses
  – Time and Space
  – Taxonomic Identifiers
  – Units
  – Symbiosis

Recent Developments
  – Biodiversity (measured traits, computation of traits)
  – Descriptive Terminology for Plant Communities
  – Analytical components
  – Ontology documentation

Future Goals
  – “Fill in” existing concepts, evolve the ontology framework
  – More domains …

Semantic Annotation

• Label data with semantic types

• Label inputs and outputs of analytical components with semantic types

[Figure: diagram with three elements labeled Data, Ontology, and Workflow Components]

Annotating a Component

Semantic workflow validation

• Check if an existing workflow is semantically valid
  – All connected ports have compatible semantic types
  – All ports that are required are connected
  – Visually indicate status with red links for invalid connections
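
A minimal sketch of these two validity checks. The slide does not define "compatible", so the sketch assumes that an output is compatible with an input when its semantic type is the same as, or a more specific kind of, the type the input expects. The toy is-a hierarchy, the port names, and the `validate` function are all hypothetical and are not drawn from the SEEK ontologies or from Kepler.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
IS_A = {
    "rainfall": "precipitation",
    "precipitation": "environmental_measurement",
    "bodysize": "organism_trait",
}


def is_subconcept(concept, ancestor):
    """True if concept equals ancestor or reaches it by following is-a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def validate(links, output_types, input_types, required_inputs):
    """Return (invalid_links, unconnected_required_inputs).

    links: list of (output_port, input_port) names.
    output_types / input_types: port name -> semantic type.
    """
    invalid = [
        (out, inp) for out, inp in links
        if not is_subconcept(output_types[out], input_types[inp])
    ]
    connected = {inp for _, inp in links}
    missing = sorted(set(required_inputs) - connected)
    return invalid, missing


output_types = {"gauge.out": "rainfall", "survey.out": "bodysize"}
input_types = {
    "model.climate": "precipitation",
    "model.trait": "organism_trait",
    "plot.values": "precipitation",
}
links = [
    ("gauge.out", "model.climate"),  # rainfall is a kind of precipitation: valid
    ("survey.out", "plot.values"),   # bodysize is not precipitation: invalid
]
invalid, missing = validate(
    links, output_types, input_types,
    required_inputs=["model.climate", "model.trait"],
)
print("invalid links:", invalid)
print("unconnected required inputs:", missing)
```

The entries in `invalid` correspond to the links a graphical editor would draw in red, and `missing` lists the required ports that still need a connection.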

Searching with Semantics

In summary…

• Typical analytical models are complex and difficult to comprehend and maintain

• Scientific workflows provide an intuitive way to introduce structure and efficiency to the modeling and analysis process

• Adding semantic tools to workflow design and execution also increases usability of the workflow tool

• Kepler is an evolving but effective tool for scientists
  – http://kepler-project.org

Current and future work

• Knowledge Representation
  – Better match between ontology and scientist’s mental model
  – Refined ontologies for biodiversity and niche modeling
  – Refined supporting ontologies (e.g., space & time)

• Kepler
  – Semantically-driven data integration
  – Workflow composition and transformation
  – Ontology-directed workflow design
  – Final niche modeling workflow completed

Acknowledgements

This material is based upon work supported by:

The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.

Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

The Andrew W. Mellon Foundation.

Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence