Leveraging semantic metadata for ecological data discovery and integration for analysis and modeling
Matthew B. Jones and Mark P. Schildhauer
with contributions from Shawn Bowers, Chad Berkley, Dan Higgins, Rich Williams, Deana Pennington, Jing Tao, and others
National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara
http://seek.ecoinformatics.org
SWDB Aug 29, 2004
August, 2005
Topical Outline
• Case study: Predicting species distributions under climate change with ecological niche modeling
• Challenges presented by Ecological Niche Modeling
• Scientific Workflows and the Kepler system
• Improving scientific workflows with semantics
• Future work
Ecological Niche modeling
• Correlate current bioclimatic and topographic data with observed species distributions to develop a prediction algorithm
• Project the predicted species distribution by applying the prediction algorithm to 5 IPCC climate change scenarios
• Can use the Genetic Algorithm for Rule-set Production (GARP) and other algorithms
• oyamel fir prediction (Oberhauser and Peterson 2005)
Informatics Challenges
• Data discovery, access, and archiving
• Data integration
• Compute cycles
• Alternative model testing
• Model complexity
Data Discovery, Access, and Archiving
• Niche modeling requires many different data sources
– Species observation data
– Environmental data
  • Precipitation, land use, LAI, topography, etc.
– Climate change data
  • IPCC Climate Change scenarios
• Currently, these types of data are either completely inaccessible or accessible only with significant manual effort to locate and retrieve them from multiple independent providers
• Need to archive selected model results and data
– Typically, these are handled on an ad hoc basis, without documentation or archiving plans
• For niche modeling, outputs are diverse:
– GA rule sets
– Predicted distributions under current and altered conditions
– Maps of these distributions
Data Integration
• To utilize data, need to normalize and integrate to a common frame of reference
• For niche modeling, that includes finding an optimal extent, resolution, and projection for all data types
• Currently, custom scripts or applications are used for such transformations
– Extremely time consuming
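The idea of a common frame of reference can be illustrated with a minimal sketch. It assumes every layer is already in the same projection, and it picks the intersection of the bounding boxes and the coarsest cell size as one plausible choice of "optimal" extent and resolution. The `Layer` class, the layer names, and the coordinates are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical description of one raster layer (same projection assumed)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    cell_size: float  # map units per cell


def common_frame(layers):
    """Return (extent, cell_size) shared by all layers.

    The extent is the intersection of the bounding boxes; the cell size is
    the coarsest one, so no layer is interpolated to finer detail than it has.
    """
    xmin = max(layer.xmin for layer in layers)
    ymin = max(layer.ymin for layer in layers)
    xmax = min(layer.xmax for layer in layers)
    ymax = min(layer.ymax for layer in layers)
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("layers do not overlap")
    return (xmin, ymin, xmax, ymax), max(layer.cell_size for layer in layers)


# Made-up layers for illustration only.
layers = [
    Layer("precipitation", -130.0, 10.0, -60.0, 60.0, 0.5),
    Layer("elevation", -120.0, 0.0, -70.0, 50.0, 0.1),
]
extent, cell_size = common_frame(layers)
print(extent, cell_size)
```

Other choices (union extent, finest resolution with interpolation) are equally possible; the point is that the rule is stated once instead of being buried in a custom script.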
Compute cycles
• Ecological modeling problems are typically computation-limited
• For niche modeling, researchers desire to examine, for example, predicted distribution of all mammals of the Western Hemisphere under current conditions and 5 IPCC climate change scenarios
(200 to 500 runs per species) × (~2000 mammal species) × (3 minutes/run) = 833 to 2083 days
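The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide and assumes the runs execute strictly one after another; the `parallel_days` helper is an added assumption (independent runs, perfect scaling) and is not part of the original estimate.

```python
# Figures quoted on the slide.
MINUTES_PER_RUN = 3
SPECIES = 2000                 # approximate number of mammal species
RUNS_PER_SPECIES = (200, 500)  # low and high estimates
MINUTES_PER_DAY = 60 * 24


def serial_days(runs_per_species: int) -> float:
    """Days of compute time if every run executes sequentially."""
    total_minutes = runs_per_species * SPECIES * MINUTES_PER_RUN
    return total_minutes / MINUTES_PER_DAY


def parallel_days(runs_per_species: int, workers: int) -> float:
    """Idealized estimate: assumes independent runs and perfect scaling."""
    return serial_days(runs_per_species) / workers


low, high = (serial_days(r) for r in RUNS_PER_SPECIES)
print(f"{low:.0f} to {high:.0f} days")
```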
Testing alternative models
• Researchers want to ‘tweak’ models
– Explore alternative algorithms
– Modify parameterization
• Iterate over many combinations of these
• Final results and intermediate versions of models need to be saved and versioned
• For niche modeling, the following algorithms are commonly used by researchers:
• DA, discriminant analysis
• BM, Bayesian Model
• BP, bioclimatic profiles
• CART, classification and regression trees
• GAM, generalized additive models
• GLM, generalized linear models
• GARP, genetic algorithm for rule-set production
• MD, Mahalanobis distance method
• NNETW, neural networks
• SI, spatial interpolation
From: Segurado and Araujo. 2004. An evaluation of methods for modelling species distributions. Journal of Biogeography 31, 1555–1568.
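Iterating over combinations of algorithms and parameters, and giving each combination a stable version identifier, can be sketched as follows. The parameter names and values are hypothetical placeholders, not settings of any real niche modeling package, and hashing the configuration is just one simple way to obtain a reproducible identifier.

```python
import hashlib
import itertools
import json

ALGORITHMS = ["GARP", "GLM", "GAM"]  # a subset of the algorithms listed above
PARAMETER_GRID = {                   # hypothetical parameters for illustration
    "max_iterations": [100, 1000],
    "convergence_limit": [0.01, 0.001],
}


def run_configurations(algorithms, grid):
    """Yield one configuration dict per algorithm/parameter combination."""
    names = sorted(grid)
    for algorithm in algorithms:
        for values in itertools.product(*(grid[name] for name in names)):
            yield {"algorithm": algorithm, **dict(zip(names, values))}


def version_id(config):
    """Short identifier derived from the configuration contents."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:10]


configs = list(run_configurations(ALGORITHMS, PARAMETER_GRID))
catalog = {version_id(config): config for config in configs}
print(len(catalog), "versioned configurations")
```

Because the identifier depends only on the configuration, rerunning the same combination maps to the same catalog entry, which is the property needed for saving and comparing intermediate model versions.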
Model complexity
• Models and analyses typically consist of 100s of analytical processing steps
– Understanding the model becomes very difficult
  • “Spaghetti code” is common
– Only experts can modify or review the model
– Complexity increases the chance of undetected errors
Current approaches to ENM
• Evolving, but typically these are custom simulation models
– e.g., GARP
• Tend towards monolithic applications that handle everything in one place (data ingestion, transformation, model execution, output management, statistical analysis)
– These models typically are difficult to extend, modify, or understand and require specialized expertise to use
– This is typical of many models in ecology, and is largely due to the difficulty of managing complexity in modern programming languages
Alternative approach: scientific workflows
• What are scientific workflows?
– Graphical model of data flow among processing steps
– Inputs and Outputs of components are precisely defined
– Components are modular and reusable
– Flow of data controlled by a separate execution model
– Support for hierarchical models
[Figure: workflow diagram with components A, B, and C; labels: Source (e.g., data), Processor (e.g., regression), Sink (e.g., display); additional components A’, B, D, E, F]
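The properties listed above can be illustrated with a toy sketch: modular components, an execution function kept separate from the components, and a composite that wraps several steps into one (a simple form of hierarchy). This is not how Kepler is implemented; `Component`, `composite`, and `run_workflow` are hypothetical names used only for the illustration.

```python
from typing import Callable, Iterable, List


class Component:
    """A reusable processing step with one input and one output."""

    def __init__(self, name: str, func: Callable[[float], float]):
        self.name = name
        self.func = func


def composite(name: str, steps: List[Component]) -> Component:
    """Wrap several steps as a single component (hierarchical modeling)."""
    def run(item: float) -> float:
        for step in steps:
            item = step.func(item)
        return item
    return Component(name, run)


def run_workflow(source: Iterable[float], steps: List[Component],
                 sink: Callable[[float], None]) -> None:
    """Toy execution model: push each source item through the steps in order."""
    for item in source:
        for step in steps:
            item = step.func(item)
        sink(item)


double = Component("double", lambda x: x * 2)
increment = Component("increment", lambda x: x + 1)

flat_results: List[float] = []
run_workflow([1.0, 2.0, 3.0], [double, increment], flat_results.append)

nested_results: List[float] = []
nested = composite("double_then_increment", [double, increment])
run_workflow([1.0, 2.0, 3.0], [nested], nested_results.append)
print(flat_results, nested_results)
```

The flat and the nested workflow produce the same values, which is the point of hierarchy: a group of steps can be collapsed into one box without changing the result.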
Kepler Scientific Workflow System
• Software to design and execute scientific workflows
– Variety of analytical components (including spatial data transformations)
– Support for R scripts and Matlab scripts
– Real-time data access to sensor networks
– Cross-project collaboration
  • SEEK, SciDAC, GEON, Ptolemy, RoadNet, EOL, Resurgence
• EcoGrid access to heterogeneous environmental data
– EML Data support
  • Experimental data, survey data, spatial raster and vector data, etc.
– DarwinCore Data support
  • Museum collections
– GeoSciences Network (GEON) Data Support
• Demonstration workflows from many domains
– Ecology: Ecological Niche Modeling
– Genomics: Promoter Identification Workflow
– Geology: Geologic Map Information Integration
– Oceanography: Real-time Revelle example of data access
A simple Kepler workflow
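Data source from EcoGrid (metadata-driven ingestion)
R processing script:
# Fit a linear regression of BARO on T_AIR
res <- lm(BARO ~ T_AIR)
# Print the fitted model
res
# Scatter plot of the data with the fitted line overlaid
plot(T_AIR, BARO)
abline(res)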
Ecological Niche Model in Kepler
Have scientific workflows improved ENM?
• Data discovery, access, and archiving
– Direct access to data archives from natural history collections, ecology, and geology
– Ability to archive outputs back into data storage systems
• Data integration
– Directly handled by specialized components in the workflow
• Compute cycles
– Current and growing grid computing support in Kepler increases runtime efficiency
• Alternative model testing
– Decouples components and allows simple modifications of the model
– Workflows act as a full description of an executed process
  • Can be saved to the same repositories as data, allowing for complete replication of the model results
• Model complexity
– Visual display that documents and elucidates the model
– Hierarchical modeling allows abstraction at higher levels
Semantics in scientific workflows
• Components and their ports typically have:
– Explicit ‘structural type’
  • e.g., int, float, string, {double}
– Implicit semantic type
  • Not sure whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values
[Figure: components A and B connected port to port; structural type pairs shown: int/int, string/int, int/int; semantic type pairs shown: rainfall/bodysize, bodysize/bodysize]
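The gap between the two kinds of type can be shown in a few lines. In this minimal sketch a port carries both a structural and a semantic type, and compatibility is a plain equality test, which is a deliberate simplification. The `Port` class and the port names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical port description carrying both kinds of type."""
    name: str
    structural_type: str  # e.g. "int", "string"
    semantic_type: str    # e.g. "rainfall", "bodysize"


def structurally_compatible(out_port: Port, in_port: Port) -> bool:
    return out_port.structural_type == in_port.structural_type


def semantically_compatible(out_port: Port, in_port: Port) -> bool:
    return out_port.semantic_type == in_port.semantic_type


a_out = Port("A.out", "int", "rainfall")
b_in = Port("B.in", "int", "bodysize")

# The link passes a structural check but is meaningless semantically.
print(structurally_compatible(a_out, b_in))  # True
print(semantically_compatible(a_out, b_in))  # False
```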
Ecological ontologies
• Model of knowledge in a domain like ecology or biodiversity
– What was measured (e.g., biomass)
– Type of measurement (e.g., Energy)
– Context of measurement (e.g., Psychotria limonensis)
– How it was measured (e.g., dry weight)
Knowledge Representation
Current SEEK Ontologies
– Ecological Concepts, Models, Networks
– Measurements
– Properties
– Statistical Analyses
– Time and Space
– Taxonomic Identifiers
– Units
– Symbiosis
Recent Developments
– Biodiversity (measured traits, computation of traits)
– Descriptive Terminology for Plant Communities
– Analytical components
– Ontology documentation
Future Goals
– “Fill-in” existing concepts, evolve the ontology framework
– More domains …
Semantic Annotation
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
[Figure: diagram relating Data, Ontology, and Workflow Components]
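A rough sketch of what such labels make possible: once data columns and component ports both point at ontology terms, matching data to a component input becomes a lookup. The file names, column names, component name, and ontology terms below are all hypothetical and are not taken from the SEEK ontologies.

```python
# Hypothetical annotations: dataset -> {column: ontology term}
data_annotations = {
    "plots.csv": {"rain_mm": "Rainfall", "wt_g": "Biomass"},
    "mammals.csv": {"mass": "BodySize"},
}

# Hypothetical annotations: component -> {input port: ontology term}
component_inputs = {
    "BiomassSummary": {"values": "Biomass"},
}


def columns_for(component: str, port: str):
    """Find every (dataset, column) whose term matches the port's term."""
    wanted = component_inputs[component][port]
    return [
        (dataset, column)
        for dataset, columns in data_annotations.items()
        for column, term in columns.items()
        if term == wanted
    ]


matches = columns_for("BiomassSummary", "values")
print(matches)
```

The column name `wt_g` gives no hint that it holds biomass; the match works only because of the annotation.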
Annotating a Component
Semantic workflow validation
• Check if an existing workflow is semantically valid
– All connected ports have compatible semantic types
– All ports that are required are connected
– Visually indicate status with red links for invalid connections
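The two checks above can be sketched as a small validation routine. It assumes that "compatible" means the output's semantic type is the same as, or a subtype of, the input's type in an is-a hierarchy; that rule, the hierarchy, and all port names are assumptions made for the illustration and do not describe Kepler's actual validator.

```python
# Hypothetical is-a hierarchy: child term -> parent term
IS_A = {
    "Rainfall": "Precipitation",
    "Precipitation": "Measurement",
    "BodySize": "Measurement",
}


def is_subtype(term, ancestor):
    """True if term equals ancestor or reaches it by following IS_A links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def validate(workflow):
    """Return a list of problems; an empty list means the workflow is valid."""
    problems = []
    connected_inputs = set()
    for src, dst in workflow["links"]:
        connected_inputs.add(dst)
        src_type = workflow["outputs"][src]
        dst_type = workflow["inputs"][dst]["type"]
        if not is_subtype(src_type, dst_type):
            problems.append(f"invalid link {src} -> {dst}: "
                            f"{src_type} is not a {dst_type}")
    for port, spec in workflow["inputs"].items():
        if spec["required"] and port not in connected_inputs:
            problems.append(f"required port {port} is not connected")
    return problems


workflow = {
    "outputs": {"Gauge.out": "Rainfall"},
    "inputs": {
        "ClimateModel.precip": {"type": "Precipitation", "required": True},
        "ClimateModel.mask": {"type": "Measurement", "required": True},
        "Allometry.size": {"type": "BodySize", "required": True},
    },
    "links": [
        ("Gauge.out", "ClimateModel.precip"),  # valid: Rainfall is a Precipitation
        ("Gauge.out", "Allometry.size"),       # invalid: would be drawn in red
    ],
}
problems = validate(workflow)
for problem in problems:
    print(problem)
```

A user interface could map each "invalid link" entry to a red connection, as the slide describes.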
Searching with Semantics
In summary…
• Typical analytical models are complex and difficult to comprehend and maintain
• Scientific workflows provide an intuitive way to introduce structure and efficiency to the modeling and analysis process
• Adding semantic tools to workflow design and execution also increases usability of the workflow tool
• Kepler is an evolving but effective tool for scientists
– http://kepler-project.org
Current and future work
• Knowledge Representation
– Better match between ontology and scientist’s mental model
– Refined ontologies for biodiversity and niche modeling
– Refined supporting ontologies (e.g., space & time)
• Kepler
– Semantically-driven data integration
– Workflow composition and transformation
– Ontology directed workflow design
– Final niche modeling workflow completed
Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence