Data Integration, Analysis, and Synthesis
Matthew B. Jones
National Center for Ecological Analysis and Synthesis
University of California Santa Barbara
Scalable Information Networks for the Environment
http://knb.ecoinformatics.org
Funding: National Science Foundation (DEB99-80154, DBI99-04777)
NCEAS’ Mission
Integrate existing data for broad ecological synthesis
Use synthesis to inform policy and management
Synthesis at NCEAS
Research, Management, Policy
200+ synthesis projects, 1900+ participating scientists
Research projects
  Hunsaker – Quantification of Uncertainty in Spatial Data for Ecological Applications
  Ives & Frost – Intrinsic and Extrinsic Variability in Community Dynamics
  Osenberg – Meta-Analysis, Interaction Strength and Effect Size; Application of Biological Models to the Synthesis of Experimental Data
  Murdoch – Complex Population Dynamics
Management projects
  Andelman – Designing and Assessing the Viability of Nature Reserve Systems at Regional Scales: Integration of Optimization, Heuristic and Dynamic Models
  Boersma & Kareiva – Prospectus For An Analysis of Recovery Plans and Delisting
  Kareiva – Habitat Conservation Planning for Endangered Species
  Lubchenco, Palumbi, & Gaines – Developing the Theory of Marine Reserves
Policy projects
  Costanza & Farber – The Value of the World's Ecosystem Services and Natural Capital: Toward a Dynamic, Integrated Approach
http://www.nceas.ucsb.edu/
Synthesis projects
Use existing data...
  Distributed sources
  Varying protocols
  Varying formats
Obtained via personal collaboration
Functional breakdown for synthesis
Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data Conversion & Integration
Analysis & Modeling
Visualization
Presentation Outline
Integration, Analysis, and Synthesis:
  Challenges
Data Heterogeneity
Population survey
Experimental
Taxonomic survey
Behavioral
Meteorological
Oceanographic
Hydrology
Economic
Social (urban ecology)
Paleoecological
Historical
Land use
Demographics
…
Types of Heterogeneity
Intensional vs. Arbitrary Heterogeneity
Syntax (format)
  CSV, Fixed ASCII, proprietary binary
Schema (organization)
  Non-normalized models
Semantics (meaning/methods)
  Protocol semantics (e.g., scale)
  Parameter semantics (e.g., bodysize (g))
  Conceptual framework (e.g., experimental trts)
  Taxonomy + nomenclature
Data Dispersion
Data are distributed among:
  Independent researcher holdings
  Research station collections
    LTER Network (24 sites)
    Org. of Biological Field Stations (168 sites)
    Univ. Cal Natural Reserve System (36 sites)
    MARINE (62 sites)
    PISCO
  Agency databases
  Museum databases
Access via personal networking
  Not scalable
Lack of Metadata
Majority of ecological data undocumented
  Lack information on syntax, schema and semantics of data
  Impossible to understand data without contacting the original researchers
Documentation conventions widely vary
  Requires large time investment to understand each data set
Scaling Data Integration
Because of:
  Data heterogeneity
  Data dispersion
  Lack of documentation
Integration and synthesis are limited to a manual process
Thus, difficult to scale integration efforts up to large numbers of data sets
Data Integration
Table A
  Date       Site  Species  Area  Count
  10/1/1993  N654  PIRU     2     26
  10/3/1994  N654  PIRU     2     29
  10/1/1993  N654  BEPA     1     3
Table B
  Date       Site  picrub  betpap
  31Oct1993  1     13.5    1.6
  14Nov1994  1     8.4     1.8
Table C
  Date        Site  Species            Density
  10/1/1993   N654  Picea rubens       13
  10/3/1994   N654  Picea rubens       14.5
  10/1/1993   N654  Betula papyrifera  3
  10/31/1993  1     Picea rubens       13.5
  10/31/1993  1     Betula papyrifera  1.6
  11/14/1994  1     Picea rubens       8.4
  11/14/1994  1     Betula papyrifera  1.8
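The mechanical steps implied by tables A, B, and C can be sketched in a few lines of Python. This is only an illustration, not the KNB integration engine: the function name `integrate`, the lookup table `SPECIES_NAMES`, and the column list `B_SPECIES_COLUMNS` are assumptions introduced here, and the sketch assumes that Count divided by Area in table A gives a density in the same units as the values in table B.

```python
from datetime import datetime

# Table A: long layout, species codes, counts over a sampled area.
TABLE_A = [
    {"Date": "10/1/1993", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 26},
    {"Date": "10/3/1994", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 29},
    {"Date": "10/1/1993", "Site": "N654", "Species": "BEPA", "Area": 1, "Count": 3},
]

# Table B: wide layout, one density column per species, different date syntax.
TABLE_B = [
    {"Date": "31Oct1993", "Site": "1", "picrub": 13.5, "betpap": 1.6},
    {"Date": "14Nov1994", "Site": "1", "picrub": 8.4, "betpap": 1.8},
]

# Assumed lookup from local codes to scientific names; in practice this
# information would have to come from each data set's metadata.
SPECIES_NAMES = {
    "PIRU": "Picea rubens",
    "BEPA": "Betula papyrifera",
    "picrub": "Picea rubens",
    "betpap": "Betula papyrifera",
}
B_SPECIES_COLUMNS = ("picrub", "betpap")


def integrate(table_a, table_b):
    """Merge both tables into the Date/Site/Species/Density layout of table C."""
    rows = []
    # Semantic conversion: count per area becomes density.
    for r in table_a:
        rows.append({
            "Date": r["Date"],
            "Site": r["Site"],
            "Species": SPECIES_NAMES[r["Species"]],
            "Density": r["Count"] / r["Area"],
        })
    # Syntax and schema conversion: reparse dates, unpivot species columns.
    for r in table_b:
        d = datetime.strptime(r["Date"], "%d%b%Y")
        date = f"{d.month}/{d.day}/{d.year}"
        for col in B_SPECIES_COLUMNS:
            rows.append({
                "Date": date,
                "Site": r["Site"],
                "Species": SPECIES_NAMES[col],
                "Density": r[col],
            })
    return rows


if __name__ == "__main__":
    for row in integrate(TABLE_A, TABLE_B):
        print(row["Date"], row["Site"], row["Species"], row["Density"])
```

Each step needs a piece of information that lives outside the data files themselves (the date format, the meaning of the species codes, the fact that Count/Area is comparable to the density columns), which is the role the talk assigns to metadata.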
Presentation Outline
Integration, Analysis, and Synthesis:
  Challenges
  Current work
    Knowledge Network for Biocomplexity
    Partnership for Biodiversity Informatics
Knowledge Network for Biocomplexity (KNB)
National network for biocomplexity data
  Data discovery
  Data access
  Data interpretation
Enable advanced services
  Data integration
  Analysis framework
  Hypothesis modeling
  Visualization
Central Role of Metadata
What metadata?
  Ownership, attribution, structure, contents, methods, quality, etc.
Critical for addressing data heterogeneity issues
Critical for developing extensible systems
Critical for long-term data preservation
Allows advanced services to be built
KNB Components
Ecological Metadata Language (EML)
Morpho -- data management for ecologists
  Cross platform Java application
Metacat -- flexible metadata & data system
Analysis and Modeling engine
Data integration engine
Semantic Query Processor
Hypothesis Modeling Engine
Ecological Metadata Language
XML syntax for representing metadata
Extensible – can add new metadata
Modular – can subset metadata for specific applications
EML 2.0beta3 modules
  eml-resource -- Basic resource info
  eml-dataset -- Data set info
  eml-literature -- Citation info
  eml-software -- Software info
  eml-party -- People and Organizations
  eml-entity -- Data entity (table) info
  eml-attribute -- Attribute (variable) info
  eml-constraint -- Integrity constraints
  eml-physical -- Physical format info
  eml-access -- Access control
  eml-distribution -- Distribution info
  eml-project -- Research project info
  eml-coverage -- Geographic, temporal and taxonomic coverage
  eml-protocol -- Methods and QA/QC
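To make "extensible" and "modular" concrete, here is a rough Python sketch of an XML metadata document that can be extended with a new section and subset for a particular application. The element names, attribute names, and sample values are hypothetical placeholders chosen for illustration; they are not taken from the EML schema, and `build_metadata` and `subset` are names invented here.

```python
import copy
import xml.etree.ElementTree as ET


def build_metadata():
    """Build a tiny metadata document (hypothetical element names, not EML)."""
    root = ET.Element("metadata")
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = "Example tree density survey"
    entity = ET.SubElement(root, "entity", name="plots")
    ET.SubElement(entity, "attribute", name="Density", unit="number per square meter")
    coverage = ET.SubElement(root, "coverage")
    ET.SubElement(coverage, "taxon").text = "Picea rubens"
    return root


def subset(root, wanted):
    """Return a new document holding only the top-level sections in `wanted`."""
    out = ET.Element(root.tag)
    for child in root:
        if child.tag in wanted:
            out.append(copy.deepcopy(child))
    return out


if __name__ == "__main__":
    doc = build_metadata()
    # Extensible: a new kind of metadata is added without touching the rest.
    ET.SubElement(doc, "project").text = "Example synthesis project"
    # Modular: an application that only needs structure and coverage takes those.
    view = subset(doc, {"entity", "coverage"})
    print(ET.tostring(view, encoding="unicode"))
```

The design point is that each section can be read, validated, or exchanged on its own, so a tool that only needs table structure does not have to understand citation or access-control metadata.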
Metacat metadata system
[Figure: network diagram. Labels: LTER Metacat, NCEAS Metacat, SDSC Metacat, NRS Metacat, SEV Metacat, Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML wrapper, AND, SEV, CAP, OBFS, Key]
Metacat architecture
[Figure: architecture diagram. Labels: Metacat Server; HTTP Server (Apache); Java Servlet Engine (Tomcat); Metacat Servlet (Dispatcher); Transformation Subsystem; Storage Subsystem; Query Subsystem; Replication Subsystem; Validation Subsystem; Authentication Interface; LDAP Adapter; LDAP; JDBC API; RDBMS (Oracle); Data Storage Interface; FS Adapter; File System]
Metacat web interface
UC Natural Reserve System
OBFS Network
LTER Network
Functional breakdown for synthesis
Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data Conversion & Integration
Analysis & Modeling
Visualization
Quality Assessment system
[Diagram: Semantic Metadata + Researcher Decisions + Data, combined to produce a Quality Assessment Report]
Quality Assessment
Integrity constraint checking
Data type checking
Metadata completeness
Data entry errors
Outlier detection
Check assertions about data
  e.g., trees don’t shrink
  e.g., sea urchins do
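A few of these checks can be sketched directly in Python. The sketch below is an assumption-laden illustration rather than the KNB quality assessment system: the function names, the column names (`tree`, `year`, `dbh_cm`), and the choice of a median-absolute-deviation rule for outliers are all introduced here. Note that the "don't shrink" assertion is taxon-specific, as the sea urchin example on the slide points out, so it would need to be switched on by metadata rather than applied to every data set.

```python
from statistics import median


def type_errors(rows, schema):
    """Data type checking: list (row index, column) pairs with a wrong or missing type."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, expected in schema.items()
        if not isinstance(row.get(col), expected)
    ]


def outliers(values, k=5.0):
    """Outlier detection: values farther than k median absolute deviations from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [v for v in values if abs(v - center) > k * mad]


def shrinking_trees(rows):
    """Assertion check: flag trees whose diameter decreases from one year to the next."""
    last_seen = {}
    flagged = set()
    for row in sorted(rows, key=lambda r: r["year"]):
        tree = row["tree"]
        if tree in last_seen and row["dbh_cm"] < last_seen[tree]:
            flagged.add(tree)
        last_seen[tree] = row["dbh_cm"]
    return flagged


if __name__ == "__main__":
    schema = {"tree": str, "year": int, "dbh_cm": float}
    measurements = [
        {"tree": "T1", "year": 1993, "dbh_cm": 20.0},
        {"tree": "T1", "year": 1994, "dbh_cm": 19.0},
        {"tree": "T2", "year": 1993, "dbh_cm": 12.0},
        {"tree": "T2", "year": 1994, "dbh_cm": 12.5},
    ]
    print(type_errors(measurements, schema))
    print(shrinking_trees(measurements))
    print(outliers([10.1, 10.4, 9.8, 10.0, 55.0]))
```

A flagged value is a prompt for a researcher decision, not an automatic correction: a "shrinking" tree may be a data entry error or a genuinely broken stem.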
Data Integration
[Diagram: Semantic Metadata + Researcher Decisions + Data, combined to produce the Integrated Data Set]
Integrated Data Set
  Date        Site  Species            Density
  10/1/1993   N654  Picea rubens       13
  10/3/1994   N654  Picea rubens       14.5
  10/1/1993   N654  Betula papyrifera  3
  10/31/1993  1     Picea rubens       13.5
  10/31/1993  1     Betula papyrifera  1.6
  11/14/1994  1     Picea rubens       8.4
  11/14/1994  1     Betula papyrifera  1.8
Data Integration
[Figure: tables A, B, and C repeated from the earlier Data Integration slide, with a chart of the seven Picea rubens and Betula papyrifera rows of integrated table C; axis "Density (#/m2)", scaled 0 to 16]
Scaling Analysis and Modeling
[Diagram labels]
Data and Metadata Input (from Morpho/Metacat)
Execution engine (plugins)
  SAS
  R
  Matlab
  Simulation models
  ...
Analysis + Model Metadata
  Inputs
  Outputs
  Processing
Output
Scaling Analysis and Modeling
[Diagram: Execution Engine workflow. Labels, in source order]
Data and Metadata Input
Configuration for Analysis and Models
  DDL Specification (Inputs and DDL Code)
  Procedural Specification (Inputs and proc code)
  Input Map Specification (test inputs mapped to metadata/data fields)
Script with unresolved variables
Input Map Parser
Test Specification Parser
Script with symbolically resolved variables
Script/Metadata/Data Validation and Conflict Resolution
User or ontological input for conflict resolution
Data/Metadata Input facilitator and Parser
Data Package (Metadata with data file)
Fully resolved final script
Script Executor
Output (HTML, XML, Text, etc.)
Script with some fully resolved variables
Analytical Engine Plugin
Output Stream from Analytical Engine
Output Renderer
Output Config File
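The progression from a script with unresolved variables, to one with some resolved variables, to a fully resolved final script can be illustrated with Python's standard `string.Template`. This is only a sketch of the idea of staged variable resolution: the function names, the `$response`/`$predictor`/`$table` placeholders, and the R-like script text are assumptions made here, and the sketch does not model the parsers, validation, or engine plugins named in the diagram. The script is handled purely as text and is never executed.

```python
from string import Template


def resolve_partially(script, bindings):
    """Fill in the variables we know; leave the rest as $placeholders."""
    return Template(script).safe_substitute(bindings)


def resolve_fully(script, bindings):
    """Fill in all remaining variables; raises KeyError if any is still unbound."""
    return Template(script).substitute(bindings)


if __name__ == "__main__":
    # Script with unresolved variables (hypothetical analysis specification).
    script = "summary(lm($response ~ $predictor, data = $table))"

    # Stage 1: an input map binds some variables to metadata fields.
    partial = resolve_partially(script, {"response": "Density"})
    print(partial)

    # Stage 2: the data package supplies the remaining bindings.
    final = resolve_fully(partial, {"predictor": "Species", "table": "plots"})
    print(final)
```

Keeping the final step strict (it fails if anything is left unbound) mirrors the validation and conflict-resolution stage in the diagram: a script should not reach the analytical engine with variables that no metadata or user input has resolved.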
Semantic metadata
Describes the relationship between measurements and ecologically relevant concepts
Drawn from a controlled vocabulary
  Ontology for ecological measurements
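Ecological Ontologies
[Diagram: concept graph linking Biodiversity, Species, Taxon, Organism, and the measurements below through "isa" and "has" relations]
Sampling Area (A)
Species Count (S)
Abundance of Species i ($N_i$)
Abundance (N): $N = \sum_{i=1}^{S} N_i$
Proportional Abundance of Species i ($p_i$): $p_i = N_i / N$
Shannon Diversity (H'): $H' = -\sum_{i=1}^{S} p_i \ln p_i$
Species Evenness (J'): $J' = H' / \ln S$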
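As a worked example of how the measurements in this ontology depend on one another, the sketch below computes proportional abundance, Shannon diversity (the negative sum of p_i ln p_i over species), and evenness (H' divided by ln S) from a list of per-species abundances. The function names and the sample counts are assumptions made for illustration.

```python
import math


def proportional_abundances(counts):
    """p_i = N_i / N for every species with a nonzero abundance."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon_diversity(counts):
    """H' = -sum(p_i * ln(p_i)) over the species present."""
    return -sum(p * math.log(p) for p in proportional_abundances(counts))


def species_evenness(counts):
    """J' = H' / ln(S), where S is the number of species present."""
    s = sum(1 for c in counts if c > 0)
    if s < 2:
        raise ValueError("evenness is undefined for fewer than two species")
    return shannon_diversity(counts) / math.log(s)


if __name__ == "__main__":
    even_community = [10, 10, 10, 10]   # hypothetical abundances N_i
    skewed_community = [37, 1, 1, 1]
    print(shannon_diversity(even_community), species_evenness(even_community))
    print(shannon_diversity(skewed_community), species_evenness(skewed_community))
```

With four equally abundant species, H' equals ln 4 and J' equals 1; the skewed community has the same species count but lower diversity and evenness, which is why the ontology keeps S, N_i, and the derived indices as distinct, related concepts.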
What drives synthesis
Science questions
Hypotheses
Analyses + Models
Integrated Data
Original Data
Conclusions
Barriers to integration can be addressed using structured metadata
Can accomplish a lot with ‘just’ mechanical transformations
Domain ontologies + semantic mediation are paths to scaling integration
Analysis drives all other phases of integration