federated databases for the geosciences csig july 21, 2005 douglas s. greer
TRANSCRIPT
Overview
• Database Federation Primer– Basic concepts and principles– DB2 Information Integrator
• The CHRONOS Federated Database– Integration of 7 independently developed
geoscience databases
Top-Level View of a Federated Database
Applications
FederatedDatabase
Data Source A Data Source DData Source CData Source B
Federated DB Data Sources
• Geographically Distributed Data Sources
• Heterogeneous Data Sources– Relational Databases – most common– Non-relational Sources– Web Pages / Web Services– Flat Files
Federated Databases
• May or may not actually contain data
• Federated database can create Global Views that define data in a uniform way across the data sources
• Applications can then access data through the global view using the standardized SQL schema
IBM DB2 Information Integrator
• Provides a framework for strategic information integration to help applications access, manipulate and integrate diverse and distributed data sources across multiple servers in real time.
• Can access structured and unstructured data types including relational databases such as Oracle, MySQL, PostgreSQL and MS SQL Server
Connecting to the Remote Database
• Step 1 – Create WRAPPER– Mechanism that the federated server uses to
communicate with a data source
– Identifies “Driver” code
• Step 2 – Identify SERVER– Identifies the connection to a data source
– Specifies which WRAPPER to use
– Directly or Indirectly specifies the server name, server type, version, database name and special parameters
Connecting to the Remote Database
• Step 3 – Specify USER MAPPING– Maps between a federated database user and an
authorized user (account and password) of a data source
• Step 4 – Define NICKNAMES– Pointer to a table or view in a data source
– Creates a binding between a local name and the data source name and hides the associated metadata details
A Simple Federated View
CREATE VIEW <Table_Name> AS
SELECT
(Database #1 SQL Command)
UNIONSELECT
(Database #2 SQL Command) UNION
SELECT(Database #3 SQL Command)
Identifying Data Sources
CREATE VIEW <Table_Name> ASSELECT
‘PALEOSTRAT’ AS db_namegenus_id AS genus…FROM PSTRAT.tbl_taxonomy…
UNIONSELECT
‘PALEOBIOLOGY’ AS db_namegenus_name AS genus…
Materialized Views• Federated databases normally do not store data
locally. Data from remote sites is fetched as needed.• Materialized Views create a local copy of a Global-
View.– Advantage: faster access – Disadvantages: Data may be stale. Refreshes required
• Several of the CHRONOS Global-Views have versions that use materialized views to increase performance
CHRONOS Project
• Create a dynamic, interactive and time-calibrated framework for Earth history
• Network of chronostratigraphy databases• Online stratigraphic record• Visualization and analytical tools• Develop a better understanding of fundamental
Earth processes through time
CHRONOS Federated Databases
• The following databases are all part of the CHRONOS Federated Database at SDSC based on IBM’s DB2 Information Integrator– Neptune– PaleoStrat– PaleoBiology– Janus– TimeScale– FAUNMAP– MIOMAP
Neptune Database
• Developed at ETH Zürich and currently hosted by Iowa State University
• Contains microfossil occurrences reported in DSDP and ODP samples
• PostgreSQL based• Contains four basic types of data: Fossil Records,
Taxonomy, Age models and Biogeography data• Schema contains approximately 20 tables with
hundreds of thousands of taxonomic occurrences
PaleoStrat Database
• Developed at Boise State University in collaboration with the CHRONOS
• Designed to support geoscience tools with broad applicability
• Contains sedimentary, paleontologic and stratigraphic data
• MS SQL Server based• Approximately 120 tables with thousands
taxonomic occurrences• Data from other databases currently being loaded
PaleoBiology Database
• Hosted by the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California at Santa Barbara
• Contains collection-based occurrence and taxonomic information about marine and terrestrial animals and plants
• MySQL based• 16 tables with hundreds of thousands of
taxonomic occurrences
Janus Database
• Database for the Integrated Ocean Drilling Program (IODP) hosted at Texas A&M University
• Contains numerous types of ocean drilling data collected by United States, Japanese and European ships
• Oracle based• Approximately 580 tables with millions of
taxonomic occurrences
TimeScale Database
• Contains data and information from the 2004 Global Time Scale of the International commission on Stratigraphy and 19 other time scales
• Supports web service conversions tools• PostgreSQL based• Approximately 25 tables with thousands of data
records
FAUNMAP Database
• Hosted by Illinois State Museum• Contains information about the historical
distribution of mammal species in the United States
• MySQL based• Approximately 30 tables with tens of thousands of
data records
MIOMAP Database
• Hosted by University of California, Berkeley• Contains comprehensive spatial and temporal
analysis of Miocene mammal taxa for the Western United States
• MySQL based• Thousands of records in a relatively small number
of tables
The Taxa Global-View
• Simple View to list taxa in all of the databases
• CHRONOS Taxa– Database Name– Table_Name– Taxon_ID– Genus– Species
Conop9 Application
• Developed by Peter M. Sadler, Dept. of Earth Sciences, Univ. of California Riverside
• Correlates stratigraphic sections by minimizing the number of inconsistencies in the order of first and last occurrences of fossils between sections
• Originally developed for flat files then adapted to CHRONOS DB2/II global-views
Conop9 Global View
• Developed for the Conop9 Application• The Conop9 SDSC global-view provides a much
larger collection of data than that available in the older flat file system
• The CHRONOS global-view presents exactly the data needed by Conop9 but uses different SQL statements for each database – this involves joins across four tables in Neptune, seven tables in PaleoStrat and five tables in Janus
Conop9 Global-View Attributes
• CHRONOS Conop Global View Fields– Database Name– Genus– Species– Taxon_id – Used to create Conop9 input tables– Hole_id – Which stratigraphic section does this come from– LAD – Last Appearance Datum, newest observation of this
taxa for this hole– FAD – First Appearance Datum, oldest observation of this
taxa for this hole– LAD and FAD are the result of an SQL computation
Age/Depth Plot Global-Views
• Uniform Global-View of hole location for ADP application
• Surprisingly there are significant differences between databases
• CHRONOS Hole_Summary– Database Name– Hole_ID– Latitude– Longitude
Age/Depth Plot Views
• Uniform Global-View for Hole/Taxa Description for ADP application
• CHRONOS Hole_Desc– Database Name– Hole_ID– Elevation– Meters_of_Section– Taxa_Count