in43a-0326 datafinder: using ontologies and reasoning to ... · organizational schemes are used for...

1
DataFinder: Using Ontologies and Reasoning to Enhance Metadata Search Thomas Russ and Hans Chalupsky University of Southern California, Information Sciences Institute 4676 Admiralty Way, Marina del Rey, CA 90292 [email protected], [email protected] Manual Search is No Solution Especially with Many Files… Doesn’t use useful abstractions, e.g. Only specific models, not model types The mesh spacing is specified separately in all three dimensions in the metadata Individual components of the region definition are not aggregated Lack of regional comparisons Individual components must be found, combined and compared separately Alternate metadata schema used by different computational pathways, e.g: Example: Find a velocity mesh with Grid spacing = 125m In region bounded by 34°N, 118°W and 36°N, 117°W Before DataFinder, the only way to find files in the repository was through an interface that allowed browsing the metadata associated with individual files. Velocity Model Classes and Instances DataFinder is part of the Southern California Earthquake Center (SCEC) Community Modeling Environment’s suite of tools for generating computational workflows and locating the resulting data products. A workflow is a specific series of computations that produce data files based on simulation models. A given workflow involves running various software modules which pass data via often large data files. When a workflow is being instantiated, a decision needs to be made whether to use an existing data file or to generate the required file by running a computational model. Since it is often more efficient to use existing data, locating previously computed results can produce a better workflow. The files produced by SCEC computational models have descriptive metadata stored as pairs of attribute names and values. These attributes describe the content and provenance of the files and can be used to identify files of interest. But depending on which software was used to prepare the files, different attribute names and different organizational schemes are used for the metadata. The low-level nature of the metadata descriptions and the variety of metadata schemata present challenges for finding files needed for workflows which DataFinder addresses. In addition to its part in workflow instantiation process, DataFinder also provides a way to locate end data products for users. Currently in proof-of-concept form for locating map files, such an extension of the current web-based interface to DataFinder will make them accessible to a wider range of users. It is not easy to find data files that match the requirements of a given workflow. The existing SCEC CME metadata browsing tools were not up to this task. They present the metadata on a per-file basis which just does not scale. The example of searching for a velocity mesh covering a particular region shown at right illustrates the problems with a browsing approach. In addition to the browsing problems, there are additional shortcomings of using low- level attribute-value pairs: DataFinder: Semantic Enhancement with Ontologies Available Metadata and Current Approach Why Not Just Extend the Metadata? The model type information could just be added to the file metadata, but our approach is more flexible. It is easier to have a deeper hierarchy of classes, our approach can be applied to existing metadata stores, and it is easy to add new distinctions as the need arises. Handling multiple schemata is also easier. DataFinder translates the domain concepts specified by the user into the low-level SCEC CME metadata attribute names. Computational pathway builders can identify the logical file names of the files needed to execute a workflow that solves their geophysical problem. DataFinder also allows users to locate end data products using domain-level terms instead of program-specific and varied metadata attributes. The domain-level terms can also include hierarchies not present in the metadata itself, but rather inferred from other attributes. This interface was created to show the capabilities of DataFinder in locating velocity mesh data files. The interface allows querying by type or model as well as individual models, and abstracts multi-axial gird spacing parameters into a single value. Goeographic regions can be specified using one of two representations. Containment queries for geographic regions are also implemented. Computational Workflows and the Problem of Finding Data Files A sample SCEC CME workflow process. This is part of a “Pathway 2” computation which uses wave propagation models to compute the effects of an earthquake rupture on the ground and structures. (In contrast “Pathway 1” computations which use empirically-derived intensity measure relationships) DataFinder’s role in workflow instantiation is to locate already existing data files which would otherwise need to be computed. Irrelevant One Part The Rest DataFinder Query: Model Type (3D) DataFinder Query: Single Model (CVM3.0) IN43A-0326 2005 AGU Fall Meeting DataFinder Interface and Examples Research funded by the SCEC Community Modeling Environment Project which is funded by the National Science Foundation (NSF) under contract EAR-0122464 (The SCEC Community Modeling Environment (SCEC/CME): An Information Infrastructure for System-Level Earthquake Research) The Southern California Earthquake Center's Community Modeling Environment uses computer codes for simulation and hazard analysis computations. The process of running workflows using several computational models produces numerous intermediate and final data files. These files have descriptive metadata stored as pairs of attribute names and values. Depending on which software was used to prepare the files, different attribute names and different organizational schemes are used for the metadata. Previous search tools for this metadata repository rely on the user knowing the structure and names of the metadata attributes in order to find stored information. Matters are made even harder because sometimes the type of information in a data file must be inferred. For example, seismic hazard maps are described simply as “JPEGFile”, with the domain content of the file inferable only by looking at the workflow that produced the file. This greatly limits the ability to actually find data of interest. DataFinder uses ontologies to provide a semantic overlay for the metadata attributes that are used to index data files. A domain ontology is combined with a metadata attribute ontology to link geophysical and seismic hazard domain concepts with the metadata attributes that describe the computational products. DataFinder uses a domain ontology and additional rules expressed in first- order logic to provide this semantic enhancement. The domain and metadata attribute ontology is represented in the PowerLoom representation language. DataFinder is implemented using a hybrid reasoning approach based on combining the strengths of the PowerLoom logical reasoning engine with the database technology underlying the metadata repository to provide scalability. The PowerLoom reasoning engine allows adding semantic enhancements by overlaying the raw metadata with a hierarchy of concepts, providing more abstract views of the data collection. For example, a velocity mesh is one of the intermediate data products used in seismic wave propagation simulations. It is the output of a 3-dimensional wave propagation velocity model of the subsurface geology. This mesh can be computed using any one of several models which have different characteristics. Each velocity mesh has metadata that records the specific model used. But the models also fall into general classes such as 1-dimensional and 3- dimensional models. Such classification of models can be used to allow retrieval of velocity meshes created by any 3-dimensional model, without the user being required to specify all of the particular model names. This provides an abstraction over the attributes and values stored with the dataset. Other mappings are used to present a uniform DataFinder Queries with Regional Bounds Added Example 1 Example 2 DataFinder Velocity Mesh Interface Query by model type Query by specific model Mesh size Alternate region represen- tations Geographic volume specification Interface by Scott Callahan Internal structure: Pathway 1 has indirection via earthquake rupture file metadata Map Eqk PGA EqkRup n=492 Map file generated using forecast of multiple earthquake ruptures Map file generated using single earthquake rupture Map file generated with PGA as the intensity measure relationship Conjunction of pga- map and shake-map Image File Hierarchy with Definitions of Terms. DataFinder uses ontologies to provide a semantic overlay that enriches the raw metadata. A simple example is the class and instance structure that describes velocity models. This structure lets a model type query work even though type information is not present in the metadata attributes itself. It is inferred based on the relationships in the ontology. The metadata information is stored in the computational GRID’s Metadata Catalog System (MCS). It has a limited search function and comparison function, but it still requires working at the low-level attribute-name value level. The MCS doesn’t have as expressive a query mechanism as DataFinder’s PowerLoom engine, and it doesn’t support a rich a modeling language. Pathway 1 Seismic Hazard Computations use a different organizational approach to the metadata. Instead of copying all values forward to derived products, they use pointers to the input file metadata. interface to metadata information that is stored using different attribute names or even different organizational schemes. This abstraction layer gives DataFinder the ability to allow users to locate end data products using domain-level descriptions instead of program-specific and varied metadata attributes. The hierarchical structure of image files shows a more complicated part of the ontology, since the inference about what constitutes a hazard map is more involved. It requires combining metadata attributes that are spread across multiples files in the repository. Region Time Period Rupture Forecast Model Velocity Model Anelastic Wave Model Site Response Model Seismic Hazard Computation Rupture Forecasts Velocity Mesh Ground Motion 4-D Wave Field SA Maps

Upload: others

Post on 26-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IN43A-0326 DataFinder: Using Ontologies and Reasoning to ... · organizational schemes are used for the metadata. The low-level nature of the metadata descriptions and the variety

DataFinder: Using Ontologies and Reasoning to Enhance Metadata SearchThomas Russ and Hans Chalupsky

University of Southern California, Information Sciences Institute4676 Admiralty Way, Marina del Rey, CA 90292

[email protected], [email protected]

Manual Search is No Solution

Especially with Many Files…

Doesn’t use useful abstractions, e.g.Only specific models, not model typesThe mesh spacing is specified separately in all

three dimensions in the metadataIndividual components of the region definition

are not aggregatedLack of regional comparisons

Individual components must be found, combinedand compared separately

Alternate metadataschema used bydifferentcomputational pathways, e.g:

Example: Find a velocity mesh with Grid spacing = 125m

In region bounded by34°N, 118°W and 36°N, 117°W

Before DataFinder, the only way to find files inthe repository was through an interface thatallowed browsing the metadata associated withindividual files.

Velocity Model Classes and InstancesDataFinder is part of the Southern California EarthquakeCenter (SCEC) Community Modeling Environment’ssuite of tools for generating computational workflows andlocating the resulting data products. A workflow is aspecific series of computations that produce data filesbased on simulation models. A given workflow involvesrunning various software modules which pass data viaoften large data files.

When a workflow is being instantiated, a decision needsto be made whether to use an existing data file or togenerate the required file by running a computationalmodel. Since it is often more efficient to use existingdata, locating previously computed results can produce abetter workflow.

The files produced by SCEC computational models havedescriptive metadata stored as pairs of attribute names andvalues. These attributes describe the content andprovenance of the files and can be used to identify files ofinterest. But depending on which software was used toprepare the files, different attribute names and differentorganizational schemes are used for the metadata. Thelow-level nature of the metadata descriptions and thevariety of metadata schemata present challenges forfinding files needed for workflows which DataFinderaddresses.

In addition to its part in workflow instantiation process,DataFinder also provides a way to locate end dataproducts for users. Currently in proof-of-concept formfor locating map files, such an extension of the currentweb-based interface to DataFinder will make themaccessible to a wider range of users.

It is not easy to find data files that match the requirementsof a given workflow. The existing SCEC CME metadatabrowsing tools were not up to this task. They present themetadata on a per-file basis which just does not scale.The example of searching for a velocity mesh covering aparticular region shown at right illustrates the problemswith a browsing approach. In addition to the browsingproblems, there are additional shortcomings of using low-level attribute-value pairs:

DataFinder: Semantic Enhancement with Ontologies

Available Metadata and Current Approach

Why Not Just Extend the Metadata?The model type information could just be added to the filemetadata, but our approach is more flexible. It is easier tohave a deeper hierarchy of classes, our approach can beapplied to existing metadata stores, and it is easy to addnew distinctions as the need arises. Handling multipleschemata is also easier.

DataFinder translates the domain concepts specified bythe user into the low-level SCEC CME metadata attributenames. Computational pathway builders can identify thelogical file names of the files needed to execute aworkflow that solves their geophysical problem.

DataFinder also allows users to locate end data productsusing domain-level terms instead of program-specific andvaried metadata attributes. The domain-level terms canalso include hierarchies not present in the metadata itself,but rather inferred from other attributes.

This interface was created to show the capabilities ofDataFinder in locating velocity mesh data files. Theinterface allows querying by type or model as well asindividual models, and abstracts multi-axial gird spacingparameters into a single value.Goeographic regions can be specified using one of tworepresentations. Containment queries for geographicregions are also implemented.

Computational Workflows and the Problem of Finding Data Files

A sample SCEC CME workflow process. This is part of a “Pathway 2” computation which uses wavepropagation models to compute the effects of an earthquake rupture on the ground and structures. (Incontrast “Pathway 1” computations which use empirically-derived intensity measure relationships)DataFinder’s role in workflow instantiation is to locate already existing data files which would otherwise needto be computed.

Irrelevant

One Part The Rest

DataFinder Query: Model Type (3D)

DataFinder Query: Single Model (CVM3.0)

IN43A-0326

2005 AGU Fall Meeting

DataFinder Interface and Examples

Research funded by the SCEC Community Modeling Environment Project which is funded by the National Science Foundation (NSF) under contractEAR-0122464 (The SCEC Community Modeling Environment (SCEC/CME): An Information Infrastructure for System-Level Earthquake Research)

The Southern California Earthquake Center's CommunityModeling Environment uses computer codes forsimulation and hazard analysis computations. Theprocess of running workflows using severalcomputational models produces numerous intermediateand final data files. These files have descriptive metadatastored as pairs of attribute names and values. Dependingon which software was used to prepare the files, differentattribute names and different organizational schemes areused for the metadata. Previous search tools for thismetadata repository rely on the user knowing the structureand names of the metadata attributes in order to findstored information. Matters are made even harderbecause sometimes the type of information in a data filemust be inferred. For example, seismic hazard maps aredescribed simply as “JPEGFile”, with the domain contentof the file inferable only by looking at the workflow thatproduced the file. This greatly limits the ability toactually find data of interest.

DataFinder uses ontologies to provide a semantic overlayfor the metadata attributes that are used to index datafiles. A domain ontology is combined with a metadataattribute ontology to link geophysical and seismic hazarddomain concepts with the metadata attributes thatdescribe the computational products. DataFinder uses adomain ontology and additional rules expressed in first-order logic to provide this semantic enhancement. Thedomain and metadata attribute ontology is represented inthe PowerLoom representation language. DataFinder isimplemented using a hybrid reasoning approach based oncombining the strengths of the PowerLoom logicalreasoning engine with the database technology underlyingthe metadata repository to provide scalability.

The PowerLoom reasoning engine allows addingsemantic enhancements by overlaying the raw metadatawith a hierarchy of concepts, providing more abstractviews of the data collection. For example, a velocitymesh is one of the intermediate data products used inseismic wave propagation simulations. It is the output ofa 3-dimensional wave propagation velocity model of thesubsurface geology. This mesh can be computed usingany one of several models which have differentcharacteristics. Each velocity mesh has metadata thatrecords the specific model used. But the models also fallinto general classes such as 1-dimensional and 3-dimensional models. Such classification of models can beused to allow retrieval of velocity meshes created by any3-dimensional model, without the user being required tospecify all of the particular model names. This providesan abstraction over the attributes and values stored withthe dataset. Other mappings are used to present a uniform

DataFinder Queries with RegionalBounds Added

Example 1

Example 2

DataFinder Velocity Mesh Interface

Query by model type

Query byspecificmodel

Mesh size

Alternateregionrepresen-tations

Geographicvolumespecification

Interface by Scott Callahan

Internal structure:Pathway 1 has indirection viaearthquake rupture file metadata

MapEqkPGA

EqkRup

n=492

Map file generated usingforecast of multipleearthquake ruptures

Map file generated usingsingle earthquake rupture

Map file generated withPGA as the intensitymeasure relationship

Conjunction of pga-map and shake-map

Image File Hierarchy with Definitions of Terms.

DataFinder uses ontologies to provide a semanticoverlay that enriches the raw metadata. A simpleexample is the class and instance structure thatdescribes velocity models. This structure lets amodel type query work even though type informationis not present in the metadata attributes itself. It isinferred based on the relationships in the ontology.

The metadata information is stored in the computationalGRID’s Metadata Catalog System (MCS). It has alimited search function and comparison function, but itstill requires working at the low-level attribute-namevalue level. The MCS doesn’t have as expressive a querymechanism as DataFinder’s PowerLoom engine, and itdoesn’t support a rich a modeling language.

Pathway 1SeismicHazardComputationsuse a differentorganizational approach to the metadata. Instead ofcopying all values forward to derived products, theyuse pointers to the input file metadata.

interface to metadata information that is stored usingdifferent attribute names or even different organizationalschemes. This abstraction layer gives DataFinder theability to allow users to locate end data products usingdomain-level descriptions instead of program-specific andvaried metadata attributes.

The hierarchical structure of image files showsa more complicated part of the ontology, sincethe inference about what constitutes a hazardmap is more involved. It requires combiningmetadata attributes that are spread acrossmultiples files in the repository.

Region

TimePeriod

RuptureForecastModel

VelocityModel

AnelasticWaveModel

SiteResponse

Model

SeismicHazard

Computation

RuptureForecasts

VelocityMesh

GroundMotion

4-DWaveField

SA Maps