in43a-0326 datafinder: using ontologies and reasoning to ... · organizational schemes are used for...

Post on 26-Jun-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

DataFinder: Using Ontologies and Reasoning to Enhance Metadata SearchThomas Russ and Hans Chalupsky

University of Southern California, Information Sciences Institute4676 Admiralty Way, Marina del Rey, CA 90292

tar@isi.edu, hans@isi.edu

Manual Search is No Solution

Especially with Many Files…

Doesn’t use useful abstractions, e.g.Only specific models, not model typesThe mesh spacing is specified separately in all

three dimensions in the metadataIndividual components of the region definition

are not aggregatedLack of regional comparisons

Individual components must be found, combinedand compared separately

Alternate metadataschema used bydifferentcomputational pathways, e.g:

Example: Find a velocity mesh with Grid spacing = 125m

In region bounded by34°N, 118°W and 36°N, 117°W

Before DataFinder, the only way to find files inthe repository was through an interface thatallowed browsing the metadata associated withindividual files.

Velocity Model Classes and InstancesDataFinder is part of the Southern California EarthquakeCenter (SCEC) Community Modeling Environment’ssuite of tools for generating computational workflows andlocating the resulting data products. A workflow is aspecific series of computations that produce data filesbased on simulation models. A given workflow involvesrunning various software modules which pass data viaoften large data files.

When a workflow is being instantiated, a decision needsto be made whether to use an existing data file or togenerate the required file by running a computationalmodel. Since it is often more efficient to use existingdata, locating previously computed results can produce abetter workflow.

The files produced by SCEC computational models havedescriptive metadata stored as pairs of attribute names andvalues. These attributes describe the content andprovenance of the files and can be used to identify files ofinterest. But depending on which software was used toprepare the files, different attribute names and differentorganizational schemes are used for the metadata. Thelow-level nature of the metadata descriptions and thevariety of metadata schemata present challenges forfinding files needed for workflows which DataFinderaddresses.

In addition to its part in workflow instantiation process,DataFinder also provides a way to locate end dataproducts for users. Currently in proof-of-concept formfor locating map files, such an extension of the currentweb-based interface to DataFinder will make themaccessible to a wider range of users.

It is not easy to find data files that match the requirementsof a given workflow. The existing SCEC CME metadatabrowsing tools were not up to this task. They present themetadata on a per-file basis which just does not scale.The example of searching for a velocity mesh covering aparticular region shown at right illustrates the problemswith a browsing approach. In addition to the browsingproblems, there are additional shortcomings of using low-level attribute-value pairs:

DataFinder: Semantic Enhancement with Ontologies

Available Metadata and Current Approach

Why Not Just Extend the Metadata?The model type information could just be added to the filemetadata, but our approach is more flexible. It is easier tohave a deeper hierarchy of classes, our approach can beapplied to existing metadata stores, and it is easy to addnew distinctions as the need arises. Handling multipleschemata is also easier.

DataFinder translates the domain concepts specified bythe user into the low-level SCEC CME metadata attributenames. Computational pathway builders can identify thelogical file names of the files needed to execute aworkflow that solves their geophysical problem.

DataFinder also allows users to locate end data productsusing domain-level terms instead of program-specific andvaried metadata attributes. The domain-level terms canalso include hierarchies not present in the metadata itself,but rather inferred from other attributes.

This interface was created to show the capabilities ofDataFinder in locating velocity mesh data files. Theinterface allows querying by type or model as well asindividual models, and abstracts multi-axial gird spacingparameters into a single value.Goeographic regions can be specified using one of tworepresentations. Containment queries for geographicregions are also implemented.

Computational Workflows and the Problem of Finding Data Files

A sample SCEC CME workflow process. This is part of a “Pathway 2” computation which uses wavepropagation models to compute the effects of an earthquake rupture on the ground and structures. (Incontrast “Pathway 1” computations which use empirically-derived intensity measure relationships)DataFinder’s role in workflow instantiation is to locate already existing data files which would otherwise needto be computed.

Irrelevant

One Part The Rest

DataFinder Query: Model Type (3D)

DataFinder Query: Single Model (CVM3.0)

IN43A-0326

2005 AGU Fall Meeting

DataFinder Interface and Examples

Research funded by the SCEC Community Modeling Environment Project which is funded by the National Science Foundation (NSF) under contractEAR-0122464 (The SCEC Community Modeling Environment (SCEC/CME): An Information Infrastructure for System-Level Earthquake Research)

The Southern California Earthquake Center's CommunityModeling Environment uses computer codes forsimulation and hazard analysis computations. Theprocess of running workflows using severalcomputational models produces numerous intermediateand final data files. These files have descriptive metadatastored as pairs of attribute names and values. Dependingon which software was used to prepare the files, differentattribute names and different organizational schemes areused for the metadata. Previous search tools for thismetadata repository rely on the user knowing the structureand names of the metadata attributes in order to findstored information. Matters are made even harderbecause sometimes the type of information in a data filemust be inferred. For example, seismic hazard maps aredescribed simply as “JPEGFile”, with the domain contentof the file inferable only by looking at the workflow thatproduced the file. This greatly limits the ability toactually find data of interest.

DataFinder uses ontologies to provide a semantic overlayfor the metadata attributes that are used to index datafiles. A domain ontology is combined with a metadataattribute ontology to link geophysical and seismic hazarddomain concepts with the metadata attributes thatdescribe the computational products. DataFinder uses adomain ontology and additional rules expressed in first-order logic to provide this semantic enhancement. Thedomain and metadata attribute ontology is represented inthe PowerLoom representation language. DataFinder isimplemented using a hybrid reasoning approach based oncombining the strengths of the PowerLoom logicalreasoning engine with the database technology underlyingthe metadata repository to provide scalability.

The PowerLoom reasoning engine allows addingsemantic enhancements by overlaying the raw metadatawith a hierarchy of concepts, providing more abstractviews of the data collection. For example, a velocitymesh is one of the intermediate data products used inseismic wave propagation simulations. It is the output ofa 3-dimensional wave propagation velocity model of thesubsurface geology. This mesh can be computed usingany one of several models which have differentcharacteristics. Each velocity mesh has metadata thatrecords the specific model used. But the models also fallinto general classes such as 1-dimensional and 3-dimensional models. Such classification of models can beused to allow retrieval of velocity meshes created by any3-dimensional model, without the user being required tospecify all of the particular model names. This providesan abstraction over the attributes and values stored withthe dataset. Other mappings are used to present a uniform

DataFinder Queries with RegionalBounds Added

Example 1

Example 2

DataFinder Velocity Mesh Interface

Query by model type

Query byspecificmodel

Mesh size

Alternateregionrepresen-tations

Geographicvolumespecification

Interface by Scott Callahan

Internal structure:Pathway 1 has indirection viaearthquake rupture file metadata

MapEqkPGA

EqkRup

n=492

Map file generated usingforecast of multipleearthquake ruptures

Map file generated usingsingle earthquake rupture

Map file generated withPGA as the intensitymeasure relationship

Conjunction of pga-map and shake-map

Image File Hierarchy with Definitions of Terms.

DataFinder uses ontologies to provide a semanticoverlay that enriches the raw metadata. A simpleexample is the class and instance structure thatdescribes velocity models. This structure lets amodel type query work even though type informationis not present in the metadata attributes itself. It isinferred based on the relationships in the ontology.

The metadata information is stored in the computationalGRID’s Metadata Catalog System (MCS). It has alimited search function and comparison function, but itstill requires working at the low-level attribute-namevalue level. The MCS doesn’t have as expressive a querymechanism as DataFinder’s PowerLoom engine, and itdoesn’t support a rich a modeling language.

Pathway 1SeismicHazardComputationsuse a differentorganizational approach to the metadata. Instead ofcopying all values forward to derived products, theyuse pointers to the input file metadata.

interface to metadata information that is stored usingdifferent attribute names or even different organizationalschemes. This abstraction layer gives DataFinder theability to allow users to locate end data products usingdomain-level descriptions instead of program-specific andvaried metadata attributes.

The hierarchical structure of image files showsa more complicated part of the ontology, sincethe inference about what constitutes a hazardmap is more involved. It requires combiningmetadata attributes that are spread acrossmultiples files in the repository.

Region

TimePeriod

RuptureForecastModel

VelocityModel

AnelasticWaveModel

SiteResponse

Model

SeismicHazard

Computation

RuptureForecasts

VelocityMesh

GroundMotion

4-DWaveField

SA Maps

top related