
Garching, June 2008

SimDAP

Simulation Data Access Protocol

Claudio Gheller, CINECA ([email protected])


SimDAP in a nutshell

Simulation Data Access Protocol, hereafter SimDAP, defines a standard to access numerical simulation outputs (theoretical data), hereafter Snapshots.

The goal of the SimDAP protocol is to preview and retrieve data found in a previous search phase.

Since data could be huge, the SimDAP service can provide solutions to download ONLY the data of interest, reducing the communicated data volume.

The SimDAP protocol describes the interface to the “data shrink” services

The result of any SimDAP operation is a reference to one or more data files


SimDAP examples

Search for simulations with Lambda>0.7

I like this one

It’s too large !!!

Let’s select a sub-region!!!

Metadata VOTable

Binary data file

Data is too large!!!

Extract a sub region… it is still large

Perform the analysis on-site

Finally I have a jpeg… cannot be too large!!!


SimDAP target data

SimDAP deals with Snapshots

Generally speaking, Snapshots are RAW data produced by a numerical model

In principle any set of M physical parameters in a N-dimensional phase space can be the object of SimDAP

For simplicity, we have started by considering data which represent a spatial distribution of physical quantities at different time steps. Therefore support for space and time is assumed by default.

E.g.:

o x, y and z coordinates of a set of particles at various evolutionary times

o The temperature over a computational mesh

o The X-ray luminosity derived from temperature and density (direct outcome of the simulations)


SimDAP data model

SimDAP adopts SimDB as the standard data model.

SimDB is essential in the discovery phase (not part of SimDAP), which provides the basic input parameters to SimDAP

o The experiment id (the simulation)

o The result id (the snapshots)

o The data provider id (as registered)

The result of the SimDAP operation is a reference to one or more files.

The result may be delivered as a VOTable describing, in terms of SimDB, the outcome of the SimDAP operation and containing the references to the data files

There is presently no precise standard for the data files. Exploratory work is in progress on this topic.


SimDAP protocol

SimDAP does NOT specify anything about the implementation of the related services. This is up to the service provider.

SimDAP defines only the (standard) interface to the (web) service. This means that the following items will be standard:

o Service goal (what it does)

o Input parameters (what is needed to run the service)

o Results (what is returned by the service)

Custom services are supported, BUT they must be fully described, possibly via registry


SimDAP services

At present the following services are expected to be part of the protocol:

o Preview

o Download

o Cutout

o Custom

Each service MUST support a METADATA function which returns the input parameters supported by the service.

This information can be used either by client applications (in particular for custom services) or by the registry, for users searching for services according to their capabilities.


Preview: goals

The result of the SimDB search is a list of simulations and/or snapshots.

There is NO easy and/or standard way to understand if the content of the snapshot is fine for you.

However, you cannot download all the hits to check them.

The PREVIEW service allows you to have a pre-defined view of one or more snapshots.

Possible preview services can be based on:

o Selection and download of a subset of the whole snapshot (randomized, decimated…)

o visualization of the data, by 3D interactive rendering of sampled data, or orthogonal projections,

o statistical analysis

o Object catalogues (e.g. cluster of galaxies identified in a cosmological simulation)

o …

All these functionalities could act on precomputed information or interactively.
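The first preview option above (a decimated or randomized subset) can be illustrated with a minimal Python sketch. The function names and the in-memory particle list are assumptions made for illustration only; SimDAP does not prescribe how a provider implements the selection.

```python
import random


def decimate(points, step):
    """Regular decimation: keep every `step`-th element."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return points[::step]


def random_subset(points, count, seed=None):
    """Randomized selection: pick `count` elements without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the preview reproducible
    return rng.sample(points, min(count, len(points)))


# Hypothetical snapshot: 1000 particles with (x, y, z) coordinates.
particles = [(float(i), float(i) * 2.0, 0.0) for i in range(1000)]

preview_regular = decimate(particles, 100)               # every 100th particle
preview_random = random_subset(particles, 10, seed=42)   # 10 random particles
```

Either subset is small enough to be shipped to the client as a quick look at the snapshot content.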


Preview: examples


Preview: input parameters, result

The only mandatory input parameters are:

o Simulation id

o Snapshot id

Further parameters can be specified and published by the service. They allow the user to customize the preview service.

If multiple preview functionalities are implemented, each is treated as a separate service.

The output is heterogeneous. If it is a file (decimated/reduced dataset), it must have the standard TVO format (VOTable+binary).


Download: goals

Once the snapshots of interest have been identified, the user can decide to download them.

Two possible solutions:

o Direct download: the user gets the data file as it is. This is part of the SimDB protocol. No further actions are required on the data.

o SimDAP download: the user gets back the snapshot in the standard TVO file format (VOTable+binary). A further operation may be supported and applied: field selection. This operation allows users to download only the physical quantities they are actually interested in.


Download: input parameters, result

If only the direct download is available, the reference to the file is enough. However this is not strictly part of the SimDAP protocol.

The only mandatory input parameters are:

o Simulation id

o Snapshot id

The FIELD parameter has to be supported if the fields selection is available

Further parameters can be specified and published by the service. They allow the user to customize the download service (e.g. automatic format or endianness conversions).

The output is always a file. The expected format is the TVO format (VOTable+binary), unless explicitly specified.
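As a rough illustration of a SimDAP download request with field selection, the sketch below builds a query URL in Python. Only FIELD is named in these slides; the endpoint URL, the parameter names SIMULATION_ID and SNAPSHOT_ID, and the comma-separated encoding of several fields are assumptions, not part of the protocol as presented.

```python
from urllib.parse import urlencode


def download_url(base_url, simulation_id, snapshot_id, fields=None):
    """Build a download query; FIELD is added only when fields are selected."""
    params = {
        "SIMULATION_ID": simulation_id,  # hypothetical parameter name
        "SNAPSHOT_ID": snapshot_id,      # hypothetical parameter name
    }
    if fields:
        # Assumption: several fields are sent as one comma-separated value.
        params["FIELD"] = ",".join(fields)
    return base_url + "?" + urlencode(params)


# Hypothetical service endpoint and identifiers.
url = download_url(
    "http://example.org/simdap/download", "sim42", "snap007", ["vx", "vy"]
)
```

Without the `fields` argument the same call requests the whole snapshot.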


Cutout: goals

o Data could be too large to be moved from the server.

o The user could be interested only in a small fraction of the data

The cutout service lets the user focus on a region of interest, extracting the corresponding data and downloading the resulting file.

In principle the cutout could be of any shape. For simplicity, SimDAP only deals with 3D rectangular selections, identified by a 3D point (a vertex or the center of the selection region) and the size of the selection box in the 3 coordinate directions.

The cutout operation can be completely different depending on the kind of data: regular meshes, AMR, point-like/unstructured data.
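For point-like data, the rectangular selection described above can be sketched in a few lines of Python. The function names are illustrative, and treating the box as half-open (lower bound included, upper bound excluded) is an assumption of this sketch, not something the protocol specifies.

```python
def box_bounds(point, size, point_is_center=False):
    """Return (lows, highs) of the box given a vertex or a center point."""
    lows, highs = [], []
    for p, s in zip(point, size):
        if s <= 0:
            raise ValueError("box sizes must be positive")
        low = p - s / 2.0 if point_is_center else p
        lows.append(low)
        highs.append(low + s)
    return tuple(lows), tuple(highs)


def cutout(particles, point, size, point_is_center=False):
    """Keep the particles whose (x, y, z) fall inside the selection box."""
    lows, highs = box_bounds(point, size, point_is_center)
    return [
        p for p in particles
        if all(lo <= c < hi for c, lo, hi in zip(p, lows, highs))
    ]


# Three hypothetical particles, selected first by vertex, then by center.
particles = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 2.5, 2.5)]
by_vertex = cutout(particles, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
by_center = cutout(particles, (1.0, 0.5, 0.5), (2.0, 1.0, 1.0), point_is_center=True)
```

Meshes and AMR data would need a different selection routine (index ranges or cell overlap tests), which is why the cutout differs by data type.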



Cutout: input parameters, result

The Cutout service requires different classes of inputs

Source parameters

o Simulation id

o Snapshot id

Physical quantities selection parameters:

o FIELD

Cutout fields parameters and corresponding units:

o COORD_X, COORD_Y, COORD_Z

o UNITS

Selection region parameters:

o VERT_X, VERT_Y, VERT_Z

o SIZE_X, SIZE_Y, SIZE_Z

Further parameters can be specified and published by the service. They allow the user to customize the cutout service.
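The four classes of inputs can be grouped in a small Python structure that flattens into the parameter names listed above. This is only a sketch: the SIMULATION_ID and SNAPSHOT_ID names, the reading of COORD_X/Y/Z as the names of the fields used as coordinates, and the comma-separated FIELD value are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CutoutRequest:
    simulation_id: str    # source parameters
    snapshot_id: str
    fields: tuple         # physical quantities to extract (FIELD)
    coord_fields: tuple   # fields used as x, y, z coordinates (assumed meaning)
    units: str            # units of the vertex and size values
    vertex: tuple         # VERT_X, VERT_Y, VERT_Z
    size: tuple           # SIZE_X, SIZE_Y, SIZE_Z

    def to_params(self):
        """Flatten the request into a name -> value dictionary."""
        if any(s <= 0 for s in self.size):
            raise ValueError("box sizes must be positive")
        params = {
            "SIMULATION_ID": self.simulation_id,  # hypothetical name
            "SNAPSHOT_ID": self.snapshot_id,      # hypothetical name
            "FIELD": ",".join(self.fields),
            "UNITS": self.units,
        }
        for axis, coord, vert, size in zip(
            "XYZ", self.coord_fields, self.vertex, self.size
        ):
            params[f"COORD_{axis}"] = coord
            params[f"VERT_{axis}"] = str(vert)
            params[f"SIZE_{axis}"] = str(size)
        return params


cutout_request = CutoutRequest(
    "sim42", "snap007", ("temperature",), ("x", "y", "z"),
    "Mpc", (10.0, 20.0, 30.0), (5.0, 5.0, 5.0),
)
params = cutout_request.to_params()
```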


The UNITS problem

The Cutout function requires the knowledge of the cutout units…

Example:

The user needs to extract all the data inside a simulated volume of a cosmological simulation. He wants to use “natural” units to identify the vertex position and the box size: Mpc

However data could be stored in different units (e.g. kpc or cm!!!).

In order to make the cutout possible two basic operations MUST be accomplished:

1. The server MUST “send” the units to the client (or conversion factors to some “natural” units);

2. The client, using the units, MUST convert the input parameters.
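The two steps can be sketched as follows in Python, assuming the server has already told the client that the data are stored in kpc. The table of supported units and the function name are illustrative; the Mpc-to-cm factor is the standard value of the megaparsec in centimetres.

```python
# How many of each unit make up one megaparsec.
UNITS_PER_MPC = {
    "Mpc": 1.0,
    "kpc": 1000.0,
    "cm": 3.0856775814913673e24,
}


def convert(value, from_unit, to_unit):
    """Convert a length between two of the units in UNITS_PER_MPC."""
    return value / UNITS_PER_MPC[from_unit] * UNITS_PER_MPC[to_unit]


# Step 1 (assumed already done): the server reports its storage units.
server_unit = "kpc"

# Step 2: the client converts the user's "natural" Mpc values.
vertex_mpc = (10.0, 20.0, 30.0)
size_mpc = (5.0, 5.0, 5.0)
vertex_server = tuple(convert(v, "Mpc", server_unit) for v in vertex_mpc)
size_server = tuple(convert(s, "Mpc", server_unit) for s in size_mpc)
```

The converted values are what would then be sent as VERT_X/Y/Z and SIZE_X/Y/Z.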


Cutout tools

The Cutout function requires proper tools to select the region of interest.

The tools can be the same as (or derived from) those used for the preview.

An example using VisIVO…


Cutout results

The Cutout result is DATA.

The result data is characterized by raw data and metadata.

The latter are organized as a VOTable (in general, an XML file).

The VOTable describes the data and contains the acref parameter(s) to one (or more) file(s) containing the raw data.

The raw data may not be immediately available (access to secondary storage devices, CPU-demanding operations…). In this case DATA STAGING is necessary.
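A client could pull the file references out of such a VOTable with Python's standard XML parser, as in the sketch below. The embedded document is a trimmed, namespace-free sample modelled on the examples in these slides; accepting either an acref or an href attribute on STREAM is an assumption made so that both example styles are covered.

```python
import xml.etree.ElementTree as ET

# Trimmed sample result; real VOTables usually carry an XML namespace.
VOTABLE = """<VOTABLE>
<RESOURCE name="myVectorField" type="results">
<INFO name="QUERY_STATUS" value="OK"/>
<TABLE name="VelocityField" ID="Vel">
<FIELD name="vx" ID="vx1" datatype="float" unit="km/s"/>
<DATA><BINARY><STREAM acref="file:///scratch/myhome/test.bin"/></BINARY></DATA>
</TABLE>
</RESOURCE>
</VOTABLE>"""


def data_references(votable_text):
    """Map each TABLE name to the list of raw data files it points to."""
    root = ET.fromstring(votable_text)
    status = root.find(".//INFO[@name='QUERY_STATUS']")
    if status is not None and status.get("value") != "OK":
        raise RuntimeError("service reported a failed query")
    references = {}
    for table in root.iter("TABLE"):
        links = [s.get("acref") or s.get("href") for s in table.iter("STREAM")]
        references[table.get("name")] = [link for link in links if link]
    return references


refs = data_references(VOTABLE)
```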


Custom Services and Service Registration

Custom services are supported. In this case the complete description of the service must be available as a registry entry

In general the SimDAP service is to be registered. This means:

o Publish information about the service name and owner

o Publish the URL of the service

o Publish the available services (preview, download, cutout, custom…)


VOTables example: 1

VOTable for the velocity field of a fluid on a fixed 3D mesh

<VOTABLE>

<RESOURCE name="myVectorField" type="results" >

<DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>

<INFO name="QUERY_STATUS" value="OK"/>

<TABLE name="VelocityField" ID="Vel" order="sequential" arraysize="41x41x41" >

<FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float" unit="km/s" />

<FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float" unit="km/s"/>

<FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float" unit="km/s"/>

<DATA>

<BINARY>

<STREAM acref="file:///scratch/myhome/test.bin"/>

</BINARY>

</DATA>

</TABLE>

</RESOURCE>

</VOTABLE>


VOTables example 2

VOTable for the temperature field of a mesh-based quantity and the positions of N-Body particles extracted from the same spatial region.

<VOTABLE>
<RESOURCE name="myMixedData" type="results">
<INFO name="QUERY_STATUS" value="OK"/>
<TABLE name="Particles" ID="NBody" order="sequential" arraysize="100000">
<FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x" datatype="float" unit="Mpc"/>
<FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y" datatype="float" unit="Mpc"/>
<FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z" datatype="float" unit="Mpc"/>
<DATA>
<BINARY>
<STREAM href="file:///scratch/myhome/particles.bin"/>
</BINARY>
</DATA>
</TABLE>
<TABLE name="Mesh" ID="MeshTemp" order="sequential" arraysize="41x41x41">
<FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian" datatype="float" unit="K"/>
<DATA>
<BINARY>
<STREAM href="file:///scratch/myhome/mesh.bin"/>
</BINARY>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>


Raw data file formats

Data file formats can be different according to their usage

Archive side files should be

• High performance (fast access)

• Standard (portable and persistent)

Result files should be

• Simple (specific I/O libraries are not required to access them)

• Self descriptive (e.g. XML metadata headers)

• Compressed (to minimize transfer effort)

In any case, data size is crucial. ASCII files are “deprecated”. Base64 (or similar) encoding for HTTP transfers should be avoided: it is a waste of time (for conversions) and of “space” (increased size).


Result files

A simple solution is represented by raw binary files with the following characteristics.

• Several variables can be stored in one file

• Each variable represents a scalar quantity

• Components of multidimensional quantities are stored as separate variables

• Variables have the same number of elements but they can have different types

• Variables can be stored either as Tabular or as Sequential (see next slide)

• A descriptor file (XML) is associated to the binary to make it self-descriptive

Advantages: (little) standardization, simplicity, no I/O specific libraries required, fast access

Drawbacks: limited portability (endianness problem, data types), little standardization, no compression
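The endianness drawback is easy to demonstrate with Python's struct module: the same float has different byte sequences on little- and big-endian machines, so the reader must know which one was used. Recording the byte order in the descriptor file, as the `byte_order` argument below suggests, is an assumption of this sketch rather than something these slides define.

```python
import struct

value = 1.5
little = struct.pack("<f", value)  # 4-byte float, little-endian
big = struct.pack(">f", value)     # same float, big-endian


def read_floats(raw, count, byte_order):
    """Decode `count` 4-byte floats using the byte order from the descriptor."""
    prefix = {"little": "<", "big": ">"}[byte_order]
    return struct.unpack(f"{prefix}{count}f", raw)


# Correct byte order recovers the value; the wrong one silently gives garbage.
right = read_floats(big, 1, "big")
wrong = read_floats(little, 1, "big")
```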


Result files: Tabular vs Sequential

Tabular files are closer to observational data, and so more compatible with the standard VOTable idea.

If the file contains the 3 variables vx, vy, vz, their Tabular storage is:

vx(1), vy(1), vz(1)

vx(2), vy(2), vz(2)

…

vx(N), vy(N), vz(N)

This is suitable for variables (like the components of a vector) which are always accessed as an N-tuple, or for data analysis tools which need (and load) all the stored variables.

However, it leads to poor performance if variables have to be loaded into memory separately: loading one variable requires continuous jumps through the file.
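A small Python sketch of the Tabular layout, assuming all three variables are little-endian 4-byte floats (an assumption; the slides allow variables of different types). Whole rows come out naturally, while extracting a single variable means visiting every record.

```python
import struct

RECORD = struct.Struct("<3f")  # one row: vx, vy, vz as little-endian floats
ITEM = struct.Struct("<f")     # one value of one variable


def write_tabular(rows):
    """Interleave the variables: vx(1), vy(1), vz(1), vx(2), ..."""
    return b"".join(RECORD.pack(*row) for row in rows)


def read_rows(raw):
    """Read every (vx, vy, vz) tuple: the access pattern Tabular favours."""
    return list(RECORD.iter_unpack(raw))


def read_column(raw, index):
    """Read one variable only: one small strided access per record."""
    return [
        ITEM.unpack_from(raw, offset + index * ITEM.size)[0]
        for offset in range(0, len(raw), RECORD.size)
    ]


rows = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
raw = write_tabular(rows)
all_rows = read_rows(raw)
vy = read_column(raw, 1)
```

On a file instead of an in-memory buffer, each step of `read_column` would become a seek, which is the "continuous jumps" cost mentioned above.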


Result files: Tabular vs Sequential (cont.)

Sequential files are a common choice for “simulators”

If the file contains the 2 variables rho and press, their Sequential storage is:

rho(1)

rho(2)

…

rho(N)

press(1)

press(2)

…

press(N)

Each variable can be read with a single I/O call. This leads to high performance access to the file. This is typically required dealing with large files.
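The single-call read can be sketched in Python as one seek followed by one read. The sketch assumes every variable is stored as little-endian 4-byte floats with the same number of elements, so the offset of a variable is simply its position times N times 4 bytes; an in-memory stream stands in for the file.

```python
import io
import struct

FLOAT_SIZE = 4  # bytes per value, assuming 4-byte floats throughout


def write_sequential(variables):
    """Store each variable as one contiguous block, one after the other."""
    return b"".join(
        struct.pack(f"<{len(values)}f", *values) for values in variables.values()
    )


def read_variable(stream, position, count):
    """Read the variable at 0-based `position` with a single read call."""
    stream.seek(position * count * FLOAT_SIZE)
    raw = stream.read(count * FLOAT_SIZE)
    return list(struct.unpack(f"<{count}f", raw))


data = {"rho": [1.0, 2.0, 4.0], "press": [0.5, 0.25, 0.125]}
stream = io.BytesIO(write_sequential(data))
press = read_variable(stream, 1, 3)  # second variable, N = 3
```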


Archive files

Archive files are not “visible” to the end user. Therefore the data provider can choose any suitable format.

The choice should be in general driven by several properties:

• The format should be standard and well supported, in order to ensure the preservation of the data, their portability between different computing platforms, software, compilers... (if the technology changes we don’t want to change the data)

• The files should be quickly and efficiently accessible, since data is large and complex operations could be necessary to handle it (e.g. extracting the particles which fall in a certain region)

Various formats, with such features, are available.


File formats: HDF5

HDF5 (http://hdf.ncsa.uiuc.edu) is a possible solution for dealing with such data.

HDF5 is

• Portable between most modern platforms

• High performance

• Well supported

• Well documented

• Rich in tools

• Flexible and extensible

HDF5 data files are

• Platform independent (portable)

• Well organized

• Self defined

• Metadata enriched

• Efficiently accessible

HDF5 drawbacks

• Requires some expertise and skill to be used

• Information is difficult to access

• Can be subject to major library changes (see HDF4 to HDF5)


VisIVO Server Services for TVO

TVO archive

Visualization Web Services

Customizable data view


Visualization Web Service

VisIVO Web Service has been realized using the SOAP engine AXIS.

VisIVO Server

You can write a client application using JAVA or C++

The ITVO web portal is a client application

The service implements a data staging mechanism for the VisIVO Server outcomes (.png files).


Developer guidelines: web services

The ITVO web portal describes the web service classes using Class Diagrams and publishing the JAVA code

ITVO Web Services are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License V3


Developer guidelines: client side

The ITVO web portal includes some Java clients that are easy to use and to include in your application.

ITVO Web Services and clients are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License V3
