

Semantic Integration of Geospatial Data from Earth Observations through Topological Relations

Helbert Arenas, Nathalie Aussenac-Gilles, Catherine Comparot, and Cassia Trojahn

Institut de Recherche en Informatique de Toulouse, Toulouse, France
{prenom.nom}@irit.fr

Abstract. Earth observation is a rapidly evolving domain. Recently launched satellites, which deliver between 8 and 10 TB of image data per day, open emerging opportunities in domains ranging from environmental monitoring to urban planning and climate studies. However, domain-oriented applications require raw image metadata to be enriched with data coming from various sources (either static or dynamic), in order to support decision-making processes related to the observed areas. One of the challenges to be addressed concerns the integration of heterogeneous data that rely heavily on spatio-temporal representations. This paper presents a semantic approach to integrate data with the aim of enriching metadata of satellite imagery with various open data sets that are relevant to describe Earth Observations for a particular need. We propose a semantic vocabulary that specializes standards (like SOSA, GeoSPARQL) as well as a process, based on spatial and temporal features, to select, map and integrate heterogeneous geo-spatial data sets. This process relies on image tiles to handle data with a fixed spatial component, while the temporal relationships are calculated on the fly based on temporal topology.

1 Introduction

Earth Observation (EO) provides added value to a wide variety of areas. Recently, the European Space Agency (ESA) has launched the Sentinel program, with two types of satellites, Sentinel-1 and Sentinel-2, already providing high quality images (estimated at between 8 and 10 TB of data daily). They provide images of Earth captured with different technologies and available for free. The availability of these data opens up many economic opportunities through new applications in fields as diverse as agriculture, environment, urban planning, oceanography and climatology. These business applications, however, have a strong need to couple these images with data on the observed areas. These data come from various measurement sensors. They are available from different sources with heterogeneous formats and distinct temporal features: they may be either static, like soil data, or dynamic, like weather observations. They can be useful for instance to indicate that an image contains a region affected by a natural phenomenon such as an earthquake or heat wave, and may be used for deciding what to do in this area or for longer-term analyses. Moreover, by exploiting the spatio-temporal characteristics of a phenomenon (its spatial imprint and its date), it becomes possible to know whether a geo-located entity within the footprint of this image (e.g. a city) has undergone the same phenomenon. In this context, because the images are already described by satellite metadata, one of the challenges is the integration of heterogeneous data from various sources with the image metadata. Previous works have already demonstrated the gain brought by semantic technologies to facilitate this task [21], [23].

In line with recent work on Ontology-based Data Access (OBDA) and Data Integration (OBDI) [16] [15] [8], we present a semantic approach to integrate data with the aim of enriching metadata of satellite imagery with data from various sources that provide EOs for a particular need. OBDA requires defining a semantic vocabulary that enables a homogeneous data representation and querying, and writing mapping rules or algorithms to populate the model with data from the heterogeneous sources. In the particular case of EO data, an important fact is that data of diverse origins can be related through spatio-temporal topological relationships. Data come from geo-spatial data sets with heterogeneous formats (shapefile, KML, CSV, GeoJSON, TIFF). The data integration process needs to properly manage the spatial and temporal properties and relationships. To avoid duplicating static data that would tag all the images of the same area over time, the notion of tile defined by ESA is very convenient: the Earth surface is associated with a grid in which a tile represents a fixed area on the Earth surface.

In this paper we present a framework where diverse geographical information and metadata of EO images are semantically integrated. First, we propose a straightforward vocabulary that allows the semantic and homogeneous description of geo-spatial data as well as the metadata of satellite images as entities with spatial and temporal properties. A subset of the geo-spatial data to be integrated with the image metadata is contextual information measured on Earth, so that it can be considered as sensor data. Thus this vocabulary specializes well-known LOD vocabularies, in particular SOSA¹ and GeoSPARQL [13].

As a second contribution, we defined an integration process that is based on the topology of entities and Linked Data principles. The diversity of data sources raises different heterogeneity issues. For each data set to be integrated, we defined mapping templates and functions. Temporal properties and relations contribute to integrating dynamic data. To handle the spatial component of static and dynamic data, the process relies on image tiles. As a side effect, using tiles helps the process scale by reducing the amount of data to be handled. Last but not least, the integration process produces various triple-stores and JSON files of EOs and measurement data that can be reused for other purposes. For instance, we have generated JSON files and RDF triples² that connect the land cover information to the ESA tiles.

We illustrate our approach through a case study exploiting Sentinel-2 image metadata and contextual data (weather report data, Earth land cover, agriculture reports, etc.). For this study, the semantic representation of the image metadata provided by the CNES (French National Center for Space Studies) is linked to data overlapping the image footprint and with a similar capture time, in particular meteorological data coming from Meteo France, Land Cover and Administrative Units.

1 https://www.w3.org/2015/spatial/wiki/SOSA_Ontology
2 Data will soon be available on line at http://sparkindata.irit.fr

The rest of this paper is organized as follows. Section 2 discusses the main related work. Section 3 overviews our approach. Section 4 presents the proposed vocabulary and Section 5 details the data selection, alignment and integration processes. Finally, Section 6 concludes the paper and presents future work.

2 Related Work

2.1 Ontology-based Data Integration

Ontology-based Data Integration (OBDI) is one of the topics of Ontology-based Data Management (OBDM), which aims at accessing and using data by means of an ontology [16]. The ontology is a means to standardize data access from heterogeneous sources, and to take advantage of formal semantics for easier data management, consistency checking or flaw identification [8].

According to this computing paradigm, data access is realized through a three-level architecture, constituted by an ontology O, a set of data sources S, characterized by their schemas, and M, the set of mappings between the two. These mappings may be used either to build a knowledge graph from the data, i.e. to design a semantic representation of the data (an ABox) using the classes and properties defined in the ontology; or on the fly, to rewrite ontology-based SPARQL queries into SQL queries that search the database and retrieve a small set of data. In both cases, directly rewriting queries avoids rewriting a full data set into RDF to make it accessible in semantic applications; only the required data are represented in RDF.

The second approach is a simplification of the first one: it relies on algorithms rather than mappings to rewrite queries. For instance, the REQUIEM algorithm by [19] applies on DL1 description logic models; the SPARQL-generate algorithm matches data files of any format to RDF graphs [15]. Due to the cost of storing and maintaining linked data, many data sets are not made available in the LOD. A solution can be to integrate non-RDF data sets on demand as Linked Data. ODMTP (for On-Demand Mapping using Triple Patterns) implements a solution using a Triple Pattern Fragments (TPF) server over non-RDF data sets [17].

One of the more advanced systems based on mappings is MASTRO Studio. Its authors claim that it is the only full-fledged OBDM system which provides, in addition to OBDA functionalities, capacities to document and inspect an OBDA specification. A similar system is QUEST [22], which performs query answering over DL-Lite ontologies, and can work in both classical mode (i.e. with a local ABox) and virtual mode (i.e. using mappings to query the database).


2.2 Publishing Linked Data on Earth Observation

Collecting and integrating geographical data produced by a variety of disciplines and human activities is at the core of the Digital Earth project [11]. New data-driven applications in fields like environment, agriculture [23], risk management [1] or climate watch require data from heterogeneous sources and the ability to integrate data streams. Making geographic data sets available and then interoperable at the semantic level remains an open issue that various projects address by referring to Linked Data principles [6]. Indeed, Linked Data comes with best practices for exposing, sharing, and integrating data via dereferenceable URIs on the Web [12]. Specific guidelines to publish spatial data as Linked Open Data (LOD) are available from the W3C, with special attention to the representation of spatial relations and Coordinate Reference Systems (CRS) [24].

Various ontologies and vocabularies are recommended to represent dense geospatial raster data in the LOD. The W3C suggests the RDF Data Cube (QB) ontology [7] in combination with other W3C and OGC standard ontologies including the Semantic Sensor Network ontology (SSN)³, the Time ontology (Time)⁴, the Simple Knowledge Organisation System (SKOS)⁵, PROV-O⁶ and the recent Data Cube extension for spatio-temporal entities, QB4ST⁷. EO imagery produces voluminous data sets like gridded coverages derived from Landsat satellite sensors, and even larger RDF triple sets. Current triple stores are not suitable for storing such large data sets. A solution is then to keep the data in its original repository and collect the required RDF representations on the fly, thanks to SPARQL queries through an OBDA interface that queries observational data sources, coupled with a triple store for observational metadata.

2.3 Interlinking Data on Earth Observation

Whereas the first initiatives aimed at publishing data from one single source, recent works showed that LD principles could also make it easier to integrate data from diverse sources in one or several RDF triple-stores [6] [23]. The resulting LD repositories form Virtual Earth Observatories that, thanks to the new links identified between the data and inferred knowledge, provide much richer information sets than EO images and their standard metadata alone [14]. The GeoKnow project shares this vision: it leveraged spatial data in the Web of Data, and made available devices to collect, merge and aggregate spatial data as well as a Linked Data stack to publish, reuse and visualize it [9].

Interlinking data on EO means discovering spatial and temporal links within the RDF graph obtained after data publication [5]. Thanks to spatial links, data from observations can be associated with tiles and then with EO images. Thanks to temporal links, temporal observations can be linked to images too. When entities of the same nature are collected from various sources, an entity resolution algorithm can identify mappings between similar or identical spatial entities. We are concerned only with temporal and spatial relationships.

3 http://purl.oclc.org/NET/ssnx/ssn
4 https://www.w3.org/TR/owl-time
5 http://www.w3.org/2004/02/skos/core
6 https://www.w3.org/TR/prov-o
7 https://www.w3.org/TR/qb4st/

The OGC introduced the notion of geolinked data to refer to geographically related data. In early works, geometry was not directly stored within the attribute data, but in a separate geo-spatial data set. This option adds constraints when comparing the geometry of each entity. However, current repositories store an RDF representation of the geometry together with the RDF spatial entities. Atemezing [3] identified various types of geometries (point, line or polygon) and various tools to build an RDF representation of the geometry (like Geometry2RDF⁸ or TripleGeo⁹). The process defined by Vilches-Blazquez and colleagues [6] precisely compares data geometries, so that spatial data can be retrieved and interlinked at a high level of granularity. We have adopted a similar modality in our approach, and rely on a precise comparison of the spatial component of each entity to integrate data.

Atemezing [3] also proposes and models four vocabularies for representing CRS, topographic entities and their geometries. These ontologies extend existing vocabularies and offer two additional advantages: an explicit use of CRS identified by URIs for geometry, and the ability to describe structured geometries in RDF. The data is published as the French authoritative database GEOFLA.

Another difficulty in the integration of spatial data comes from differences in the temporal validity of the data. Some data, e.g. the position of weather stations, of cities and of most administrative places, and even land cover, are valid for a very long period, longer than that of the application, and can be considered as stable or static. In contrast, some data streams continuously provide new data at regular time spans. For instance, temperature measures are given every 3 hours by Meteo France weather reports, and tens of new EO images and their metadata are available on the PEPS server every day. The W3C RDF Data Cube recommendation [7] suggests linking each image to tiles so that one can make statements about the tiles. Tiles are geo-located square areas determined by a grid decomposition of the Earth surface. Each EO image in the Sentinel-2 Single Tile (S2ST) collection is already associated with a tile. More recently, [1] proposes a framework in which satellite images are classified and enriched with additional semantic data in order to enable queries about what can be found at a particular location. This is achieved by reasoning capabilities that rely on domain-specific spatial reasoning rules to answer high-level queries.

3 Semantic approach for EO data integration: overview and architecture

The architecture of our integration platform is modular. Its different levels allow decoupling stages in the process from raw data to semantic data. Figure 1 depicts the architecture, consisting of different modules:

8 https://github.com/boricles/geometry2rdf
9 https://github.com/GeoKnow/TripleGeo


– Data selection: The first step of the data integration process is to identify and access the data sources to be collected. A data set is either a file or the result of a query to retrieve data from a data store. The file formats currently considered are CSV, RDF, XML, TIFF and shapefiles. The data sources used in this work are described in Section 5.1.

– Data conversion: Once the data sources have been selected and the data gathered, they are first converted into a JSON pivot representation. To do so, we have reused dedicated scripts or developed customized ones, according to the specific kind of data source. The intermediate JSON files are stored in a MongoDB database as a backup.

– Data alignment: The data in JSON files is mapped to instances of classes in the ontology presented in Section 4. The mapping process relies on a template and a processing mechanism implemented as a Python module. In Section 5.2, we provide examples of the mapping templates. Thanks to the Python module we can implement customized functions that use the values in JSON documents as input data or parameters. With these functions we can perform sophisticated operations that are not possible in alternative approaches such as RML.

– Data integration: The integration process relies on the topological relationships between the instances of the model classes. Topological relationships can be either spatial or temporal. At this point, all the instances in our knowledge base have a static spatial representation. It is therefore possible to pre-process the spatial topological relationships and store them as declarative statements in the triple store. It is also possible to evaluate these relationships on the fly; however, this has a computing cost that we consider unnecessary given the nature of our data (fixed positions)¹⁰. The temporal component of the entities in the knowledge base is represented using OWL Time. Temporal topological relationships can then be established on demand at query time using SPARQL.
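The pre-processing of spatial relationships described in the last step can be illustrated with a minimal sketch. It is not the implementation used in the platform: the tile polygon, the station positions and the function names are hypothetical, and a simple ray-casting test stands in for a full GeoSPARQL engine. Only the property name geo:sfWithin comes from the GeoSPARQL vocabulary.

```python
# Sketch: pre-compute a spatial relation once, since all positions are fixed.
# Tile rings, station coordinates and function names are hypothetical.

def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon ring."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def precompute_within(stations, tiles):
    """Return (station, predicate, tile) statements for contained stations."""
    statements = []
    for station_id, position in stations.items():
        for tile_id, ring in tiles.items():
            if point_in_polygon(position, ring):
                statements.append((station_id, "geo:sfWithin", tile_id))
    return statements


tiles = {"tile_A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]}
stations = {"station_1": (1.0, 1.0), "station_2": (3.0, 1.0)}
statements = precompute_within(stations, tiles)
```

Because the resulting statements depend only on fixed geometries, they would need to be computed once and could then be stored in the triple store alongside the instances.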

This work has been carried out in the context of the SparkInData project¹¹, which aims at delivering a platform that supports the deployment of applications in the spatial domain. The SparkInData platform includes a cloud architecture and a Docker environment to implement services.

4 Model for integration of earth observation data

Our model for data integration relies on two existing vocabularies, the SOSA core ontology and the GeoSPARQL ontology. In our previous work [2], image metadata records and meteorological observations were represented with DCAT and SSN, respectively. Here, we propose instead to adopt SOSA as a core ontology that can be shared across these different types of data, as detailed below.

10 In the future, as the model evolves, we might use features with a dynamic spatial representation, for instance a weather sensor located in a car. In this case, the identification of topological relationships would need to be done dynamically.

11 SparkInData project is funded by “Investing for the Future” French program.

Fig. 1. Architecture of our services.

SOSA is a light-weight but self-contained core ontology representing elementary classes and properties of SSN (Semantic Sensor Network). SOSA describes sensors and their observations, the involved procedures, the studied features of interest, the samples used to do so, and the observed properties. SOSA is relevant for a wide range of applications, including satellite imagery. We have hence adopted SOSA for describing image metadata and meteorological observations as, respectively, Earth observations and meteorological observations (Figure 2). However, we specialized SOSA in order to better type the instances of these concepts, although the trend in domains largely adopting SOSA, such as IoT, is to avoid this kind of construction and to directly use SOSA as the main vocabulary [20]. GeoSPARQL, an OGC standard, defines a small ontology for the representation of features, spatial relations and functions [13] [4]. Alternative vocabularies exist, such as GeoRDF, which allows for representing simple data like latitude, longitude, and altitude as properties of points (using WGS84 as reference datum), and GeoOWL, which allows for expressing spatial objects (lines, rectangles, polygons). We opted for GeoSPARQL because it offers good reasoning capabilities to compare geometries. To sum up, we represent temporal relationships mainly using the time properties of SOSA (reusing the OWL Time vocabulary), and spatial relations thanks to GeoSPARQL.

As depicted in Figure 2, a satellite image metadata record has a spatial dimension and a temporal dimension. Both of these dimensions contribute to linking observation data to the image metadata. The temporal dimension of an image metadata record identifies the moment when the image was captured. The external data source, the weather information, also has a temporal dimension. Weather stations record measurements periodically. So we use the class sosa:Observation as a way to connect the measured variables to the weather station while, at the same time, providing a temporal dimension for the observations. Then, we can link an image metadata record and weather measurements with temporal relationships (before, after), or store periods of interest (e.g., one week after the image was created).
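As a small illustration of these temporal relationships, the following sketch classifies a weather observation relative to an image capture time and checks whether it falls within a period of interest. It is only a sketch under stated assumptions: the timestamps and function names are hypothetical, and in our setting such relations would rather be evaluated with SPARQL over OWL Time instants.

```python
from datetime import datetime, timedelta


def temporal_relation(observation_time, image_time):
    """Classify an observation instant relative to the image capture instant."""
    if observation_time < image_time:
        return "before"
    if observation_time > image_time:
        return "after"
    return "equals"


def in_period_after(observation_time, image_time, days=7):
    """True if the observation falls within `days` after the capture (bounds included)."""
    return image_time <= observation_time <= image_time + timedelta(days=days)


# Hypothetical capture time and observation times, for illustration only.
image_time = datetime(2017, 9, 20, 10, 30)
earlier = datetime(2017, 9, 20, 9, 0)
later = datetime(2017, 9, 23, 12, 0)
much_later = datetime(2017, 9, 30, 12, 0)

relations = [temporal_relation(t, image_time) for t in (earlier, later, much_later)]
in_week = [in_period_after(t, image_time) for t in (earlier, later, much_later)]
```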

With respect to the spatial dimension, we use GeoSPARQL to create statements that describe the topological relationships (contains, overlaps) between a satellite image footprint and other entities with a spatial nature. GeoSPARQL allows expressing such relationships between two resources (two geometries or two features) using topological properties (direct properties) or topological functions (computed properties). In our model, the class eom:Footprint specializes both geo:Feature and sosa:FeatureOfInterest: a footprint is a closed polygon (a geometry) that represents the geographic area covered by the image. Thanks to this specialization we are able to link the metadata records with any other information that has a spatial component and is defined as a geo:Feature.

Fig. 2. The integration model. The SOSA and GeoSPARQL vocabularies are specialized in 4 modules dedicated to each knowledge source and to the grid representation.

We represent a weather station as an instance of the class mfo:MeteoStation, which is a subclass of sosa:Platform. The sensors operating in a weather station are represented as instances of mfo:MeteoSensor, which is a subclass of sosa:Sensor. The specific geographic position of the measurement is represented as an instance of the class mfo:MeteoFeatureOfInterest, a subclass of sosa:FeatureOfInterest. The class mfo:MeteoFeatureOfInterest is also a subclass of geo:Feature; thus, knowing the position of a mfo:MeteoFeatureOfInterest, it is easy to identify features of another nature that overlap the weather observations.


In order to link EOs to administrative units of France (regions, departments and cities) thanks to their geographic location (point or polygon), we have enriched the model with the admin:AdministrativeUnit class as a subclass of geo:Feature. Finally, for Sentinel-2 Single Tile images, tiles correspond to the image feature of interest.

5 Data selection, conversion and alignment

5.1 Data selection

Sources of dynamic data. As stated above, within the SparkInData project, we use metadata records of Sentinel images¹². The revisit time for Sentinel-1 is twelve days, while for Sentinel-2 it is five days. The metadata records are obtained in GeoJSON format from RESTO, a data service managed by CNES (Centre National d’Etudes Spatiales) [10]. For instance, the following URL will return all the metadata records for the collection Sentinel-2 Single Tile for France that have been produced between 23:00 on 2017-09-19 and 00:00 on 2017-09-25:
https://peps.cnes.fr/resto/api/collections/S2ST/search.json?q=France&startDate=2017-09-19T23:00:00&completionDate=2017-09-25T00:00:00
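Such a request URL can be assembled programmatically. The sketch below only rebuilds the URL given above from its parts; the function name is hypothetical, the parameter names (q, startDate, completionDate) are those visible in the example URL, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base of the collection search endpoint, taken from the example URL above.
RESTO_BASE = "https://peps.cnes.fr/resto/api/collections"


def build_search_url(collection, query, start, end):
    """Build a RESTO search URL for a collection and a time interval (hypothetical helper)."""
    params = {"q": query, "startDate": start, "completionDate": end}
    # safe=":" keeps the colons of the ISO timestamps unescaped.
    return f"{RESTO_BASE}/{collection}/search.json?{urlencode(params, safe=':')}"


url = build_search_url(
    "S2ST", "France", "2017-09-19T23:00:00", "2017-09-25T00:00:00"
)
```

A nightly collection job could call such a helper with the interval of the previous day and pass the result to any HTTP client.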

Using the RESTO API it is possible to specify the parameters to be retrieved, i.e. specific metadata in the record, such as cloud cover, interval of time, geographic area of interest, etc. We collect this data once every night. As dynamic contextual data, we use weather information provided by SYNOP Meteo France¹³. This organization offers data as monthly compiled CSV zipped files. The observations are taken every three hours for each one of the 62 weather stations in France. A separate file contains a list of the weather stations with their position as points encoded as geographic coordinates.

Sources of static data. The KML grid file is available from ESA¹⁴. In the case of images from the Sentinel-2 Single Tile data set, information about the spatial coverage of the image can be obtained from the metadata in two forms: 1) the image footprint, 2) the identifier of the tile that corresponds to the image. Then, it is possible to link the geometry of the tiles to other data sets. Another source of static data, GLC-SHARE (Global Land Cover SHARE), is produced by FAO and provides Land Cover information. This data set is available as an image in TIFF format. Each pixel has a spatial resolution of approximately 1 sq km. The land cover information is thematic. The pixel values in the land cover image are integers that represent the most prevalent land cover for the area that the pixel covers. We have pre-calculated the Land Cover composition for each tile over France. We can then connect images to this information without performing the operation on the fly. For each tile we have information regarding the percentage of each of the land cover classes (artificial surfaces, cropland, tree covered areas, water bodies, etc.) existing in the area the tile covers. Finally, we collect data about administrative units from the Open platform for French public data¹⁵. The data is originally provided as shapefiles.
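The pre-calculation of the land cover composition of a tile amounts to counting the pixel values that fall inside the tile and converting the counts into percentages. The following sketch shows this step on a toy list of pixel values. The function name, the pixel values and the mapping from integer codes to class names are illustrative assumptions, not the actual GLC-SHARE legend, and reading the TIFF raster and clipping it to a tile are left out.

```python
from collections import Counter

# Illustrative code-to-class mapping (an assumption, not the official legend).
CLASS_NAMES = {1: "artificial surfaces", 2: "cropland",
               4: "tree covered areas", 11: "water bodies"}


def land_cover_composition(pixel_values, class_names):
    """Return the percentage of each land cover class among the given pixels."""
    counts = Counter(pixel_values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        class_names.get(code, f"class_{code}"): 100.0 * count / total
        for code, count in counts.items()
    }


# Hypothetical pixel values covering one tile.
tile_pixels = [2, 2, 4, 11]
composition = land_cover_composition(tile_pixels, CLASS_NAMES)
```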

12 https://sentinel.esa.int/web/sentinel/missions/ (07/2016)
13 https://donneespubliques.meteofrance.fr/ (07/2016)
14 https://sentinel.esa.int/web/sentinel/missions/sentinel-2/news/-/article/sentinel-2-tiling-grid-updated
15 https://www.data.gouv.fr


5.2 Data conversion and alignment

As presented above, we use data from various sources (metadata from satellite images, Land Cover, Administrative Units, and weather observations) with diverse original formats. In order to standardize procedures, we transform the data into JSON and then apply a mapping tool to obtain RDF. The mapping mechanism is explained in the following sections, with examples of the conversion of Administrative Units and Weather Observations. The conversion procedure for the other data sources is similar, although the mapping mechanism can be more complex due to the number of items that need to be mapped.

Administrative units. A common language for the conversion of data into RDF is RML. One of the major limitations of RML is that it is difficult to implement customized functions for particular pieces of information. To illustrate this limitation, consider the following JSON document:

{"wkt": "MULTIPOLYGON(((
    -1.0988062299633785 45.64032288975508, ...
    -1.0988062299633785 45.64032288975508))...)",
 "name": "Poitou-Charentes", "geomType": 5,
 "inseeInfo": {"adminType": "region", "insee": "54"}}

It describes an administrative unit located in France. The value of the key wkt is the geometry of the administrative unit encoded as Well-Known Text (WKT), while the value of the key name is a string that gives the name of the unit. The key inseeInfo contains information referring to the identification of this unit according to the Institut National de la Statistique et des Etudes Economiques (INSEE), the statistical bureau of France. Using the information contained in inseeInfo, we could obtain the URI of this administrative unit as it is represented in the INSEE knowledge base. However, this requires creating a SPARQL query and sending it to the INSEE SPARQL endpoint¹⁶. This task is not easy to implement with available RML processors. To solve this problem, we developed a customized solution for mapping JSON into RDF. The solution consists of a triple template and a processor written in Python. For the administrative units we use the module admin of the ontology described in Section 4. The following listing is an example of a template designed to process the previously described JSON document into the vocabulary admin.

@prefix geo: <http://www.opengis.net/ont/geosparql#> .

@prefix admin: <http://melodi.irit.fr/ontologies/administrativeUnits.owl#> .

# this template defines the structure of an administrative unit

<dummy> a admin:AdministrativeUnit .

<dummy> a getUrlAdministrativeUnitType($.inseeInfo.adminType) .

<dummy> admin:hasInseeCode stringToLiteral($.inseeInfo.insee) .

<dummy> admin:hasName stringToLiteral($.name) .

# here we define the spatial representation of an administrative unit.

<dummy> a geo:Feature .

<dummy> geo:hasGeometry <dummy_geo> .

<dummy_geo> a geo:Geometry .

<dummy_geo> geo:asWKT valueToWktLiteral($.wkt) .

# here we link this instance to the corresponding INSEE administrative unit

<dummy> owl:sameAs getInseeUrl($.inseeInfo) .

16 http://rdf.insee.fr/sparql

The template consists of triples containing placeholders that are replaced by actual values. In some cases, the values contained in the JSON document need further processing. We provide this processing through customized functions that take as parameters the information extracted from the JSON document. We extract the information from the JSON file using JSONPath. For instance, in the case of stringToLiteral($.name), the value in the JSON document for the key name is assigned the string literal datatype. In the case of getInseeUrl($.inseeInfo), more sophisticated processing is implemented: the function creates a SPARQL query with the parameter values, sends it to the INSEE SPARQL endpoint, examines the result and returns the URI of the INSEE administrative unit that matches the parameter. The SPARQL query generated by the function getInseeUrl() is the following one:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX igeo:<http://rdf.insee.fr/def/geo#>

SELECT ?adminUnit WHERE {

?adminUnit rdf:type igeo:Region .

?adminUnit igeo:codeINSEE "54"^^<http://www.w3.org/2001/XMLSchema#token> .}
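The query construction step can be illustrated with a minimal Python sketch. It only builds the query text and does not contact the endpoint. The function name build_insee_query and the lookup table ADMIN_TYPE_TO_CLASS are hypothetical; only the mapping from "region" to igeo:Region is taken from the query shown above.

```python
# Hypothetical lookup from the JSON adminType value to an INSEE class.
# Only "region" -> igeo:Region is confirmed by the query in the text.
ADMIN_TYPE_TO_CLASS = {"region": "igeo:Region"}

XSD_TOKEN = "http://www.w3.org/2001/XMLSchema#token"


def build_insee_query(insee_info):
    """Return the SPARQL query text for one inseeInfo JSON object."""
    rdf_class = ADMIN_TYPE_TO_CLASS[insee_info["adminType"]]
    code = insee_info["insee"]
    return "\n".join([
        "PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
        "PREFIX igeo:<http://rdf.insee.fr/def/geo#>",
        "SELECT ?adminUnit WHERE {",
        f"  ?adminUnit rdf:type {rdf_class} .",
        # The doubled closing brace yields one literal '}' in the f-string.
        f'  ?adminUnit igeo:codeINSEE "{code}"^^<{XSD_TOKEN}> .}}',
    ])


query = build_insee_query({"adminType": "region", "insee": "54"})
print(query)
```

Keeping the query construction separate from the network call makes the mapping step easy to check without access to the INSEE endpoint.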

The resulting RDF is depicted in the following snippet:

@prefix geo: <http://www.opengis.net/ont/geosparql#> .

@prefix admin: <http://melodi.irit.fr/ontologies/administrativeUnits.owl#> .

@prefix l_admin: <http://melodi.irit.fr/lod/administrativeUnit/> .

l_admin:region_54 a admin:AdministrativeUnit .

l_admin:region_54 a admin:Region .

l_admin:region_54 owl:sameAs <http://id.insee.fr/geo/region/54> .

l_admin:region_54 admin:hasInseeCode "54"^^xsd:String .

l_admin:region_54 admin:hasName "Poitou-Charentes"^^xsd:String .

l_admin:region_54 a geo:Feature .

l_admin:region_54 geo:hasGeometry l_admin:region_54_geo .

l_admin:region_54_geo geo:asWKT "MULTIPOLYGON(((

-1.0988062299633785 45.64032288975508, ...

-1.0988062299633785 45.64032288975508))...)"^^wkt:Literal .
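The template expansion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' processor: the helper names (json_path, expand, FUNCTIONS), the simplified path resolver that only handles dotted keys, and the rule used to build the subject identifier are all hypothetical. The sketch emits the standard datatype name xsd:string.

```python
import re


def json_path(doc, path):
    """Resolve a simple '$.a.b' path (dotted keys only) against a dict."""
    value = doc
    for key in [k for k in path.split(".")[1:] if k]:
        value = value[key]
    return value


def string_to_literal(value):
    # Typed literal using the standard XML Schema string datatype.
    return f'"{value}"^^xsd:string'


def admin_unit_type(value):
    # Assumption: the class name is the capitalised adminType value.
    return f"admin:{value.capitalize()}"


# Registry of template functions; the keys mirror names used in the template.
FUNCTIONS = {
    "stringToLiteral": string_to_literal,
    "getUrlAdministrativeUnitType": admin_unit_type,
}

# Matches calls such as stringToLiteral($.name).
CALL = re.compile(r"(\w+)\((\$[\w.]*)\)")


def expand(template, doc, subject):
    """Replace <dummy> and evaluate every function call in the template."""
    triples = []
    for line in template.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        line = line.replace("<dummy>", subject)
        line = CALL.sub(
            lambda m: FUNCTIONS[m.group(1)](json_path(doc, m.group(2))), line
        )
        triples.append(line)
    return triples


TEMPLATE = """
# structure of an administrative unit
<dummy> a admin:AdministrativeUnit .
<dummy> a getUrlAdministrativeUnitType($.inseeInfo.adminType) .
<dummy> admin:hasInseeCode stringToLiteral($.inseeInfo.insee) .
<dummy> admin:hasName stringToLiteral($.name) .
"""

doc = {"name": "Poitou-Charentes",
       "inseeInfo": {"adminType": "region", "insee": "54"}}
# Assumption: the subject identifier combines adminType and INSEE code.
subject = f"l_admin:{doc['inseeInfo']['adminType']}_{doc['inseeInfo']['insee']}"
triples = expand(TEMPLATE, doc, subject)
print("\n".join(triples))
```

A function registry of this kind is what lets resource-specific processing, such as a remote lookup, be added without changing the template syntax.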

Weather observations The temporal dimension of weather data is of particular importance. The observations contained in the SYNOP dataset differ in their temporal extent. For instance, the observations codified as tminsol depict the lowest soil temperature recorded in the previous 12 hours. On the other hand, the code t corresponds to the temperature at the moment of measuring. Using our approach, we can implement functions that are able to handle the diverse temporal nature of this type of data. For instance, the following snippet represents an observation of type tminsol, for the station with id 07747, recorded on 2017/12/06 at 03:00.

{ "temporalInfo" :

{ "timeStamp" : 1512529200,

"month" : "12",

"day" : "06",


"hour" : "03",

"year" : "2017" },

"tminsol" : 271.45,

"numer_sta" : "07747" }

To process observations from Meteo France SYNOP, our processing script implements the function getMFO_PhenomenonTime(doc), in which the parameter is the JSON document. In the template this is represented as:

<dummy> sosa:phenomenonTime getMFO_PhenomenonTime(doc) .

The function scans the JSON document and retrieves the type of observation by identifying the key (tminsol). Using this information, the function knows that it needs to create an instance of the class time:Interval. It then examines the element temporalInfo and computes the beginning of the interval (the end of the interval is the value contained in temporalInfo). Both the beginning and the end of the interval are encoded as instances of time:Instant. The result of the function is:

gmfo:Obs_07747_20171206030000_tminsol sosa:phenomenonTime

gmfo:TimeInterval_1512486000_1512529200 .

gmfo:TimeInterval_1512486000_1512529200 a time:TemporalEntity .

gmfo:TimeInterval_1512486000_1512529200 time:hasBeginning

gmfo:TimeInterval_1512486000_1512529200_beginning .

gmfo:TimeInterval_1512486000_1512529200_beginning time:inXSDDateTime

"2017-12-05T15:00:00+0100"^^xsd:dateTime .

gmfo:TimeInterval_1512486000_1512529200 time:hasEnd

gmfo:TimeInterval_1512486000_1512529200_end .

gmfo:TimeInterval_1512486000_1512529200_end time:inXSDDateTime

"2017-12-06T03:00:00+0100"^^xsd:dateTime .

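The interval computation can be sketched as follows. This is a minimal illustration, not the authors' getMFO_PhenomenonTime: the function name phenomenon_time and the table LOOKBACK_HOURS are hypothetical, only the 12-hour window for tminsol and the instantaneous nature of t come from the text, and timestamps are interpreted in UTC here as an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hours covered by each SYNOP code. The values for tminsol (12 h) and
# t (instantaneous) come from the text; the table itself is an assumption.
LOOKBACK_HOURS = {"tminsol": 12, "t": 0}


def phenomenon_time(doc):
    """Return (code, begin, end) with begin/end as UTC datetimes."""
    # Identify the observation type by finding a known key in the document.
    code = next(key for key in doc if key in LOOKBACK_HOURS)
    end = datetime.fromtimestamp(doc["temporalInfo"]["timeStamp"],
                                 tz=timezone.utc)
    begin = end - timedelta(hours=LOOKBACK_HOURS[code])
    return code, begin, end


doc = {
    "temporalInfo": {"timeStamp": 1512529200, "month": "12", "day": "06",
                     "hour": "03", "year": "2017"},
    "tminsol": 271.45,
    "numer_sta": "07747",
}
code, begin, end = phenomenon_time(doc)
# The two epoch values match the interval identifier in the RDF above.
print(code, int(begin.timestamp()), int(end.timestamp()))
```

For a code with a zero-hour window, such as t, begin and end coincide, which corresponds to an instant rather than an interval.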
Satellite image metadata The metadata files are obtained in GeoJSON format. The OWL vocabulary we use to represent this information is eom (Section 4). The process to transform metadata records from GeoJSON to RDF is similar to the one previously described for Administrative Units and Weather observations. The only difference is the template, which has to be designed taking into consideration the data source and the target vocabulary.

Image tiles Sentinel images have different characteristics depending on the sensor that creates them. In September 2016, ESA started to distribute Sentinel 2 images as single tile (S2ST) packages. An S2ST represents a fragment of the original image with a fixed size (approx. 100 x 100 km). The advantage over a regular S2 image is that users can better select their area of interest and download only the information they require. An S2ST has a smaller file size than a regular S2 image: one S2ST image can be around 500 MB, while a Sentinel 2 image before tiling can be more than 3 GB. The size and shape of the Sentinel 2 single tile images are based on a regular grid provided by ESA as a KML file. In our work we transformed the grid file to JSON and then processed it into RDF using the procedure previously described. The vocabulary we use to represent the grid is grid, described in Section 4. We can calculate topological relationships of spatial elements with the tiles and then extrapolate this information to the images. For instance, by knowing that images [img1, img2, img3] share the tile tile1, and that tile1 overlaps adminUnit_i, we can infer that [img1, img2, img3] also overlap adminUnit_i.
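The propagation from tiles to images can be sketched as a simple join. The data structures and the function name infer_image_relations are hypothetical, chosen only to mirror the example in the text; in the approach described here the same inference would be expressed over RDF.

```python
def infer_image_relations(image_tiles, tile_relations):
    """Propagate (tile, relation, unit) facts to every image on that tile."""
    inferred = set()
    for image, tile in image_tiles.items():
        for rel_tile, relation, unit in tile_relations:
            if rel_tile == tile:
                inferred.add((image, relation, unit))
    return inferred


# Three images share tile1; img4 lies on a tile with no known relation.
image_tiles = {"img1": "tile1", "img2": "tile1",
               "img3": "tile1", "img4": "tile2"}
tile_relations = [("tile1", "overlaps", "adminUnit_i")]

result = infer_image_relations(image_tiles, tile_relations)
for fact in sorted(result):
    print(fact)
```

Because the relation is computed once per tile rather than once per image, the geometric comparison does not have to be repeated for each new image on the same tile.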

Land cover In our research we use Land Cover as a contextual data set. In order to integrate this data source, we use a service implemented as a Django module in Python. The service has a REST interface that accepts as a parameter a WKT polygon in SRS EPSG:4326. The service crops the original data set into a temporary file using the polygon and then creates a frequency table. The response of the server is a JSON document containing the percentage of the area for each land cover class. In our work, we use this JSON document as the input for our JSON to RDF transformation procedure.
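The frequency-table step can be illustrated with a small sketch. It assumes the cropped data set has already been read into a flat list of per-cell class labels; the class names, the function name land_cover_percentages and the shape of the JSON response are hypothetical, and the cropping itself is not shown.

```python
import json
from collections import Counter


def land_cover_percentages(cells):
    """Return the share of cells per land cover class, in percent."""
    counts = Counter(cells)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Hypothetical class labels for the cells inside the query polygon.
cells = ["forest", "forest", "urban", "water",
         "forest", "urban", "forest", "water"]

percentages = land_cover_percentages(cells)
response = json.dumps(percentages, sort_keys=True)
print(response)
```

With equally sized cells, the share of cells per class equals the share of area, which is the quantity the service is described as returning.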

5.3 Data integration

Integration of data with fixed spatial component Spatial relations that are relatively stable, e.g. topological relations between grids (SS2) and administrative units (image Y overlaps region R), or land cover information for each cell of the grid, are computed and stored in the triple store (Figure 3). In our approach, we use a Python script to calculate the topological relationships between instances of classes. The script uses the shapely library to make the topological comparisons. We then register them in a triple store using declarative statements involving GeoSPARQL topological properties.

Fig. 3. The spatial integration of geolocalized features. Links correspond to spatial topological properties of GeoSPARQL used to link instances of the classes.
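The step of turning topological comparisons into GeoSPARQL statements can be sketched without external dependencies. To stay self-contained, the sketch below replaces the shapely geometry predicates used in our script with axis-aligned bounding boxes, which is a deliberate simplification; the identifiers grid:tile1, grid:tile2, l_admin:region_54 and l_admin:unit_x and their coordinates are made up for illustration.

```python
def relation(a, b):
    """Classify two boxes (minx, miny, maxx, maxy).

    Returns a GeoSPARQL simple-features property name, or None when the
    interiors of the boxes do not intersect.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax1 <= bx0 or bx1 <= ax0 or ay1 <= by0 or by1 <= ay0:
        return None  # disjoint or merely touching
    if ax0 >= bx0 and ay0 >= by0 and ax1 <= bx1 and ay1 <= by1:
        return "geo:sfWithin"
    if bx0 >= ax0 and by0 >= ay0 and bx1 <= ax1 and by1 <= ay1:
        return "geo:sfContains"
    return "geo:sfOverlaps"


# Illustrative boxes only; real tiles and units are polygons.
tiles = {"grid:tile1": (0, 0, 10, 10), "grid:tile2": (10, 0, 20, 10)}
units = {"l_admin:region_54": (5, 5, 15, 8), "l_admin:unit_x": (1, 1, 2, 2)}

triples = []
for tile_id, tile_box in tiles.items():
    for unit_id, unit_box in units.items():
        prop = relation(tile_box, unit_box)
        if prop is not None:
            triples.append(f"{tile_id} {prop} {unit_id} .")

print("\n".join(triples))
```

With real polygons the classification would come from a geometry library, but the structure stays the same: one comparison per pair of features and one triple per relation found.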

Integration of data with temporal dimension It is possible to establish temporal relationships between an image metadata record and weather measurements, or to establish periods of interest. A user can define a relevant period of time and link an image metadata record to the available weather information (e.g., weather information captured one week after the image was created). The user-defined period works as a temporal buffer that provides context to the image metadata record.
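The temporal buffer can be sketched as an interval filter. This is an illustration only: the function name observations_in_buffer, the observation identifiers and the dates are hypothetical, and in the approach described here the selection is made with SPARQL queries rather than in Python.

```python
from datetime import datetime, timedelta


def observations_in_buffer(image_time, observations,
                           before=timedelta(0), after=timedelta(days=7)):
    """Return ids of observations inside [image_time - before, image_time + after]."""
    start, end = image_time - before, image_time + after
    return [obs_id for obs_id, obs_time in observations
            if start <= obs_time <= end]


image_time = datetime(2017, 12, 6, 10, 30)
observations = [
    ("obs_a", datetime(2017, 12, 5, 15, 0)),   # before the image
    ("obs_b", datetime(2017, 12, 6, 12, 0)),   # same day, after the image
    ("obs_c", datetime(2017, 12, 13, 9, 0)),   # just inside the one-week buffer
    ("obs_d", datetime(2017, 12, 14, 0, 0)),   # outside the buffer
]

linked = observations_in_buffer(image_time, observations)
print(linked)
```

Setting the before parameter to a non-zero value gives a buffer that also covers the period preceding the acquisition.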

6 Conclusion

The integration of EO data from heterogeneous sources with satellite image metadata can benefit greatly from semantic web technologies and OBDM. Publishing some data sets and image metadata as LOD opens new opportunities to use satellite images in a larger variety of applications by providing easier access to linked Earth observations. Moreover, for large and dynamic data sets, using SPARQL queries to jointly search observational databases and LD makes it possible to create RDF triples on the fly and avoids converting huge data sets into RDF triples. In this paper, we have proposed a spatial data integration framework. Several of our contributions improve this process: we designed a vocabulary to represent EO data and image metadata; we proposed an RDF conversion process using resource-specific templates and a Python library that overcomes some of the RML limitations; we also proposed an integration process that exploits the data geometry and GeoSPARQL to link spatial data; and finally SPARQL queries to get dynamic data linked to images according to spatial and temporal features. As future work, we plan to consider domain-oriented sources of data for a particular use case (agriculture, with data sources such as agricultural reports) and provide rules and reasoning capabilities that support domain-specific analysis.

References

1. M. Alirezaie, A. Kiselev, M. Lngkvist, F. Klgl, and A. Loutfi. An ontology-basedreasoning framework for querying satellite images for disaster monitoring. Sensors,17(11), 2017.

2. H. Arenas, N. Aussenac-Gilles, C. Comparot, and C. Trojahn. Semantic integra-tion of geospatial data from earth observations. In Knowledge Engineering andKnowledge Management - EKAW 2016 Satellite Events, pages 97–100, 2016.

3. G. A. Atemezing. Publishing and consuming geo-spatial and government data onthe semantic web. PhD thesis, Thesis, 04 2015.

4. R. Battle and D. Kolas. Enabling the Geospatial Semantic Web with Parliamentand GeoSPARQL. Semantic Web, 3(October 2012):355–370, 2012.

5. L. M. V. Blazquez, V. Saquicela, and O. Corcho. Interlinking geospatial informa-tion in the web of data. In Bridging the Geographic Information Sciences - In-ternational AGILE’2012 Conference, Avignon, France, April, 24-27, 2012, pages119–139, 2012.

6. L. M. V. Blazquez, B. Villazon-Terrazas, O. Corcho, and A. Gomez-Perez. Inte-grating geographical information in the linked digital earth. International Journalof Digital Earth, 7(7):554–575, 2014.

7. D. Brizhinev, S. Toyer, K. Taylor, and Z. Zhang. Publishing and using earth obser-vation data with the rdf data cube and the discrete global grid system. Technicalreport, W3C and OGC, 2017.

8. M. Console and M. Lenzerini. Reducing global consistency to local consistencyin ontology-based data access - extended abstract. In M. Bienvenu, M. Ortiz,R. Rosati, and M. Simkus, editors, Informal Proceedings of the 27th InternationalWorkshop on Description Logics, Vienna, Austria, July 17-20, 2014., volume 1193of CEUR Workshop Proceedings, pages 496–499. CEUR-WS.org, 2014.

9. A. Garcıa-Rojas, S. Athanasiou, J. Lehmann, and D. Hladky. Geoknow: Leveraginggeospatial data in the web of data. In Open Data on the Web, 2013.

10. J. Gasperi. Semantic Search Within Earth Observation Products Database Basedon Automatic Tagging of Image Content. In Proceedings of the Conference on BigData from Space, pages 4–6, 2014.

11. A. Gore. The digital earth. Australian Surveyor, 43(2):89–91, 1998.12. T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space ;

Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.

Page 15: Semantic Integration of Geospatial Data from Earth ... · aws identi cation [8]. ... [19] applies on DL1 description logic models; the SPARQL-generate algorithm matches data les of

Semantic Integration of Geospatial Data through Topological Relations 15

13. D. Kolas, M. Perry, and J. Herring. Getting started with GeoSPARQL. Technicalreport, OGC, 2013.

14. M. Koubarakis, M. Karpathiotakis, K. Kyzirakos, C. Nikolaou, S. Vassos, G. Gar-bis, M. Sioutis, K. Bereta, S. Manegold, M. Kersten, M. Ivanova, H. Pirk, Y. Zhang,C. Kontoes, I. Papoutsis, T. Herekakis, D. Mihail, M. Datcu, G. Schwarz, O. Du-mitru, D. Molina, K. Molch, U. Giammatteo, M. Sagona, S. Perelli, E. Klien,T. Reitz, and R. Gregor. Building virtual earth observatories using ontologies andlinked geospatial data. In M. Krotzsch and U. Straccia, editors, Web Reasoningand Rule Systems: 6th Int. Conf. RR 2012, Vienna, Austria, Sept. 10-12, 2012.Proceedings, pages 229–233, Berlin, Heidelberg, 2012. Springer.

15. M. Lefrancois, A. Zimmermann, and N. Bakerally. A SPARQL extension for gen-erating RDF from heterogeneous formats. In Proc. Extended Semantic Web Con-ference (ESWC’17), Portoroz, Slovenia, May 2017.

16. M. Lenzerini. Ontology-based data management. In Proceedings of the 20th ACMInternational Conference on Information and Knowledge Management, CIKM ’11,pages 5–6, New York, NY, USA, 2011. ACM.

17. B. Moreau, P. Serrano-Alvarado, E. Desmontils, and D. Thoumas. Querying non-rdf datasets using triple patterns. In Nikitina et al. [18].

18. N. Nikitina, D. Song, A. Fokoue, and P. Haase, editors. Proc. of the ISWC 2017Posters & Demonstrations and Industry Tracks co-located with (ISWC 2017), Vi-enna, Austria, Oct.23rd-25th, 2017, volume 1963 of CEUR Workshop Proceedings.CEUR-WS.org, 2017.

19. H. Perez-Urbina, B. Motik, and I. Horrocks. A comparison of query rewritingtechniques for dl-lite. In B. C. Grau, I. Horrocks, B. Motik, and U. Sattler, editors,Proceedings of the 22nd International Workshop on Description Logics (DL 2009),Oxford, UK, July 27-30, 2009, volume 477 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.

20. A. Pomp, A. Paulus, S. Jeschke, and T. Meisen. Eskape: Platform for enablingsemantics in the continuously evolving internet of things. In 2017 IEEE 11thInternational Conference on Semantic Computing (ICSC), pages 262–263, 2017.

21. F. Reitsma and J. Albrecht. Modeling with the semantic web in the geosciences.IEEE Intelligent Systems, 20(2):86–88, 2005.

22. M. Rodrguez-muro and D. Calvanese. High performance query answering overdl-lite ontologies. In Proceedings of KR’12, pages 308–318, 2012.

23. D. Sukhobok, H. Sanchez, J. Estrada, and D. Roman. Linked data for commonagriculture policy: Enabling semantic querying over sentinel-2 and lidar data. InNikitina et al. [18].

24. J. Tandy, L. van den Brink, and P. Barnaghi. Spatial data on the web best practices,w3c working group note. Technical report, W3C and OGC, 2017.