
Project funded by the European Union’s Horizon 2020 Research and Innovation Programme (2014 – 2020)

Support Action

Big Data Europe – Empowering Communities with Data Technologies

Project Number: 644564
Start Date of Project: 01/01/2015
Duration: 36 months

Deliverable 5.6

Domain-Specific Big Data Integrator Instances III

Dissemination Level: Public
Due Date of Deliverable: Month 32, 01/09/2017
Actual Submission Date: 27/09/2017
Work Package: WP5, Big Data Integrator Instances
Task: T5.2
Type: Other
Approval Status: Approved
Version: 1.00
Number of Pages: 41
Filename: D5.6_Domain-Specific Big Data Integrator Instances III

Abstract: Documentation of the Big Data Integrator instances deployed for executing the Big Data Europe pilots across all seven societal challenges.

The information in this document reflects only the author’s views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided “as is” without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Ref. Ares(2017)4715575 - 27/09/2017


History

Version 0.00, 21/01/2016: Document structure. Revised by S. Konstantopoulos.

Version 0.01, 25/08/2017: First draft based on descriptions of the third piloting cycle. Revised by A. Charalambidis and S. Konstantopoulos.

Version 0.02, 30/1/2017: Pilot corrections on SC5. Revised by T. Davvetas and I. Mouchakis.

Version 0.03, 11/09/2017: SC4 pilot description. Revised by L. Selmi.

Version 0.08, 15/09/2017: Peer review. Revised by A. Versteden and H. Jabeen.

Version 0.09, 25/09/2017: Address peer review comments. Revised by A. Charalambidis and S. Konstantopoulos.

Version 1.00, 27/09/2017: Final version.


Author List

Organisation   Name
NCSR-D         Ioannis Mouchakis
NCSR-D         Stasinos Konstantopoulos
NCSR-D         Angelos Charalambidis
NCSR-D         Georgios Stavrinos
NCSR-D         Vangelis Karkaletsis
NCSR-D         Thanasis Davvetas
TenForce       Aad Versteden
UoB            Hajira Jabeen
SWC            Jürgen Jakobitsch
FhG            Luigi Selmi


Executive Summary

This report documents the instantiations of the Big Data Integrator Platform that underlies the pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon 2020 Societal Challenges. These platform instances will be provided to the relevant networking partners to be used for executing the pilot sessions foreseen in WP6.

For each of the seven pilots, this document provides (a) a brief summary of the pilot description prepared in WP6, and especially of the use cases provided in the pilot descriptions; (b) the technical requirements for carrying out these use cases; (c) an architecture that shows the BDI components required to cover these requirements; and (d) the list of components in the architecture and their status (available as part of BDI, otherwise available, or to be developed as part of the pilot).


Abbreviations and Acronyms

BDI: The Big Data Integrator platform that is developed within Big Data Europe. The components that are made available to the pilots by BDI are listed here: https://github.com/big-data-europe/README/wiki/Components

BDI Instance: A specific deployment of BDI complemented by tools specifically supporting a given Big Data Europe pilot.

BT: Bluetooth

ECMWF: European Centre for Medium-Range Weather Forecasts

ESGF: Earth System Grid Federation

FCD: Floating Car Data

LOD: Linked Open Data

SC1: Societal Challenge 1: Health, Demographic Change and Wellbeing

SC2: Societal Challenge 2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy

SC3: Societal Challenge 3: Secure, Clean and Efficient Energy

SC4: Societal Challenge 4: Smart, Green and Integrated Transport

SC5: Societal Challenge 5: Climate Action, Environment, Resource Efficiency and Raw Materials

SC6: Societal Challenge 6: Europe in a changing world – Inclusive, innovative and reflective societies

SC7: Societal Challenge 7: Secure societies – Protecting freedom and security of Europe and its citizens

AK: Agroknow, Belgium

CERTH: Centre for Research and Technology Hellas, Greece

CRES: Center for Renewable Energy Sources and Saving, Greece

FAO: Food and Agriculture Organization of the United Nations, Italy

FhG: Fraunhofer IAIS, Germany

InfAI: Institute for Applied Informatics, Germany

NCSR-D: National Center for Scientific Research “Demokritos”, Greece

OPF: Open PHACTS Foundation, UK

SWC: Semantic Web Company, Austria

UoA: National and Kapodistrian University of Athens, Greece

VU: Vrije Universiteit Amsterdam, the Netherlands


Table of contents

1. Introduction
   1.1 Purpose and Scope
   1.2 Methodology
2. Third SC1 Pilot Deployment
   2.1 Use Cases
   2.2 Requirements
   2.3 Architecture
   2.4 Deployment
3. Third SC2 Pilot Deployment
   3.1 Overview
   3.2 Requirements
   3.3 Architecture
   3.4 Deployment
4. Third SC3 Pilot Deployment
   4.1 Overview
   4.2 Requirements
   4.3 Architecture
   4.4 Deployment
5. Third SC4 Pilot Deployment
   5.1 Use Cases
   5.2 Requirements
   5.3 Architecture
   5.4 Deployment
6. Third SC5 Pilot Deployment
   6.1 Use Cases
   6.2 Requirements
   6.3 Architecture
   6.4 Deployment
7. Third SC6 Pilot Deployment
   7.1 Use Cases
   7.2 Requirements
   7.3 Architecture
   7.4 Deployment
8. Third SC7 Pilot Deployment
   8.1 Use Cases
   8.2 Requirements
   8.3 Architecture
   8.4 Deployment
9. Conclusions


1. Introduction

1.1 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving the needs of the domains examined within Big Data Europe. These platform instances will be provided to the relevant networking partners to execute the pilots foreseen in WP6.

1.2 Methodology

Task 5.2 focuses on the application of the generic instantiation methodology to specific use cases pertaining to domains closely related to Europe’s Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.

Participating partners and their roles: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. The task includes two phases: a design phase and a deployment phase. The design phase involves the following.

● Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.

● Prepare a first draft of the sections for the third-cycle pilots, in which the use cases and workflow from the pilot descriptions are summarized and the technical requirements and an architecture for each pilot-specific platform are drafted.

● Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4; (b) pilot-specific components that are already available; or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.

● Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.

During the deployment phase, work in this task follows and documents the development of the individual components and tests their integration into the platform.


2. Third SC1 Pilot Deployment

2.1 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1 Health, Demographic Change and Wellbeing.

Previous pilots demonstrated reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI, and the ability to manually update some of the underlying datasets. The final pilot will build on the existing system and:

● Establish the effect of regular source data refreshes on data veracity, and whether changes in the underlying data are reflected in the use cases.

● Establish the effect of data model updates on the integration of the OPF datasets.

2.2 Requirements

Table 1 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 1: Requirements of the Third SC1 Pilot

● R1: RDF data storage with support for versioning. Comment: The current Open PHACTS Discovery Platform is based on distributed Virtuoso. Different datasets are stored in different graphs. Using a federation of RDF stores, each holding one dataset, makes it easy to replace a dataset with its updated version, have it checked for veracity and integration, and decide whether it should be rolled back or accepted.

● R2: Datasets are aligned and linked at data ingestion time, and the transformed data is stored. Comment: In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.


2.3 Architecture

Figure 1: Architecture of the Third SC1 pilot

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:

● Distributed triple stores for the live data and for the updates.

● Semagrow for establishing different federations depending on which version of each dataset should be used.

Processing infrastructures:

● Scientific Lenses query expansion.

Other modules:

● Data connector, including the data transformation modules for the alignment of data at ingestion time.

● REST API for querying, which builds a SPARQL query by using keywords to fill in pre-defined query templates (see the sketch below). The querying service also uses Scientific Lenses to expand queries.
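As an illustration of how such a template-based query service can sit on top of the Semagrow federation, the sketch below fills a pre-defined SPARQL template with a keyword and submits it to a SPARQL endpoint. The endpoint URL and the template itself are assumptions made for this example, not the pilot’s actual API.

    # Minimal sketch of a template-based keyword query (illustrative only).
    # Assumes a Semagrow (or other SPARQL) endpoint at the URL below.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://localhost:8080/SemaGrow/sparql"  # assumed endpoint URL

    TEMPLATE = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
        ?entity rdfs:label ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), LCASE("%(keyword)s")))
    } LIMIT 100
    """

    def keyword_query(keyword):
        """Fill the pre-defined template with a keyword and run it on the federation."""
        sparql = SPARQLWrapper(ENDPOINT)
        sparql.setQuery(TEMPLATE % {"keyword": keyword.replace('"', "")})
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()["results"]["bindings"]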


2.4 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 2: Components needed to Deploy the Third SC1 Pilot

● 4store (https://github.com/big-data-europe/docker-4store): BDI dockers made available by WP4. Responsible: NCSR-D.

● Semagrow (https://github.com/big-data-europe/docker-semagrow): BDI dockers made available by WP4. Responsible: NCSR-D.

● Data connector and transformation modules: Already developed in previous pilot cycle. Responsible: VU.

● Query endpoint: Already developed in previous pilot cycle. Responsible: VU.

● Scientific Lenses query expansion module: Already developed in previous pilot cycle. Responsible: VU.


3. Third SC2 Pilot Deployment

3.1 Overview

The pilot is carried out by AK, FAO, and SWC in the frame of SC2 Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy.

The second pilot cycle built upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture. The second pilot demonstrated text mining, data processing, phenological modelling and variety identification workflows.

The third pilot will include:

● An intuitive user interface environment to enable the browsing of the linked information.

3.2 Requirements

Table 3 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 3: Requirements of the Third SC2 Pilot

● R1: Intuitive, easy-to-use interface for browsing and selecting linked information. Comment: To be developed and documented for the pilot.


3.3 Architecture

Figure 2: Architecture of the Third SC2 pilot

To satisfy the requirements above, the following modules should be modified with respect to the second piloting prototype:

Data presentation modules:

● Intuitive user interface that queries the Graph database module (see the sketch below).
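To give a concrete picture of how such a user interface can obtain the linked information, the sketch below shows a backend helper that submits a SPARQL query to the pilot’s GraphDB repository over its REST interface and returns the bindings as JSON for the browsing front-end. The host, repository name and query are assumptions made for this example, not the pilot’s actual configuration.

    # Minimal sketch of a UI backend querying GraphDB (illustrative only).
    import requests

    GRAPHDB = "http://localhost:7200/repositories/viticulture"  # assumed repository

    def browse(term, limit=50):
        """Return resources whose label contains the browsed term."""
        query = """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?s ?label WHERE {
                ?s rdfs:label ?label .
                FILTER(CONTAINS(LCASE(STR(?label)), LCASE("%s")))
            } LIMIT %d
        """ % (term, limit)
        resp = requests.get(GRAPHDB, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
        resp.raise_for_status()
        return resp.json()["results"]["bindings"]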

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to Deploy the Third SC2 Pilot

● Spark over HDFS (https://github.com/big-data-europe/docker-spark), Flume (https://github.com/big-data-europe/docker-flume), Kafka (https://github.com/big-data-europe/docker-kafka): BDI dockers made available by WP4. Responsible: FH, TF, InfAI, SWC.

● GraphDB: The GraphDB official docker package will be used for the pilot. Responsible: SWC.

● Flume agents for publication ingestion and processing: Already developed in previous pilot cycle. Responsible: SWC.

● Flume agents for data ingestion: Already developed in previous pilot cycle. Responsible: SWC, AK.

● Data storage schema: Already developed in previous pilot cycle. Responsible: SWC, AK.

● Phenolic modelling: Already developed in previous pilot cycle. Responsible: AK.

● Spark AKSTEM: Already developed in previous pilot cycle. Responsible: AK.

● Variety Identification: Already developed in previous pilot cycle. Responsible: AK.

● User interface: To be developed for the pilot. Responsible: AK.


4. Third SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3 Secure, Clean and Efficient Energy.

The third pilot cycle demonstrates that Big Data tools can support developers and system operators in exploiting the available coarse weather forecasts in order to achieve localized (point) energy production forecasts.

The following datasets are involved:

● Renewable energy supply (RES) units data that contain information about the identification and geolocation of the units. RES data can be accessed using a web service from the regulator site (http://www.rae.gr/geoserver/wfs?service=wfs&version=2.0.0&request=GetFeature&typeNames=rae_status:V_SDI_R_AIOLIKA13&propertyName=ID1,AA,A_M,POWER_MW,PARATIRISEIS,GEOMETRY).

● Weather Research and Forecasting (WRF) points. The source of the WRF points will be defined during the pilot implementation. The WRF points are files in GRIB format.

● Terrain data obtained from NASA (https://asterweb.jpl.nasa.gov/gdem.asp).

The following processing is carried out:

● Orchestration and data management of background actions such as CFD simulations, correlation optimization, and inclusion of online observations.

The following outputs are made available for visualization or further processing:

● Computational Fluid Dynamics (CFD) simulations produced by OpenFOAM (http://www.openfoam.com/)

● Power production forecasts


4.2 Requirements

Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the third cycle of the pilot extends the previous pilots, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Third SC3 Pilot

● R1: Download RES datasets, WRF points and terrain data. Comment: Data connectors need to be developed.

● R2: Orchestrate the background CFD simulation.


4.3 Architecture

Figure 3: Architecture of the Third SC3 pilot

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:

● HDFS

● PostgreSQL

Processing infrastructures:

● Spark

Other modules:

● A data connector that can ingest WRF points.

● A data connector that can access and ingest RES data from the regulator (see the sketch below).

● Data visualization that can visualize the production forecasts.
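As an indication of what the RES data connector amounts to, the sketch below downloads the RES units from the regulator’s WFS service and stores the raw response in HDFS. The use of the hdfs WebHDFS client package, the namenode address, the HDFS path and the file naming are assumptions made for this example, not the pilot’s actual implementation.

    # Minimal sketch of the RES data connector (illustrative only).
    import datetime
    import requests
    from hdfs import InsecureClient   # WebHDFS client (assumed storage mechanism)

    RES_WFS = ("http://www.rae.gr/geoserver/wfs?service=wfs&version=2.0.0"
               "&request=GetFeature&typeNames=rae_status:V_SDI_R_AIOLIKA13"
               "&propertyName=ID1,AA,A_M,POWER_MW,PARATIRISEIS,GEOMETRY")

    def ingest_res_units(namenode="http://namenode:50070", hdfs_dir="/sc3/res"):
        """Download the RES units from the regulator's WFS and store the raw
        response in HDFS, timestamped so that successive snapshots are kept."""
        response = requests.get(RES_WFS, timeout=60)
        response.raise_for_status()
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
        client = InsecureClient(namenode, user="hdfs")
        client.write("%s/res_units_%s.xml" % (hdfs_dir, stamp),
                     data=response.content, overwrite=True)

    if __name__ == "__main__":
        ingest_res_units()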

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to Deploy the Third SC3 Pilot

● Spark (https://github.com/big-data-europe/docker-spark), HDFS (https://github.com/big-data-europe/docker-hadoop), Postgres (https://github.com/big-data-europe/docker-postgres): BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D.

● RES and terrain data connector: To be developed for the pilot. Responsible: CRES.

● WRF data connector: To be developed for the pilot. Responsible: CRES.

● Data visualization: To be extended for the pilot. Responsible: CRES.


5. Third SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for the ingestion, processing and storage of near real-time and historical floating car data (FCD) in a distributed setting. The pilot demonstrates the following workflows:

● The monitoring of the current traffic conditions in the city of Thessaloniki using the near real-time floating car data available for 1200 vehicles.

● The forecasting of future traffic conditions based on a model trained from the historical and the near real-time FCD data.

The data from each vehicle contains the vehicle’s identifier (valid for only 24 hours), speed, direction, orientation, latitude and longitude. In order to be used as a proxy for the traffic flow in each road segment, the location of the vehicle must be matched with the road segment in which the vehicle is being driven. The near real-time FCD stream is generated by the taxi fleet and is provided via web services. The historical data is provided via FTP and amounts to 40 GB. The road network is extracted from the OpenStreetMap database. The map matching can be performed in parallel and is based on R scripts and stored procedures that make queries on a geographical database using topological rules. The matched records are aggregated in time windows in order to compute the average flow (number of vehicles) and speed within each time window. The result of the aggregation is a set of records containing the identifier of a road segment, the traffic flow (number of vehicles in the time window), the average speed and the timestamp. The result data is stored in a distributed database.
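To make the aggregation step concrete, the sketch below buckets map-matched records into fixed time windows and computes the flow and average speed per road segment. The field names and the five-minute window size are assumptions made for this example; in the pilot this step runs on the streaming infrastructure rather than in plain Python.

    # Sketch of the per-segment, per-window aggregation (illustrative only).
    from collections import defaultdict

    WINDOW_SECONDS = 300  # assumed 5-minute windows

    def aggregate(matched_records):
        """matched_records: iterable of dicts with keys 'segment_id',
        'speed' (km/h) and 'timestamp' (UNIX seconds)."""
        buckets = defaultdict(list)
        for rec in matched_records:
            window_start = int(rec["timestamp"]) // WINDOW_SECONDS * WINDOW_SECONDS
            buckets[(rec["segment_id"], window_start)].append(rec["speed"])
        return [
            {"segment_id": seg,
             "window_start": start,
             "flow": len(speeds),                    # number of vehicles in the window
             "avg_speed": sum(speeds) / len(speeds)}
            for (seg, start), speeds in buckets.items()
        ]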

The objective of the second pilot was, on one hand, to implement a traffic forecasting service and, on the other hand, to deploy the pilot in a distributed setting using Docker Swarm. The forecasting service was based on a neural network that trains a model for each road segment in order to learn the pattern of the traffic flow.

The objective of the third pilot is to improve the prediction algorithm in order to better represent the time correlation of successive counts of the vehicles and the spatial correlation between the road segments, since the number of vehicles in a road segment depends on the number of vehicles in the road segments connected to it. One more task for the third cycle will be to parallelize the training and validation of the models. In order to test the deployment of the docker containers and run them using a docker compose file, we have developed a simplified version of the pilot that uses a subset of the containers and HDFS for the storage of the FCD data and the results of the processing. The simplified version of the pilot matches the vehicles to a cell within a grid that covers the area of Thessaloniki and requires neither the R scripts nor PostGIS. One last task will be the visualization of the forecasts.
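For the simplified version just described, matching a vehicle position to a grid cell can be reduced to simple index arithmetic, as in the sketch below; the bounding box around Thessaloniki and the cell size are assumptions made for this example.

    # Sketch of grid-cell matching for the simplified pilot (illustrative only).
    # Assumed bounding box roughly covering the Thessaloniki area.
    MIN_LAT, MAX_LAT = 40.50, 40.75
    MIN_LON, MAX_LON = 22.80, 23.10
    CELL_DEG = 0.005  # assumed cell size in degrees (roughly 500 m)

    def to_cell(lat, lon):
        """Return a (row, col) grid-cell index, or None if outside the area."""
        if not (MIN_LAT <= lat < MAX_LAT and MIN_LON <= lon < MAX_LON):
            return None
        row = int((lat - MIN_LAT) / CELL_DEG)
        col = int((lon - MIN_LON) / CELL_DEG)
        return (row, col)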

5.2 Requirements

Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 7: Requirements of the Third SC4 Pilot

● R1: The pilot will process the raw FCD to map-match the records within temporal windows. Comment: The map-matched FCD are used to count the number of vehicles in each road segment and the average speed, and to make predictions within different time windows.

● R2: Train a machine learning algorithm to predict the short-term traffic flow and average speed. Comment: The algorithm will use the map-matched records as input.

● R3: The traffic predictions will be saved in Elasticsearch or in a distributed file system (HDFS). Comment: Traffic conditions and predictions will be used for queries, statistics, and evaluation of the quality of the predictions.

● R4: Visualize the short-term forecast of a road segment. Comment: The visualization can be configured in Kibana, showing the time series of past and future counts of the traffic flow and average speed in a road segment.

● R5: The pilot can be started in two configurations: single node (for development and testing) and cluster (production). Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must support clustered deployment of any of the components: the messaging system (Kafka), the processing modules (Flink, PostGIS, R server), and the storage (Elasticsearch, HDFS).


5.3 Architecture

Figure 4: Architecture of the Third SC4 pilot

The architecture of the third pilot will be the same as used for the second pilot.

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to Deploy the Third SC4 Pilot

● Kafka (https://github.com/big-data-europe/docker-kafka), Flink (https://github.com/big-data-europe/docker-flink): BDI docker containers made available by WP4 for the ingestion, processing and message passing. Responsible: NCSR-D, SWC, TF, FhG.

● PostGIS, R server: Docker containers developed for the SC4 pilot to implement the map-matching algorithm. Responsible: CERTH, FhG.

● Machine learning algorithm for traffic flow forecasting: The algorithm will be trained and validated using the FCD data. Responsible: CERTH, FhG.

● HDFS (https://github.com/big-data-europe/docker-hadoop), Elasticsearch (https://github.com/big-data-europe/docker-elasticsearch): Docker containers to store the input FCD data and the output of the forecasting algorithm. Responsible: FhG.

● Time series visualization: Visualize the traffic flow pattern in a road segment and the forecast. The visualization can be based on Kibana. Responsible: FhG.


6. Third SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: a hazardous substance is released in the atmosphere, where the timely and reliable estimation of the release origin and of the expected consequences facilitates informed decision making and timely response. The pilot focuses specifically on situations where no information is known about the release itself except for readings at monitoring stations. Under these circumstances, decision makers need a tool that uses measurements and atmospheric conditions to estimate the source location and the expected dispersion of the plume. Information about the dispersion is used to link against geo-located data about people, infrastructure, industry and other production units, and any other information relevant to potential effects on the population and the environment.

The user accesses the user interface provided by the pilot to explore the effect of the release on people, production and environment, visualizing on a map the dispersion modelling results for the different pollution sources that are consistent with the observed measurements. Given these choices, the application shows on a map the concentration plumes of each station, enriched with numerical information relevant to people, environment, production, etc. about the areas affected by the plume. The user can filter these results by moving a slidebar to, e.g., filter by population to only show metropolitan areas.

The platform initiates

● a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as

● a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request, as sketched below.
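The following is a minimal sketch of the matching step, assuming the precomputed weather patterns are represented by cluster centroids held as NumPy arrays; the feature layout and the Euclidean distance are assumptions made for this example, not the pilot’s actual similarity measure.

    # Sketch of matching current weather against precomputed patterns
    # (illustrative; feature vectors and centroids are assumed to be
    # NumPy arrays produced by the clustering pre-processing step).
    import numpy as np

    def nearest_pattern(current_weather, centroids):
        """Return the index of the precomputed weather pattern whose centroid
        is closest (Euclidean distance) to the current weather feature vector."""
        distances = np.linalg.norm(centroids - current_weather, axis=1)
        return int(np.argmin(distances))

    # The returned index is then used to look up the dispersion patterns that
    # were precomputed for that weather pattern and the candidate sources.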

The following datasets are involved:

● NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF, http://apps.ecmwf.int/datasets)

● GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs/)

● The GeoNames geographical database (http://www.geonames.org/)

The following processing will be carried out:

● The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).

● The WRF downscaling that takes a low-resolution weather field as input and produces a high-resolution one.

● The HYSPLIT (HYbrid Single Particle Lagrangian Integrated Trajectory) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:

● The dispersions produced by HYSPLIT.

● The effect of the dispersion on people, environment and production.

6.2 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.


Table 9: Requirements of the Third SC5 Pilot

● R1: Provide a means of accessing weather data compatible with HYSPLIT. Comment: Data connector/interface. WPS normalization performs the necessary transformations and variable renamings to ensure compatibility.

● R2: Dispersion visualization. Comment: Weather and dispersion matching must produce output that can be stored in Strabon and visualized on Sextant.

● R3: Joining dispersion with RDF data relevant to population, environment, production, etc. Comment: It must be possible to join any geo-tagged dataset (such as population size or land usage) with the dispersion shapes.


6.3 Architecture

Figure 5: Architecture of the Third SC5 pilot

To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure:

● HDFS for storing NetCDF and GRIB files

● Strabon for storing dispersions

● Virtuoso for storing geo-tagged RDF data

● Semagrow for joining dispersion shapes with geo-tagged RDF data

Processing components:

● Scikit-learn to host the weather clustering algorithm (see the sketch below).

Other modules:

● ECMWF and NOAA data connectors

● WPS normalization procedure

● WRF downscaling component

● HYSPLIT atmospheric dispersion model

● Weather and dispersion matching

● Sextant for visualizing the dispersion layer.
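To illustrate the weather clustering pre-processing step, the sketch below uses scikit-learn to group historical weather feature vectors into patterns. The number of clusters, the feature extraction from the NetCDF/GRIB archives and the use of k-means are assumptions made for this example, not necessarily the algorithm used in the pilot.

    # Sketch of weather clustering with scikit-learn (illustrative only).
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def cluster_weather(features, n_clusters=20):
        """features: array of shape (n_days, n_features), e.g. wind speed,
        wind direction and stability indicators sampled over the domain.
        Returns the fitted model and the per-day cluster labels."""
        scaled = StandardScaler().fit_transform(features)
        model = KMeans(n_clusters=n_clusters, random_state=0).fit(scaled)
        return model, model.labels_

    # model.cluster_centers_ then play the role of the weather patterns for
    # which HYSPLIT dispersion simulations are precomputed.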

6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to Deploy the Third SC5 Pilot

● HDFS (https://github.com/big-data-europe/docker-hadoop), Sextant (https://github.com/big-data-europe/docker-sextant), Strabon (https://github.com/big-data-europe/docker-strabon), Virtuoso, Semagrow (https://github.com/big-data-europe/docker-semagrow): BDI dockers made available by WP4. Responsible: TF, UoA, NCSR-D.

● Scikit-learn, HYSPLIT, weather clustering and matching, dispersion matching: Developed or packaged for the second SC5 pilot. Responsible: NCSR-D.

● ECMWF and NOAA data connector: Developed for the second SC5 pilot. Responsible: NCSR-D.

● Data visualization UI: To be developed in the pilot. Responsible: NCSR-D.

7. Third SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world – inclusive, innovative and reflective societies.

The first pilot demonstrated the ingestion, homogenization and visualization of municipality economic data. The second pilot extended the first pilot by incorporating different input formats through a modular parsing library.

The third pilot extends the previous pilots by including budget data from other municipalities and focuses on the mapping between different accounting information systems.

The datasets of the pilot will also be enhanced to include:

● Budget execution data of the Municipality of Kalamaria, Greece,

● Budget execution data of the Municipality of Vienna, Austria,

● Budget execution data of the Municipality of Linz, Austria.

Statistical data will be described using the RDF Data Cube vocabulary (cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/).
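To make the target representation concrete, the sketch below expresses a single budget-execution figure as an RDF Data Cube observation using rdflib. The namespace, the dimension and measure properties and the values are illustrative assumptions, not the pilot’s actual schema or data.

    # Sketch of one budget figure as an RDF Data Cube observation (illustrative).
    from rdflib import Graph, Namespace, Literal, RDF, XSD

    QB = Namespace("http://purl.org/linked-data/cube#")
    EX = Namespace("http://example.org/budget/")          # assumed namespace

    g = Graph()
    obs = EX["obs/kalamaria-2016-00-6011"]                # placeholder identifier
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, EX["dataset/kalamaria-budget-execution"]))
    g.add((obs, EX.refPeriod, Literal("2016", datatype=XSD.gYear)))
    g.add((obs, EX.budgetItem, EX["item/00-6011"]))       # local accounting code
    g.add((obs, EX.amount, Literal("1234567.89", datatype=XSD.decimal)))  # example value

    print(g.serialize(format="turtle"))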

The third pilot will also investigate the possibility of using the Classification of Functions of Government (COFOG, cf. https://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=4) hierarchy in the process of manually mapping the diverse budget item schemes of the different countries.
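Conceptually, such a manual mapping is a lookup from each municipality’s local budget codes to COFOG classes that is applied during homogenization. The sketch below illustrates this with placeholder codes; these are not the actual mappings produced in the pilot.

    # Sketch of applying a manual local-code-to-COFOG mapping (placeholder codes).
    COFOG_MAPPING = {
        ("kalamaria", "00-6011"): "01.1",   # placeholder COFOG group
        ("vienna", "2400"): "09.1",         # placeholder COFOG group
    }

    def homogenize(record):
        """Attach the COFOG class to a parsed budget record, if a mapping exists."""
        key = (record["municipality"], record["budget_code"])
        record["cofog"] = COFOG_MAPPING.get(key)
        return record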

The following processing is carried out:

● Data ingestion and homogenization

● Aggregation, analysis, correlation over the budget data

The following outputs are made available for visualization or further processing:

● Structured information extracted from budget datasets exposed as a SPARQL endpoint

● Metadata for dataset searching and discovery

● Aggregation and analysis

7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 11: Requirements of the Third SC6 Pilot

● R1: Storage of the manual mappings between accounting systems. Comment: The triple store will be used for storing the manual mappings.

● R2: User interface for editing the mappings. Comment: SWC’s PoolParty will be used for editing and reviewing the mappings.

● R3: Transform budget data into a homogenized format using various parsers. Comment: Parsers will be developed (if needed) for the added datasets.

● R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible. Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.


7.3 Architecture

Figure 6: Architecture of the Third SC6 pilot

To satisfy the requirements above, the components needed are the following:

● PoolParty for manually editing the mappings between different accounting systems.

● Virtuoso or 4store for storing the mappings.

● GraphSearch for visualizing the homogenized budget data.

● Data analysis that will be performed on demand by pre-defined queries in the dashboard.

The architecture can remain unchanged since the components needed are already provisioned from the previous pilot cycles.

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to Deploy the Third SC6 Pilot

● 4store (https://github.com/big-data-europe/docker-4store): Already developed in previous pilot cycle. Responsible: NCSR-D, SWC.

● Metadata extraction: Parsers for different data sources will be developed if needed for the pilot. Responsible: SWC.

● GraphSearch GUI: To be extended if needed for the third pilot cycle. Responsible: SWC.


8. Third SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7 Secure societies – Protecting freedom and security of Europe and its citizens.

The previous pilots demonstrated event detection from social media and change detection from ESA Sentinel satellite images. Both detections are parallelized using Big Data technologies.

The third cycle focuses on improving the detection and enhancing the presentation of the detected events. The following inputs will be used:

● Sentinel 2 satellite images, which will add the possibility of visualizing RGB images and will also support the ground-truth identification of changes from the Sentinel 1 processing workflow.

● Flickr geotagged images, which will be presented to the user along with a detected event (see the sketch below).
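As an indication of how the Flickr ingestion can work, the sketch below queries the Flickr photo search API for geotagged images within a bounding box around a detected event and stores their metadata in Cassandra. The API key handling, bounding box, keyspace and table schema are assumptions made for this example; in the pilot this is done by adapting the NOMAD data connectors.

    # Sketch of a Flickr-to-Cassandra connector (illustrative only).
    import requests
    from cassandra.cluster import Cluster

    FLICKR_REST = "https://api.flickr.com/services/rest/"

    def ingest_flickr(api_key, bbox, keyspace="sc7"):
        """bbox: 'min_lon,min_lat,max_lon,max_lat' around the detected event."""
        params = {
            "method": "flickr.photos.search",
            "api_key": api_key,
            "bbox": bbox,
            "has_geo": 1,
            "extras": "geo,date_taken,url_m",
            "format": "json",
            "nojsoncallback": 1,
        }
        photos = requests.get(FLICKR_REST, params=params, timeout=30).json()
        session = Cluster(["cassandra"]).connect(keyspace)   # assumed keyspace/table
        for p in photos["photos"]["photo"]:
            session.execute(
                "INSERT INTO flickr_images (id, lat, lon, taken, url) "
                "VALUES (%s, %s, %s, %s, %s)",
                (p["id"], float(p["latitude"]), float(p["longitude"]),
                 p["datetaken"], p.get("url_m")))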

8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the third cycle of the pilot. Since the present pilot cycle is an extension of the previous pilots, their requirements also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Third SC7 Pilot

● R1: Ingestion and storage of Flickr geotagged images. Comment: The NOMAD data connectors will be adapted in order to access the Flickr API and store the results to Cassandra.

● R2: Ingestion of Sentinel 2 satellite images. Comment: The ingestion procedure for Sentinel 1 data will be adapted if needed to also support Sentinel 2 data.

● R3: Flickr images and Sentinel 2 images must be visualized in the user interface. Comment: The user interface based on Sextant will be extended to support the new information.

8.3 Architecture

Figure 7: Architecture of the Third SC7 pilot

To satisfy the requirements above, the components needed are the following:

● HDFS for storing Sentinel 2 images

● Cassandra for storing Flickr metadata

● Sextant for displaying the additional information

The architecture can remain unchanged since the components needed are already provisioned from the previous pilot cycles.

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI (cf. https://github.com/big-data-europe/README/wiki/Components) and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to Deploy the Third SC7 Pilot

● Big Data Integrator: HDFS/Hadoop (https://github.com/big-data-europe/docker-hadoop), Cassandra (https://github.com/big-data-europe/docker-cassandra), Spark (https://github.com/big-data-europe/docker-spark), Semagrow (https://github.com/big-data-europe/docker-semagrow), Strabon (https://github.com/big-data-europe/docker-strabon), SOLR (https://github.com/big-data-europe/docker-solr): Already developed in previous pilot cycle. Responsible: FH, TF, InfAI, NCSR-D, UoA, SWC.

● Cassandra and Strabon stores: Already developed in previous pilot cycle. Responsible: NCSR-D and UoA.

● Change detection module: Already developed in previous pilot cycle. Responsible: UoA.

● Event detection module: Already developed in previous pilot cycle. Responsible: NCSR-D.

● Sentinel 2 data connector: To be extended to access Sentinel 2 data. Responsible: UoA.

● Flickr data connector: To be extended to access the Flickr API. Responsible: NCSR-D.

● User interface: To be enhanced for the pilot. Responsible: UoA.


9. Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the third piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4 and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. The lessons learned from the pilot deployments will be reported in the relevant WP6 evaluation deliverable (D6.8). As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the second public release of the BDI platform (D4.3) and can be reproduced by interested third parties.