support action big data europe empowering communities … · domain-specific big data integrator...
TRANSCRIPT
Project funded by the European Union’s Horizon 2020 Research and Innovation Programme (2014 – 2020)
Support Action
Big Data Europe – Empowering Communities with
Data Technologies
Project Number: 644564 Start Date of Project: 01/01/2015 Duration: 36 months
Deliverable 5.6
Domain-Specific Big Data Integrator Instances III
Dissemination Level Public
Due Date of Deliverable Month 32, 01/09/2017
Actual Submission Date 27/09/2017
Work Package WP5, Big Data Integrator Instances
Task T5.2
Type Other
Approval Status Approved
Version 1.00
Number of Pages 41
Filename D5.6_Domain-Specific Big Data Integrator
Instances III
Abstract: Documentation of the Big Data Integrator Instances deployed for executing
the pilots of the Big Data Integrator across all seven societal challenges.
The information in this document reflects only the author’s views and the European Community is not liable for any use
that may be made of the information contained therein. The information in this document is provided “as is” without
guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a
particular purpose. The user thereof uses the information at his/ her sole risk and liability.
Ref. Ares(2017)4715575 - 27/09/2017
D5.6 – v. 1.00
Page
2
History
Version Date Reason Revised by
0.00 21/01/2016 Document structure S. Konstantopoulos
0.01 25/8/2017 First draft based on descriptions of
third piloting cycle.
A. Charalambidis,
S. Konstantopoulos
0.02 30/1/2017 Pilot corrections on SC5 T. Davvetas,
I. Mouchakis
0.03 11/9/2017 SC4 pilot description L. Selmi
0.08 15/9/2017 Peer review A. Versteden and H.
Jabeen
0.09 25/9/2017 Address peer review comments A. Charalambidis,
S. Konstantopoulos
1.00 27/9/2017 Final version
D5.6 – v. 1.00
Page
3
Author List
Organisation Name Contact Information
NCSR-D Ioannis Mouchakis [email protected]
NCSR-D Stasinos Konstantopoulos [email protected]
NCSR-D Angelos Charalambidis [email protected]
NCSR-D Georgios Stavrinos [email protected]
NCSR-D Vangelis Karkaletsis [email protected]
NCSR-D Thanasis Davvetas [email protected]
TenForce Aad Versteden [email protected]
UoB Hajira Jabeen [email protected]
SWC Jürgen Jakobitsch [email protected]
FhG Luigi Selmi [email protected]
D5.6 – v. 1.00
Page
4
Executive Summary
This report documents the instantiations of the Big Data Integrator Platform that underlies the
pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon
2020 Societal Challenges. These platform instances will be provided to the relevant networking
partners to be used for executing the pilot sessions foreseen in WP6.
For each of the seven pilots, this document provides (a) a brief summary of the pilot description
prepared in WP6, and especially of the use cases provided in the pilot descriptions; (b) the
technical requirements for carrying out these use cases; (c) an architecture that shows the BDI
components required to cover these requirements; and (d) the list of components in the
architecture and their status (available as part of BDI, or otherwise available, or to be
developed as part of the pilot).
D5.6 – v. 1.00
Page
5
Abbreviations and Acronyms
BDI
The Big Data Integrator platform that is developed within Big Data Europe.
The components that are made available to the pilots by BDI are listed
here: https://github.com/big-data-europe/README/wiki/Components
BDI
Instance
A specific deployment of BDI complemented by tools specifically
supporting a given Big Data Europe pilot
BT Bluetooth
ECMWF European Centre for Medium range Weather Forecasting
ESGF Earth System Grid Federation
FCD Floating Car Data
LOD Linked Open Data
SC1 Societal Challenge 1: Health, Demographic Change and Wellbeing
SC2 Societal Challenge 2: Food Security, Sustainable Agriculture and Forestry,
Marine, Maritime and Inland Water Research and the Bioeconomy
SC3 Societal Challenge 3: Secure, Clean and Efficient Energy
SC4 Societal Challenge 4: Smart, Green and Integrated Transport
SC5 Societal Challenge 5: Climate Action, Environment, Resource Efficiency
and Raw Materials
SC6 Societal Challenge 6: Europe in a changing world – Inclusive, innovative
and reflective societies
SC7 Societal Challenge 7: Secure societies – Protecting freedom and security
of Europe and its citizens
AK Agroknow, Belgium
CERTH Centre for Research and Technology, Greece
CRES Center for Renewable Energy Sources and Saving, Greece
FAO Food and Agriculture Organization of the United Nations, Italy
D5.6 – v. 1.00
Page
6
FhG Fraunhofer IAIS, Germany
InfAI Institute for Applied Informatics, Germany
NCSR-D National Center for Scientific Research “Demokritos”, Greece
OPF Open PHACTS Foundation, UK
SWC Semantic Web Company, Austria
UoA National and Kapodistrian University of Athens
VU Vrije Universiteit Amsterdam, the Netherlands
D5.6 – v. 1.00
Page
7
Table of contents 1. Introduction ............................................................................................................. 9
1.1 Purpose and Scope .......................................................................................... 9
1.2 Methodology .................................................................................................... 9
2. Third SC1 Pilot Deployment ...................................................................................11
2.1 Use Cases ......................................................................................................11
2.2 Requirements ..................................................................................................11
2.3 Architecture .....................................................................................................13
2.4 Deployment .....................................................................................................14
3. Third SC2 Pilot Deployment ...................................................................................15
3.1 Overview .........................................................................................................15
3.2 Requirements ..................................................................................................15
3.3 Architecture .....................................................................................................16
3.4 Deployment .....................................................................................................16
4. Third SC3 Pilot Deployment ...................................................................................18
4.1 Overview .........................................................................................................18
4.2 Requirements ..................................................................................................19
4.3 Architecture .....................................................................................................20
4.4 Deployment .....................................................................................................21
5. Third SC4 Pilot Deployment ...................................................................................22
5.1 Use cases .......................................................................................................22
5.2 Requirements ..................................................................................................23
5.3 Architecture .....................................................................................................25
5.4 Deployment .....................................................................................................25
6. Third SC5 Pilot Deployment ...................................................................................27
6.1 Use cases .......................................................................................................27
6.2 Requirements ..................................................................................................28
6.3 Architecture .....................................................................................................30
6.4 Deployment .....................................................................................................31
D5.6 – v. 1.00
Page
8
7. Third SC6 Pilot Deployment ...................................................................................32
7.1 Use cases .......................................................................................................32
7.2 Requirements ..................................................................................................33
7.3 Architecture .....................................................................................................35
7.4 Deployment .....................................................................................................35
8. Third SC7 Pilot Deployment ...................................................................................37
8.1 Use cases .......................................................................................................37
8.2 Requirements ..................................................................................................37
8.3 Architecture .....................................................................................................39
8.4 Deployment .....................................................................................................39
9. Conclusions............................................................................................................41
D5.6 – v. 1.00
Page
9
1. Introduction
1.1 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving
the needs of the domains examined within Big Data Europe. These platform instances will be
provided to the relevant networking partners to execute the pilots foreseen in WP6.
1.2 Methodology
Task 5.2 focuses on the application of the generic Instantiation methodology in a specific Use
Case pertaining to domains closely related to Europe’s Social challenges. To this end, T5.2
comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.
Participating partners and their role: NCSR-D (task leader) deploys the different instantiations
of the Big Data Integrator Platform and supports the partners carrying out each pilot with
consulting about the platform. This task includes two phases: the design and the deployment
phase. The design phase involves the following.
● Review the pilot descriptions prepared in WP6 and request clarifications where needed
in order to prepare a detailed technical description of the platform that will support the
pilot.
● Prepare a first draft of the sections for the third cycle pilots, where use cases and
workflow from the pilot descriptions are summarized and technical requirements and
an architecture for each pilot-specific platform is drafted.
● Cooperate with the persons responsible for each pilot to update the pilot description
and the technical description in this deliverable so that they are consistent and
satisfactory. This draft also includes a list of components and their availability: (a) base
platform components that are prepared in WP4; (b) pilot-specific components that are
D5.6 – v. 1.00
Page
10
already available; or (c) pilot-specific components that will be developed for the pilot.
Components are also assigned a partner responsible for their implementation.
● Review the pilot technical descriptions from the perspective of bridging between
technical work and the community requirements, to establish that the pilot is relevant
to the communities it is aimed at.
During deployment phase, work in this task will follow and document development of the
individual components and test their integration into the platform.
D5.6 – v. 1.00
Page
11
2. Third SC1 Pilot Deployment
2.1 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1 Health, Demographic Change and
Wellbeing.
Previous pilots demonstrated reproducing the functionality of an existing data integration and
processing system (the Open PHACTS Discovery Platform) on BDI and ability of manually
updating some underlying datasets. The final pilot will build on the existing system and:
● Establish the effect of regular source data refreshers on data veracity, and whether
changes in the underlying data are reflected in use cases.
● Establish the effect of data model updates to the integration of the OPF datasets.
2.2 Requirements
Table 1 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 1: Requirements of the Third SC1 Pilot
Requirement Comment
R1 RDF data storage with support
for versioning
The current Open PHACTS Discovery Platform is
based on distributed Virtuoso. Different datasets
are stored in different graphs. Using a federation
of RDF stores, each holding one dataset, allows to
easily replace a dataset with its updated version,
have it checked for veracity and integration and
decide it would be rolled back or accepted.
R2 Datasets are aligned and linked In conjunction with R1, a modular data ingestion
D5.6 – v. 1.00
Page
12
at data ingestion time, and the
transformed data is stored.
component should dynamically decide which data
transformers to invoke.
D5.6 – v. 1.00
Page
13
Figure 1: Architecture of the Third SC1 pilot
2.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
● Distributed triple stores for the live data and for the updates.
● Semagrow for establishing different federations depending on which version of each
dataset should be used.
Processing infrastructures:
● Scientific Lenses query expansion
Other modules:
● Data connector, including the data transformation modules for the alignment of data at
ingestion time
● REST API for querying, that builds a SPARQL query by using keywords to fill in pre-
defined query templates. The querying services also uses Scientific Lenses to expand
queries.
D5.6 – v. 1.00
Page
14
2.4 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
Table 2: Components needed to Deploy Third SC1 Pilot
Module Task Responsible
4store1 BDI dockers made available by WP4. NCSR-D
Semagrow2 BDI dockers made available by WP4. NCSR-D
Data connector and
transformation modules
Already developed in previous pilot
cycle.
VU
Query endpoint Already developed in previous pilot
cycle.
VU
Scientific Lenses query
expansion module
Already developed in previous pilot
cycle.
VU
1 https://github.com/big-data-europe/docker-4store 2 https://github.com/big-data-europe/docker-semagrow
D5.6 – v. 1.00
Page
15
3. Third SC2 Pilot Deployment
3.1 Overview
The pilot is carried out by AK, FAO, and SWC in the frame of SC2 Food Security, Sustainable
Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy.
The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture. The second pilot demonstrated text
mining, data processing, phenologic modelling and variety identification workflows.
The third pilot will include:
● An intuitive user interface environment to enable the browsing of the linked information.
3.2 Requirements
Table 3 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 3: Requirements of Third SC2 Pilot
Requirement Comment
R1 Intuitive, easy-to-use, interface for
browsing and selecting linked
information.
To be developed and documented for the
pilot.
D5.6 – v. 1.00
Page
16
Figure 2: Architecture of the Third SC2 pilot
3.3 Architecture
To satisfy the requirements above, the following modules should be modified with respect to
the second piloting prototype:
Data presentation modules:
● Intuitive user interface that queries the Graph database module.
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
Table 4: Components needed to Deploy Third SC2 Pilot
D5.6 – v. 1.00
Page
17
Module Task Responsible
Spark over HDFS3, Flume4,
Kafka5
BDI dockers made available by WP4. FH, TF, InfAI,
SWC
GraphDB GraphDB official docker package will
be used for the pilot.
SWC
Flume agents for publication
ingestion and processing
Already developed in previous pilot
cycle.
SWC
Flume agents for data
ingestion
Already developed in previous pilot
cycle.
SWC, AK
Data storage schema Already developed in previous pilot
cycle.
SWC, AK
Phenolic modelling Already developed in previous pilot
cycle.
AK
Spark AKSTEM Already developed in previous pilot
cycle.
AK
Variety Identification Already developed in previous pilot
cycle.
AK
User interface To be developed for the pilot. AK
3 https://github.com/big-data-europe/docker-spark 4 https://github.com/big-data-europe/docker-flume 5 https://github.com/big-data-europe/docker-kafka
D5.6 – v. 1.00
Page
18
4. Third SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3 Secure, Clean and Efficient Energy.
The third pilot cycle demonstrates that the Big Data tools can support developers and system
operators for the exploitation of the available coarse weather forecast in order to achieve
localized (point) energy production forecasts.
The following datasets are involved:
● Renewable energy supply (RES) units data that contain information about the
identification and geolocation of the units. RES data can be accessed using a web
service6 from the regulator site.
● Weather Research and Forecasting (WRF) points. The source of the WRF points will
be defined during the pilot implementation. The WRF points are files in GRIB format.
● Terrain data obtained from NASA7.
The following processing is carried out:
● Orchestration and data management of background actions such as CFD simulations,
correlation optimization, inclusion of on line observations.
The following outputs are made available for visualization or further processing:
● Computational Fluid Dynamics (CFD) simulations produced by OpenFOAM8
● Power production forecasts
6http://www.rae.gr/geoserver/wfs?service=wfs&version=2.0.0&request=GetFeature&typeN
ames=rae_status:V_SDI_R_AIOLIKA13&propertyName=ID1,AA,A_M,POWER_MW,PARATIRISEIS,G
EOMETRY 7 https://asterweb.jpl.nasa.gov/gdem.asp 8 http://www.openfoam.com/
D5.6 – v. 1.00
Page
19
4.2 Requirements
Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5.
Table 5: Requirements of Third SC3 Pilot
Requirement Comment
R1 Download RES datasets, WRF
points and terrain data+
Data connectors need to be developed.
R2 Orchestrate the background
CFD simulation
D5.6 – v. 1.00
Page
20
Figure 3: Architecture of the Third SC3 pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
● HDFS
● PostgreSQL
Processing infrastructures:
● Spark
Other modules:
● A data connector that can ingest WRF points.
D5.6 – v. 1.00
Page
21
● A data connector that can access and ingest RES data from regulator.
● Data visualization that can visualize the production forecasts.
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
Table 6: Components needed to Third Second SC3 Pilot
Module Task Responsible
Spark9, HDFS10, Postgres11 BDI dockers made available by WP4. FH, TF, InfAI,
NCSR-D
RES and Terrain data
connector
To be developed for the pilot. CRES
WRF Data connector To be developed for the pilot. CRES
Data visualization To be extended for the pilot. CRES
9 https://github.com/big-data-europe/docker-spark 10 https://github.com/big-data-europe/docker-hadoop 11 https://github.com/big-data-europe/docker-postgres
D5.6 – v. 1.00
Page
22
5. Third SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart, Green and Integrated
Transport.
The pilot demonstrates how to implement the workflow for the ingestion, processing and
storage of near real-time and historical floating car data (FCD) in a distributed setting. The pilot
demonstrates the following workflows:
● The monitoring of the current traffic conditions in the city of Thessaloniki using the near
real-time floating car data available for 1200 vehicles.
● The forecasting of future traffic conditions based on a model trained from the historical
and the near real-time FCD data.
The data from each vehicle contains the vehicle’s identifier, valid only 24 hours, speed,
direction, orientation, latitude and longitude. In order to be used as a proxy for the traffic flow
in each road segment the location of the vehicle must be matched with the road segment in
which the vehicle is being driven. The near real-time Floating Car Data (FCD) stream is
generated by the taxi fleet and is provided via web services. The historical data is provided via
FTP and contains 40 GB of data. The road network is extracted from the OpenStreetMap
database. The map matching can be performed in parallel and is based on R scripts and stored
procedures that make queries on a geographical database using topological rules. The
matched records are aggregated in time windows in order to compute the average flow
(number of vehicles) and speed within each time window. The result of the aggregation are
records containing the identifier of a road segment, the traffic flow (number of vehicles in the
time window), the average speed and the timestamp. The result data is stored in a distributed
database.
The objective of the second pilot was on one hand to implement a traffic forecasting service
and on the other hand to deploy the pilot in a distributed setting using Docker swarm. The
D5.6 – v. 1.00
Page
23
forecasting service was based on a neural network to train a model for each road segment in
order to learn the pattern of the traffic flow.
The objective of the third pilot is to improve the prediction algorithm in order to better represent
the time correlation of successive counts of the vehicles and the spatial correlation between
the road segments since the number of vehicles in a road segment depends on the number of
vehicles in road segments that are connected. One more task for the 3rd cycle will be to
parallelize the training and validation of the models. In order to test the deployment of docker
containers and run them using a docker compose file we have developed a simplified version
of the pilot that uses a subset of the containers and HDFS for the storage of the FCD data and
the result of the processing. The simplified version of the pilot matches the vehicles to a cell
within a grid that covers the area of Thessaloniki and it doesn’t require the R scripts nor
PostGis. One last work will be the visualization of the forecasts.
5.2 Requirements
Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 7: Requirements of Third SC4 Pilot
Requirement Comment
R1 The pilot will processes the raw
FCD to map match the records
within temporal windows
The FCD map matched data are used to count the
number of vehicles in each road segment and the
average speed and to make predictions within
different time windows.
R2 Train a machine learning
algorithm to predict the short-
term traffic flow and average
speed
The algorithm will use the map matched records as
input
D5.6 – v. 1.00
Page
24
R3 The traffic predictions will be
saved in Elasticsearch or in a
distributed file system (HDFS)
Traffic condition and prediction will be used for
queries, statistics, evaluation of the quality of
predictions
R4 Visualize the short-term forecast
of a road segment
The visualization can be configured in Kibana
showing the time series of past and future counts
of the traffic flow and average speed in a road
segment
R5 The pilot can be started in two
configurations: single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes. The cluster configuration must provide
cluster of any components: messaging system
(Kafka), processing modules (Flink, PostGis, R
server), storage (Elasticsearch, HDFS)
D5.6 – v. 1.00
Page
25
Figure 4: Architecture of the Third SC4 pilot
5.3 Architecture
The architecture of the third pilot will be the same as used for the second pilot.
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
Table 8: Components needed to Deploy Third SC4 Pilot
Module Task Responsible
D5.6 – v. 1.00
Page
26
Kafka12, Flink13 BDI docker containers made
available by WP4 for the ingestion,
processing and message passing
NCSR-D, SWC,
TF, FhG.
PostGIS, R server Docker containers developed for the
SC4 pilot to implement the map-
matching algorithm
CERTH, FhG
Machine learning algorithm
for traffic flow forecasting
The algorithm will be trained and
validated using the FCD data
CERTH, FhG
HDFS14, Elasticsearch15 Docker containers to store the input
FCD data and the output of the
forecasting algorithm
FhG
Time series visualization Visualize the traffic flow pattern in a
road segment and the forecast. The
visualization can be based on Kibana
FhG
12 https://github.com/big-data-europe/docker-kafka 13 https://github.com/big-data-europe/docker-flink 14 https://github.com/big-data-europe/docker-hadoop 15 https://github.com/big-data-europe/docker-elasticsearch
D5.6 – v. 1.00
Page
27
6. Third SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action, Environment, Resource
Efficiency and Raw Materials.
The pilot demonstrates the following workflow: A hazardous substance is released in the
atmosphere, where the timely and reliable estimation of the release origin and of the expected
consequences facilitates informed decision making and timely response. The pilot focuses
specifically on situations where no information is known about the release itself except from
readings at monitoring stations. Under these circumstances, decision makers need a tool that
uses measurements and atmospheric conditions to estimate the source location and the
expected dispersion of the plume. Information about the dispersion is used to link against geo-
located data about people, infrastructure, industry and other production units, and any other
information relevant to potential effects on the population and the environment.
The user accesses user interface provided by the pilot to explore the effect of the release on
people, production and environment visualizing on a map dispersion modelling results for the
different pollution sources that are consistent with the observed measurements. Given these
choices, the application shows on a map the concentration plumes of each station, enriched
with numerical information relevant to people, environment, production, etc. about the areas
affected by the plume. The user can filter these results by moving a slidebar to, e.g., filter by
population to only show metropolitan areas.
The platform initiates
● a weather matching algorithm, that is a search for similarity of the current weather and
the pre-computed weather patterns, as well as
● a dispersion matching algorithm, that is a search for similarity of the current substance
dispersion patterns with the precomputed ones.
D5.6 – v. 1.00
Page
28
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past, while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions.
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:
● NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF16)
● GRIB files from National Oceanic and Atmospheric Administration (NOAA17)
● GeoNames18 geographical database.
The following processing will be carried out:
● The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 6.3).
● The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather.
● The HYSPLIT (HYbrid Single Particle Lagrangian Integrated Trajectory model)
atmospheric dispersion model computes dispersion patterns given predominant
weather conditions.
The following outputs are made available for visualization or further processing:
● The dispersions produced by HYSPLIT.
● Effect of the dispersion on people, environment, production.
6.2 Requirements
Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.
16 http://apps.ecmwf.int/datasets 17 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs/ 18 http://www.geonames.org/
D5.6 – v. 1.00
Page
29
Table 9: Requirements of Third SC5 Pilot
Requirement Comment
R1 Provide a means of accessing
weather data compatible with
HYSPLIT.
Data connector/interface. WPS normalization
performs the necessary transformations and
variable renamings to ensure compatibility.
R2 Dispersion visualization Weather and dispersion matching must produce
output that can be stored in Strabon/visualized on
Sextant.
R3 Joining dispersion with RDF data
relevant to population,
environment, production, etc.
It must be possible to join any geo-tagged dataset
(such as population size, land usage) with the
dispersion shapes.
D5.6 – v. 1.00
Page
30
Figure 5: Architecture of the Third SC5 pilot
6.3 Architecture
To satisfy the requirements described above the following components will be deployed:
Storage infrastructure:
● HDFS for storing NetCDF and GRIB files
● Strabon for storing dispersions
● Virtuoso for storing geo-tagged RDF data
● Semagrow for joining dispersion shapes with geo-tagged RDF data
Processing components:
● Scikit-learn to host the weather clustering algorithm.
Other modules:
● ECMWF and NOAA data connectors
D5.6 – v. 1.00
Page
31
● WPS normalization procedure
● WRF downscaling component
● HYSPLIT atmospheric dispersion model
● Weather and dispersion matching
● Sextant for visualizing the dispersion layer.
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
Table 10: Components needed to Deploy Third SC5 Pilot
Module Task Responsible
HDFS19, Sextant20,
Strabon21, Virtuoso,
Semagrow22
BDI dockers made available by WP4. TF, UoA, NCSR-D
Scikit-learn, HYSPLIT,
weather clustering and
matching, dispersion
matching
Developed or packaged for the
second SC5 pilot.
NCSR-D
19 https://github.com/big-data-europe/docker-hadoop 20 https://github.com/big-data-europe/docker-sextant 21 https://github.com/big-data-europe/docker-strabon 22 https://github.com/big-data-europe/docker-semagrow
D5.6 – v. 1.00
Page
32
ECMWF and NOAA data
connector
Developed for the second SC5 pilot. NCSR-D
Data visualization UI To be developed in the pilot. NCSR-D
7. Third SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D, and SWC in the frame of SC6 Europe in a changing world
- inclusive, innovative and reflective societies.
The first pilot demonstrated the ingestion, homogenization and visualization of municipality
economic data. The second pilot extended the first pilot by incorporating different input formats
by developing a modular parsing library.
The third pilot extends the previous pilots by including budget data from other municipalities
and focus on the mapping of different accounting information systems.
The datasets of the pilot will also enhanced including:
● Budget execution data of Municipality of Kalamaria, Greece,
● Budget execution data of Municipality of Vienna, Austria,
● Budget execution data of Municipality of Linz, Austria.
Statistical data will be described in the RDF DataCube23 vocabulary.
23 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
D5.6 – v. 1.00
Page
33
The third pilot will also investigate the possibility to use Classification of Functions of
Government (COFOG24) hierarchy in the process of manual mapping the diverse budget item
scheme from different countries.
The following processing is carried out:
● Data ingestion and homogenization
● Aggregation, analysis, correlation over scientific data
The following outputs are made available for visualization or further processing:
● Structured information extracted from budget datasets exposed as a SPARQL endpoint
● Metadata for dataset searching and discovery
● Aggregation and analysis
7.2 Requirements
Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 11: Requirements of Third SC6 Pilot
Requirement Comment
R1 Storage of the manual mappings
between accounting systems
The triple store will be used for storing
manual mappings.
R2 User interface for editing the mappings SWC’s pool party will be used for the
editing and reviewing of the mappings
R3 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed (if needed) for
the added datasets.
24 Cf. https://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=4
D5.6 – v. 1.00
Page
34
R4 Intuitive, easy-to-use, interface for
searching and selecting relevant data
sources. The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible.
The GraphSearch UI will be used to create
visualizations from SPARQL queries.
D5.6 – v. 1.00
Page
35
Figure 6: Architecture of the Third SC6 pilot
7.3 Architecture
To satisfy the requirements above, the components needed are the following:
● PoolParty for manually editing the mappings between different accounting systems.
● Virtuoso or 4store for storing the mappings,
● GraphSearch for visualizing the homogenized budget data
● Data analysis that will be performed on demand by pre-defined queries in the
dashboard.
The architecture can remain unchanged since the components needed are already provisioned
from the previous pilot cycles.
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot.
D5.6 – v. 1.00
Page
36
Table 12: Components needed to Deploy Third SC6 Pilot
Module Task Responsible
4store25 Already developed in previous pilot
cycle.
NCSR-D, SWC
Metadata extraction Parsers for different data sources will
be developed if needed for the pilot.
SWC
GraphSearch GUI To be extended if needed for the third
pilot cycle.
SWC
25 https://github.com/big-data-europe/docker-4store
D5.6 – v. 1.00
Page
37
8. Third SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7 Secure societies –
Protecting freedom and security of Europe and its citizens.
The previous pilots demonstrated the event detection from social media and the change
detection from ESA Sentinels satellite images. Both detections work in a parallel fashion using
Big Data technologies.
The third cycle focused on improving the detection and enhancing the presentation of the
detected events. The following inputs will be used:
● Sentinel 2 satellite images that will add the possibility to visualize RGB images and will
also constitute a support to the ground truth identification of changes from the Sentinel
1 processing workflow.
● Flickr geotagged images that will be presented to the user along with a detected event.
8.2 Requirements
Table 13 lists the ingestion, storage, processing, and output requirements set by the second
cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements
of the first pilot also apply. Table 13 lists only the new requirements.
Table 13: Requirements of Third SC7 Pilot
Requirement Comment
R1 Ingestion and storage of Flickr
geotagged images.
The NOMAD data connectors will be
adapted in order to access API of Flickr
D5.6 – v. 1.00
Page
38
and store to Cassandra.
R2 Ingestion of Sentinel 2 satellite images The ingestion procedure for Sentinel 1 data
will be adapted if needed to also support
Sentinel 2 data.
R3 Flickr images and Sentinel 2 images
must be visualized in the user interface
The user interface based on Sextant will be
extended to support the new information.
Figure 7: Architecture of the Third SC7 pilot
D5.6 – v. 1.00
Page
39
8.3 Architecture
To satisfy the requirements above, the components needed are the following:
● HDFS for storing Sentinel 2 images
● Cassandra for storing Flickr metadata
● Sextant for displaying the additional information
The architecture can remain unchanged since the components needed are already provisioned
from the previous pilot cycles.
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI26 and components that will
be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to Deploy Third SC7 Pilot
Module Task Responsible
Big Data Integrator:
HDFS/Hadoop27,
Cassandra28, Spark29,
Semagrow30, Strabon31,
SOLR32
Already developed in previous pilot
cycle.
FH, TF, InfAI,
NCSR-D, UoA,
SwC
26 Cf. https://github.com/big-data-europe/README/wiki/Components 27 https://github.com/big-data-europe/docker-hadoop 28 https://github.com/big-data-europe/docker-cassandra 29 https://github.com/big-data-europe/docker-spark 30 https://github.com/big-data-europe/docker-semagrow 31 https://github.com/big-data-europe/docker-strabon 32 https://github.com/big-data-europe/docker-solr
D5.6 – v. 1.00
Page
40
Cassandra and Strabon
stores
Already developed in previous pilot
cycle.
NCSR-D and
UoA
Change detection module Already developed in previous pilot
cycle.
UoA
Event Detection module Already developed in previous pilot
cycle.
NCSR-D
Sentinel 2 data connector To be extended to access Sentinel 2
data
UoA
Flickr data connector To be extended to access the Flickr
API.
NCSR-D
User interface To be enhanced for the pilot. UoA
D5.6 – v. 1.00
Page
41
9. Conclusions
This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the third piloting round.
The relevant work in this task is to ensure that the components are within the scope of what is
prepared in WP4 and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and
provided to the piloting partners as a basis for their piloting applications, which will be
developed in WP6. The lessons learned from the pilot deployments will be reported in the
relevant WP6 evaluation deliverable (D6.8). As a result of this preliminary testing and the
interaction between the technical partners and the piloting partners, some of the original pilot
descriptions have been refined and fully specified and their usage of BDI components has
been clarified. This ensures that the pilot descriptions are consistent with the second public
release of the BDI platform (D4.3) and can be reproduced by interested third parties.