d2.2 euxdat e-infrastructure definition · 2018-11-12 · also, the consortium experience will...

53
This document is issued within the frame and for the purpose of the EUXDAT project. This project has received funding from the European Union’s Horizon2020 Framework Programme under Grant Agreement No. 777549. The opinions expressed and arguments employed herein do not necessarily reflect the official views of the European Commission. This document and its content are the property of the EUXDAT Consortium. All rights relevant to this document are determined by the applicable laws. Access to this document does not grant any right or license on the document or its contents. This document or its contents are not to be used or treated in any manner inconsistent with the rights or interests of the EUXDAT Consortium or the Partners detriment and are not to be disclosed externally without prior written consent from the EUXDAT Partners. Each EUXDAT Partner may use this document in conformity with the EUXDAT Consortium Grant Agreement provisions. (*) Dissemination level.-PU: Public, fully open, e.g. web; CO: Confidential, restricted under conditions set out in Model Grant Agreement; CI: Classified, Int = Internal Working Document, information as referred to in Commission Decision 2001/844/EC. D2.2 EUXDAT e-Infrastructure Definition Keywords: Data Analytics, Big Data, e-Infrastructure, Architecture, Design, EUXDAT Document Identification Status Final Due Date 30/04/2018 Version 1.0 Submission Date 31/05/2018 Related WP WP2 Document Reference D2.2 Related Deliverable(s) D2.1 Dissemination Level (*) PU Lead Participant F. Javier Nieto (ATOSES) Lead Author F. Javier Nieto (ATOSES) Contributors ATOSES, WRLS, USTUTT, ATOSFR, CERTH Reviewers Marcela Doubkova (PESSL) Karl G. Gutbrod (Meteoblue)

Upload: others

Post on 28-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

This document is issued within the frame and for the purpose of the EUXDAT project. This project has received funding from the European Union’s Horizon2020 Framework Programme under Grant Agreement No. 777549. The opinions expressed and arguments employed herein do not necessarily reflect the official views of the European Commission.

This document and its content are the property of the EUXDAT Consortium. All rights relevant to this document are determined by the applicable laws. Access to this document does not grant any right or license on the document or its contents. This document or its contents are not to be used or treated in any manner inconsistent with the rights or interests of the EUXDAT Consortium or the Partners detriment and are not to be disclosed externally without prior written consent from the EUXDAT Partners.

Each EUXDAT Partner may use this document in conformity with the EUXDAT Consortium Grant Agreement provisions.

(*) Dissemination level.-PU: Public, fully open, e.g. web; CO: Confidential, restricted under conditions set out in Model Grant Agreement; CI: Classified, Int = Internal Working Document, information as referred to in Commission Decision 2001/844/EC.

D2.2 EUXDAT e-Infrastructure Definition

Keywords:

Data Analytics, Big Data, e-Infrastructure, Architecture, Design, EUXDAT

Document Identification

Status Final Due Date 30/04/2018

Version 1.0 Submission Date 31/05/2018

Related WP WP2 Document Reference D2.2

Related Deliverable(s)

D2.1 Dissemination Level (*) PU

Lead Participant F. Javier Nieto (ATOSES) Lead Author F. Javier Nieto (ATOSES)

Contributors ATOSES, WRLS, USTUTT, ATOSFR, CERTH

Reviewers Marcela Doubkova (PESSL)

Karl G. Gutbrod (Meteoblue)

Page 2: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 2 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Document Information

List of Contributors

Name Partner

F. Javier Nieto ATOSES

Spiros Michalakopoulos ATOSES

Miguel Ángel Esbrí ATOSES

Karel Jedlicka WRLS

Nico Struckmann USTUTT

Fabien Castel ATOSFR

Dimitrios Moshou CERTH

Document History

Version Date Change editors Changes

0.1 01/03/2018 F. J. Nieto (ATOSES) Table of Contents

0.2 26/03/2018 F. J. Nieto (ATOSES) ToC update and section 2

0.3 09/04/2018 F. J. Nieto (ATOSES) Initial contributions to section 4 (architecture and components description)

0.4 13/04/2018 F. J. Nieto (ATOSES), M. A. Esbrí (ATOSES), K. Jedlicka (WRLS), F. Castel (ATOSFR), D. Moshou (CERTH)

Contributions to section 3

0.5 25/04/2018 F. J. Nieto (ATOSES), N. Struckmann (USTUTT)

Add section 6 contributions. Update content in section 3 (several features). Update architecture in section 4.

0.6 14/05/2018 F. J. Nieto (ATOSES) Update section 4 (architecture and initial sequence diagram) and section 5 (add several components description)

0.7 16/05/2018 F. J. Nieto (ATOSES), S. Michalakopoulos (ATOSES), M. A. Esbrí (ATOSES), K. Jedlicka (WRLS), F.

Complete section 3 (requirements table) and add content to section 5 (UML diagrams). Add conclusions. Update section 2 and section 6 (components deployment).

Page 3: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 3 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Castel (ATOSFR), D. Moshou (CERTH)

0.8 25/05/2018 F. J. Nieto (ATOSES), F. Castel (ATOSFR)

Updated sequence diagrams (section 4), list of users and completed section 5. Added executive summary.

0.9 31/05/2018 S. Palomares (ATOSES)

Quality review

1.0 31/05/2018 FINAL VERSION TO BE SUBMITTED

Quality Control Role Who (Partner short name) Approval Date

Deliverable leader F. Javier Nieto (ATOSES) 31/05/2018

Technical manager F.Castel (ATOSFR) 31/05/2018

Quality manager S. Palomares (ATOSES) 31/05/2018

Project Manager F. Javier Nieto (ATOSES) 31/05/2018

Page 4: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 4 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Table of Contents Document Information ............................................................................................................................ 2

Table of Contents .................................................................................................................................... 4

List of Tables ........................................................................................................................................... 6

List of Figures ......................................................................................................................................... 7

List of Acronyms ..................................................................................................................................... 8

1. Executive Summary ....................................................................................................................... 11

2. Introduction .................................................................................................................................... 12

2.1 Relation to other project work ................................................................................................ 12

2.2 Structure of the document ...................................................................................................... 13

3. EUXDAT Features ......................................................................................................................... 14

3.1 Initial Requirements Analysis ................................................................................................ 14

3.2 Main EUXDAT Features ....................................................................................................... 14

Support for Several Data Formats ..................................................................................... 15 3.2.1

Algorithms and Applications Management ....................................................................... 18 3.2.2

Data Management and Processing ..................................................................................... 20 3.2.3

Security and Users Management ....................................................................................... 21 3.2.4

Visualization and Interaction Capabilities ......................................................................... 22 3.2.5

Management of Computing and Storage Resources .......................................................... 23 3.2.6

Extreme Data Analytics as a Service ................................................................................. 24 3.2.7

3.3 Mapping Requirements and Features ..................................................................................... 25

4. EUXDAT Architecture .................................................................................................................. 31

4.1 High Level Architecture ......................................................................................................... 31

4.2 Main Actors ............................................................................................................................ 32

4.3 Main Components .................................................................................................................. 32

4.4 High Level Interactions .......................................................................................................... 33

Moving large data in EUXDAT ......................................................................................... 33 4.4.1

Defining a new data analysis in EUXDAT ........................................................................ 34 4.4.2

Running a data analysis in EUXDAT ................................................................................ 35 4.4.3

Page 5: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 5 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

4.5 Development Priorities and Roadmap .................................................................................... 37

5. Detailed Design of Main Components ........................................................................................... 39

5.1 EUXDAT Portal ..................................................................................................................... 39

5.2 Identity and Authorization Manager ...................................................................................... 40

5.3 Data and Algorithms Catalogue ............................................................................................. 41

5.4 Data and Algorithms Repository ............................................................................................ 42

5.5 Data Manager ......................................................................................................................... 43

5.6 SLA Manager ......................................................................................................................... 43

5.7 Orchestrator ............................................................................................................................ 44

5.8 Monitoring ............................................................................................................................. 46

6. EUXDAT Deployment................................................................................................................... 48

6.1 Deployment Infrastructure ..................................................................................................... 48

Deployment ........................................................................................................................ 48 6.1.1

Stages ................................................................................................................................. 49 6.1.2

Services .............................................................................................................................. 49 6.1.3

6.2 Components Deployment ....................................................................................................... 50

7. Conclusions .................................................................................................................................... 52

8. References ...................................................................................................................................... 53

Page 6: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 6 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

List of Tables Table 1: Requirements Traceability Matrix ____________________________________________________ 25

Page 7: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 7 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

List of Figures Figure 1: Files Management ________________________________________________________________ 21 Figure 2: Example of visualization ___________________________________________________________ 23 Figure 3: EUXDAT High Level Architecture ___________________________________________________ 31 Figure 4: Moving Large Data Sequence Diagram _______________________________________________ 33 Figure 5: Defining New Data Analysis Sequence Diagram ________________________________________ 35 Figure 6: Running Data Analysis Sequence Diagram ____________________________________________ 36 Figure 7: EUXDAT Portal High Level Architecture ______________________________________________ 39 Figure 8: I&A Manager High Level Architecture ________________________________________________ 40 Figure 9: D&A Catalogue High Level Architecture ______________________________________________ 41 Figure 10: D&A Repository High Level Architecture ____________________________________________ 42 Figure 11: Data Manager High Level Architecture ______________________________________________ 43 Figure 12: SLA Manager High Level Architecture _______________________________________________ 44 Figure 13: Orchestrator High Level Architecture _______________________________________________ 45 Figure 14: Monitoring High Level Architecture _________________________________________________ 46 Figure 15: EUXDAT Deployment ____________________________________________________________ 48 Figure 16: EUXDAT Deployment Stages ______________________________________________________ 49 Figure 17: EUXDAT Deployment Services _____________________________________________________ 50

Page 8: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 8 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

List of Acronyms

Abbreviation / acronym

Description

AMQP Advanced Message Queuing Protocol

API Application Programming Interface

ASTER Advanced Spaceborne Thermal Emission and Reflection Radiometer

AWS Amazon Web Services

CD Continuous Delivery

CEP Complex Event Processing

CI Continuous Integration

CoAP Constrained Application Protocol

C-SAR CARIS Spatial Archive

DEM Digital Elevation Model

DIAS Data and Information Access Services

Dx.y Deliverable number y belonging to WP x

EBDVF European Big Data Value Forum

EC European Commission

ECMWF European Centre for Medium-Range Weather Forecasts

EDI Electronic Data Interchange

EO Earth Observation

FTP File Transfer Protocol

GDPR General Data Protection Regulation

GRD Surfer Grid File

GUI Graphical User Interface

HPC High Performance Computing

HTTPS Hypertext Transport Protocol Secure

IoT Internet of Things

JPEG Joint Photographic Experts Group

Page 9: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 9 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Abbreviation / acronym

Description

JSON JavaScript Object Notation

JWE JSON Web Encryption

JWT JSON Web Token

KPI Key Performance Indicator

L1C Level-1C

LAI Leaf Area Index

LDAP Lightweight Directory Access Protocol

LPIS Land Parcel Identification System

MODIS Moderate Resolution Imaging Spectroradiometer

MQTT Message Queuing Telemetry Transport

MSI Mass Spectrometry Imaging

NDVI Normalized Difference Vegetation Index

OEM Object Exchange Model

OTM Open Transport Map

PDF Portable Document Format

QoS Quality of Service

Q&A Questions and Answers

REST Representational State Transfer

RGB Red, Green, Blue

RPAS Remotely Piloted Aircraft Systems

SLA Service Level Agreement

SLC SLiCe format

SOAP Simple Object Access Protocol

SSH Secure SHell

TIF Tagged Image File Format

TOSCA Topology and Orchestration Specification for Cloud Applications

UAV Unmanned Aerial Vehicle

UML Unified Modelling Language

VIIRS Visible Infrared Imaging Radiometer Suite

VM Virtual Machine

Page 10: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 10 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Abbreviation / acronym

Description

WMS Web Map Service

WP Work Package

XML eXtensible Markup Language

Page 11: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 11 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

1. Executive Summary This document provides a high level view of the EUXDAT e-Infrastructure. First of all, using the requirements collected, it describes the features that the project team has identified to be implemented in the e-Infrastructure. Taking into account such features, the document describes a high level architecture, together with the list of users that will make use of EUXDAT and the high level interactions among the identified components. In order to get more detail, the document also describes the internal design expected for each high level component, although this perspective is not very detailed yet. Finally, the document proposes a way to deploy the e-Infrastructure components, in order to have an operational version of EUXDAT.

Page 12: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 12 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

2. Introduction An e-Infrastructure such as EUXDAT has a lot of complexity in terms of its functionalities and characteristics. Therefore, it is necessary that we identify clearly the kind of functionalities to implement, the different actors that will make use of EUXDAT, the components that will be involved, how these components must interact to implement the functionalities and how EUXDAT will be deployed and made available in an operational environment. For doing so, the consortium has followed a top-down approach, starting with high level requirements in order to obtain low level designs. This document aims at analysing the different scenarios proposed and requirements identified in D2.1[1], in order to identify the main functionalities which are required (and who requires them). Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting some of the requested ones, so they will fit better with a generic usage of EUXDAT. Once the functionalities have been identified, the EUXDAT consortium has done an analysis in order to propose a high level architecture, which will articulate the implementation of the e-Infrastructure, taking into account functional areas, existing components, the original proposed architecture (at the Description of Action [2]) and expected evolution. This architecture is documented in this document, with especial attention on how high level components will collaborate. Finally, it is necessary to determine how the EUXDAT e-Infrastructure will operate and, therefore, this document includes an indication on how to deploy the e-Infrastructure.

2.1 Relation to other project work

The definition of the architecture is related to many other works done during the project. In fact, it is a centric activity, since such definition affects how things will work during the project and after it. The main relationships to take into account are the following:

• In WP2 (requirements gathering), which determines the needed functionality, and therefore constitutes the main input to the architecture definition. Moreover, gaps identified during the architecture definition may affect the way requirements are gathered in the future;

• In WP3 (detailed design and implementation of components), since the architecture will determine the main high level components belonging to this part of the implementation, and any design and implementation must be in line with the architecture;

• In WP4 (infrastructure platform), the relationship is the same as with WP3, requiring that detailed designs and implementations are in line with the architecture, which will also require alignment between architecture and available and/or feasible infrastructure;

• In WP5 (Integration and e-Infrastructure Provision to Pilots), activities related to the deployment of the e-Infrastructure are closely related with this document, since it describes

Page 13: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 13 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

how components should be deployed, so WP5 actions are expected to be in line and also provide feedback to the architecture document.

2.2 Structure of the document

The objective of this subsection is to present readers/reviewers the structure of the document, while presenting an insight before the core. This document is structured in five major chapters Chapter 3 presents the features that have been identified to be provided by the EUXDAT e-Infrastructure, based on the requirements collected in deliverable D2.1. Chapter 4 presents the high level architecture for EUXDAT, providing the first stone towards a design of the EUXDAT e-Infrastructure for implementing the identified features. Chapter 5 enters into more details in the high level architecture, providing some insights of the high level components identified. Chapter 6 addresses the deployment of the EUXDAT e-Infrastructure, presenting the kind of infrastructure needed and how the components will be deployed there. Chapter 7 presents the summary and conclusions derived from the definition of the EUXDAT architecture.

Page 14: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 14 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

3. EUXDAT Features This section explains the process followed from requirements to features and it describes the features that the consortium has identified for the EUXDAT e-Infrastructure.

3.1 Initial Requirements Analysis

The initial requirements analysis was done by a brainstorming during the kick-off meeting and subsequent teleconferences among partners related to pilots and WP2. The result of such analysis was the D2.1 deliverable [1], in which the requirements were listed and the scenarios of the pilots were defined. The process of D2.1 compilation followed an (incremental and repeated if necessary) agile approach:

• Three EUXDAT were pilots presented • Partners with agriculture domain knowledge described initial ideas related to EUXDAT Pilots. • Ideas were assessed and suitable ones elaborated into scenarios. • Above mentioned steps were described in D2.1 chapters 2 and 3. • Pilots and their related scenarios were then analysed from the perspective of informational

(data) requirements; functional and non-functional requirements (D2.1 chapter 4, section 4.1, and annex 1).

• The pilot specific requirements were then analysed and platform requirements were extracted (D2.1 chapter 4, section 4.2, and annex 2).

Once we had these requirements (categorized, prioritized and correctly described), an analysis was done about the features that should be provided (and how they should be provided) in order to fulfil the requirements received. We also took into account our own expertise in the field, thinking about features that would be necessary but were not requested by the partners involved in the pilots (perhaps because they do not expect to have certain issues solved by a concrete tool or because they just did not know that some useful features could be provided). For instance, this is the case, usually, of tools for community building (i.e. forums and Questions & Answers tools), which are important for sharing information and to enable interactions with stakeholders. The proposed features have been confirmed and validated with EUXDAT pilot owners, as a way to guarantee that what we propose is useful for them. In the future, this vision will be also validated with external stakeholders, who will be also engaged in order to retrieve more requirements and features (in the context of WP2 and WP6).

3.2 Main EUXDAT Features

This section analyses the current list of requirements, in order to extract those features that EUXDAT should provide to the scientists.

Page 15: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 15 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Support for Several Data Formats 3.2.1

The EUXDAT specific data types, formats and sources described below are key features of the data used in the project.

Structured data 3.2.1.1

Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases, spreadsheets, and data in forms of events such as sensor data. Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. This includes defining what fields of data will be stored and how that data will be stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input (number of characters; restricted to certain terms such as Mr., Ms. or Dr.; M or F).

Semi-structured data 3.2.1.2

Semi-structured data is a cross between structured and unstructured data. It is a type of structured data, but lacks the strict data model structure. With semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data doesn't have a rigid structure. For example, word processing software now can include metadata showing the author's name and the date created, with the bulk of the document just being unstructured text. Emails have the sender, recipient, date, time and other fixed fields added to the unstructured data of the email message content and any attachments. Photos or other graphics can be tagged with keywords such as the creator, date, location and keywords, making it possible to organize and locate graphics. XML and other markup languages are often used to manage semi-structured data. Semi-structured data is therefore a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. In semi- structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important. Semi-structured data are increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi- structured data.

XML and other markup languages, email, and EDI are all forms of semi-structured data. OEM (Object Exchange Model) was created prior to XML as a means of self-describing a data structure. XML has been popularized by web services that are developed utilizing SOAP [3] principles. Some types of data described here as "semi-structured", especially XML, suffer from the impression that they are incapable of structural rigor at the same functional level as Relational Tables and Rows. Indeed, the view of XML as inherently semi-structured (previously, it was referred to as "unstructured") has handicapped its use for a widening range of data-centric applications. Even documents, normally thought of as the epitome of semi- structure, can be designed with

Page 16: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 16 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

virtually the same rigor as database schema, enforced by the XML schema and processed by both commercial and custom software programs without reducing their usability by human readers.

In view of this fact, XML might be referred to as having "flexible structure" capable of human- centric flow and hierarchy as well as highly rigorous element structure and data typing. The concept of XML as "human-readable", however, can only be taken so far. Some implementations/dialects of XML, such as the XML representation of the contents of a Microsoft Word document, as implemented in Office 2007 and later versions, utilize dozens or even hundreds of different kinds of tags that reflect a particular problem domain - in Word's case, formatting at the character and paragraph and document level, definitions of styles, inclusion of citations, etc. - which are nested within each other in complex ways. Understanding even a portion of such an XML document by reading it, let alone catching errors in its structure, is impossible without a very deep prior understanding of the specific XML implementation, along with assistance by software that understands the XML schema that has been employed. Such text is not "human-understandable" any more than a book written in Swahili (which uses the Latin alphabet) would be to an American or Western European who does not know a word of that language: the tags are symbols that are meaningless to a person unfamiliar with the domain.

JSON or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML. JSON has been popularized by web services developed utilizing REST principles. There is a new breed of databases such as MongoDB and Couchbase that store data natively in JSON format, leveraging the pros of semi-structured data architecture.

Unstructured data 3.2.1.3

Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in “field” form in databases or annotated (semantically tagged) in documents. Unstructured data can't be so readily classified and fit into a neat box: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some. IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. Computer World states that unstructured information might account for more than 70%–80% of all data in organizations.

Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structures that exist in all forms of human communication. Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address

Page 17: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 17 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

ambiguities and relevancy-based techniques then used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analogue data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, etc…) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data".

Data Sources 3.2.1.4

3.2.1.4.1 Sensor data Within the EUXDAT pilots, several key parameters will be monitored through sensorial platforms and sensor data will be collected along the way to support the project activities. Two types of sensor data have been already identified and namely, a) IoT data from in-situ sensors and telemetric stations, b) sensors data from machinery and imagery data from unmanned aerial sensing platforms (drones).

In all the cases, this data is created in the concrete location which is going to be analysed and, therefore, it allows also some real-time analysis where it is generated, as a way to filter and pre-process information with the purpose to detect some context or to reduce the amount of information to store remotely.

We will need to take into account that these data sources may use different IoT-related protocols, such as CoAP, AMQP, MQTT, oneM2M and other proprietary interfaces.

3.2.1.4.2 Drone data A specific subset of sensor data generated and processed within EUXDAT project is images produced by cameras on-board drones or RPAS (Remotely Piloted Aircraft Systems). In particular, some EUXDAT pilots will use optical (RGB), thermal or multispectral images and 3D point-clouds acquired from RPAS. The information generated by drone-airborne cameras is usually Image Data (JPEG or JPEG2000).

The multispectral sensor acquires individual pictures from the 125 spectral channels in .RAW format, which are downloaded from the RPAS and processed into .TIF files (16 bits), which are then processed to produce a 125-bands .TIF mosaic (pixels contain Digital Numbers: 0-255).

3.2.1.4.3 Remote-Sensing/Geospatial data The EUXDAT pilots will collect earth observation (EO) data from a number of sources which will be refined during the project. Currently, it is confirmed that the following EO data will be collected and used as input data (from the Copernicus Open Access Hub at https://scihub.copernicus.eu/):

• Sentinel-1, C-SAR; SLC and GRD formats;

• Sentinel-2, MSI; L1C format.

Characteristics of such data are wide area covered, varying geographical projections and spatial resolution, semi-regular time intervals (with frequent missing of time steps or dates (partially caused by cloud coverage which may interrupt surface data collection).

Page 18: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 18 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

3.2.1.4.4 Land Use and Administrative Data In addition to EO data, EUXDAT will utilise other geospatial data from EU, national, local, private and open repositories including Land Parcel Identification System data, cadastral data, Open Land Use map [4], Urban Atlas and Corine Land Cover, Proba-V data (www.vito-eodata.be). Datasets are available as points, shapefiles, polygons, raster data and others.

3.2.1.4.5 Meteorological Model Data The meteorological data will be collected mainly from EO systems based and will be collected from European data sources such as COPERNICUS products, EUMETSAT H-SAF products, but also other EO data sources such as VIIRS and MODIS and ASTER will be considered. As complementary data sources, the weather forecast models output (ECMWF) and the regional weather services output usually based on ground weather stations can be considered according to the specific target areas of the pilots.

Spatial Resolution 3.2.1.5

Point data are available for individual positions on the Earth surface, characterised by geographic coordinates, as well as metadata, time of collection and certain variables. Weather station data are examples of point data with continuous time series of data being collected for certain variables, such as temperature, precipitation etc.

On the other hand, the polygon data are characterised by varying shapes (rectangle, ovals, circles, Polygons with multiple points and line intersections), and usually stored as vectors or shape-files.

Finally, raster data are stored in regular grids of a certain geographical projection, and density.

Algorithms and Applications Management 3.2.2

Data analytic algorithms 3.2.2.1

Proposed pilots in the D2.1 document imply configuration, deployment and execution of several data analytic algorithms. Describing in detail these algorithms is not the point of the present document, though we can mention:

• For Pilot 1 " Land Monitoring and Sustainable Management ": - Atmospheric correction of Multispectral Sentinel bands - Calculation of NDVI from the 12 Sentinel multispectral bands - Calculation of Hyperspectral indices relevant for stress and disease - Availability of Sentinel-2 and Sentinel-1m data at field scale/for a given polygon - Availability of the time-series of Sentinel-1 and Sentinel-2 data for a given

pixel/given polygon - Dynamic data analyses of EO data and in-situ data - 2D visualization of time-series over selected pixels, provision of interfaces, toolkits

• For Pilot 2 "Energy efficiency analysis":

Page 19: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 19 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

- Collecting and merging data from various source - Calculation of yield productivity zones

• For Pilot 3 "3D farming": - Calculation of zone related morphometric statistic for fields categorization into

different productivity zones - Calculation of the water influence to weather conditions

The first conclusion that can be drawn is that there are many algorithms, working on different kind of data, and performing very different activities. These algorithm will be implemented as several distinct applications that will used a wide range of tools, libraries and pre-existing pieces of code, most probably involving several languages (Python, R, Java, C…). Therefore, it is necessary that EUXDAT supports several kinds of applications written in different languages and with different ways to integrate analytics algorithms (i.e. plug-in systems). Additionally, such algorithms and applications should be exposed in such a way it will be easy for stakeholders to access their functionalities. This can be done with a tool like a marketplace, which can be used for commercializing the software while, at the same time, it facilitates the search and usage of applications.

Algorithms managed as containerized applications 3.2.2.2

The EUXDAT platform should not constrain the application developers to use any specific technical environment. Using containerization technology is the best way to achieve this goal. Encapsulating each application and all the technical artefacts it requires into a container allows them to coexist on the platform with a similar interface and clear manipulation procedures. Docker can be used as the containerization tool upon which the application management of the platform would be built. All algorithms are packaged into an image that is pushed to a private image registry. The role of the application management feature is to maintain a register from which application can be started on demand, monitored and stopped. It should allow easy retrieving of the data produced by the application it manages (results and logs). It should enable easy access to any service provided by the platform: data access, remote processing capability, utility service… Requests to the application manager could be issued by the platform users through the frontend module, or through a workflow manager. Thus it should expose a clear and open API. To fulfil all these requirements, we use a container management tool. Such tool should provide:

• Service discovery mechanisms to easily retrieve images from a central register; • A container manipulation REST API to dynamically start and stop containers; • Ways to easily configuring shared folder between containers or between containers and the

host machine so that data can be exchanged • Network configuration in the container ecosystem so that communication between them can

be easily configured and dynamically adapted Kubernetes can be used as the container orchestration tool of the EUXDAT platform. A web service layer is built on top of Kubernetes to fulfil the specific requirement of the platform.

Page 20: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 20 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Data Management and Processing 3.2.3

Remote Data 3.2.3.1

3.2.3.1.1 Data Cataloguing One of the EUXDAT platform main goals is to provide to data analytics algorithms an easy access to environmental Copernicus data [5]. To do so the EUXDAT platform has to connect to several data sources. We can mention:

• The Copernicus Land Monitoring Service for Copernicus core data on the land domain • The Copernicus Atmosphere Monitoring Service for Copernicus core data on the

meteorological domain • The Mundi service for earth observation Sentinel data and potentially core service data (data

offer is not fixed at the moment this document is written) These data providers provide access to data catalogues, providing metadata about all the offered dataset. The EUXDAT platform hosts its own data catalogue. A subset of targeted dataset is defined and imported from the data provider catalogues to the EUXDAT catalogue.

It is important to highlight that Mundi is part of the Copernicus Data and Information Access Services (DIAS), so EUXDAT will interact with such platform in order to simplify data access and data management.

3.2.3.1.2 Data Ingestion All remote data is not stored permanently on the platform. When a user sends to execution a pilot application that requires remote data over a certain period of space and time, it is the responsibility of the platform to retrieve the data and make it available for the application. To do so, proper download requests have to be sent prior to the application execution.

In order to avoid repetitive download of the same data, a data cache mechanism is required. The downloaded files are stored on a local disk. Their description (dataset ID, spatial and temporal coverage, location on the local disk…) have to be stored in a database. This database is used to avoid redundant downloading. It is interrogated to find which cached files can be reused to answer a data request.

Local Data 3.2.3.2

Some data used by pilot application are property of the users (UAV data for instance). These data have to be permanently stored locally, thus users need to be able to upload, store and manage files on the platform. A file manager in the backend should enable users to do so by implementing list, upload, download and delete functions.

Page 21: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 21 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Figure 1: Files Management

Users on the platform need to be able to keep private data, access public data and share data with other users. The platform should give to an identified user access to three virtual volumes:

• a private workspace with read/write rights; • a public data volume with only read rights; • a shared data volume where other users can push and write data to.

The user private workspace stores two kinds of data: the data to be processed and the results produced by the pilot application executions. The identified user accesses the folder storing them it through the frontend where he can upload, download, and delete files.

The public static data volume contains test files that can be use by new user of the platform willing to test the proposed algorithms. It is available to any identified user with download-only rights.

Security and Users Management 3.2.4

User Management 3.2.4.1

The EUXDAT e-infrastructure is composed of several components distributed on different environments (Cloud and HPC environments). Users should be able to log in one single time on the platform and be able from that point to access any functionality on the e-infrastructure. If specific login has to be performed on a component, it should be transparent for the user. In order to enable this feature (also known as Single-Sign-On), we will check different alternatives, in order to select one which is compatible with all the software pieces to be integrated during the project.

To centralize the user management across the platform we use a LDAP server that store a list of users with their login, password and their role and privileges. We will also check which other information should be stored about the users, so it will be shared through the different components of EUXDAT. We will also check how to store sensitive information, such as credentials for using HPC and Cloud resources (i.e. Amazon accounts information). All these open questions will be discussed in the context of WP4, taking into account the new European GDPR regulation, that EUXDAT must be compliant with.

Page 22: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 22 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Security 3.2.4.2

Security in the EUXDAT platform is based on the JWT (JSON Web Token) technology. A token is a means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object encrypted using JSON Web Encryption (JWE). The token encapsulates the user credentials and a validity date.

Privacy 3.2.4.3

The EUXDAT platform can store private data that owners do not want to share with the other platform users. The platform has to strictly avoid access to this private data to user that do not have the adequate access rights. Mechanisms are built using the centralized user base and the file manager module to avoid direct access to private data. A focus is also put on the application management so that applications cannot access on the background private data during their execution.

Visualization and Interaction Capabilities 3.2.5

Our visualisations are built upon a HSLayers library (http://ng.hslayers.org ) with a Cesium plugin (https://cesiumjs.org ). See details on Github in folders examples/3d-oluand components/cesium at https://github.com/hslayers/hslayers-ng repository. A sample app: Open Land Use Perspective Visualization

(http://ng.hslayers.org/examples/3d-olu)

This application visualises the Open Land Use Map (http://sdi4apps.eu/open_land_use) on top of the EU-DEM terrain model in a perspective view. Displaying the data in 3D environment helps the user to explore an area of interest in more natural way than a traditional map. Moreover, the user can explore other datasets, such as OTM (http://opentransportmap.info) or Smart Points Of Interests (http://sdi4apps.eu/spoi) or even add a web map service of his choice (as long as the WMS is in WGS84 coordinate system) to make custom 3D mash-ups. The technical details can be found at http://www.plan4all.eu/2017/08/3d-open-land-use-team. The application was developed during the INSPIRE HACK 2017 (http://www.plan4all.eu/inspire-hack-2017) and further customized for EBDVF (http://www.european-big-data-value-forum.eu).

Page 23: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 23 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Figure 2: Example of visualization

Management of Computing and Storage Resources 3.2.6

There are several aspects that must be taken into account when performing computing and storage resources management. First of all, EUXDAT has committed to use Cloud and HPC resources together, since it is expected to be more useful than using only HPC or only Cloud resources. Secondly, EUXDAT is required to optimize the usage of the resources with the support of monitoring and profiling. Critical Big-Data environments need some especial aspects to be addressed and guaranteed, such as performance (speed of data retrieval and processing), load-balancing and operation in a distributed computing environments. In the first case, EUXDAT has to provide an orchestration mechanism that will allow sending tasks to the underlying infrastructure in a transparent way to EUXDAT users. That means that it is necessary to hide the complexity of the resources usage. That can be done by using some high level language for expressing the tasks to be carried out and the resources that might be required, like TOSCA [6]. Such language will facilitate the description of how applications work, so developers can prepare parametrized TOSCA recipes and users will need only to fill in parameters and submit the corresponding recipe to the orchestrator, enabling the Extreme Data Analytics as a Service feature. From the orchestration perspective, EUXDAT will be able to send jobs to HPC systems (like the HLRS one, managed with Torque) and to use VMs in different Cloud solutions (such as OpenStack). We will use containers in both contexts, in order to facilitate an integrated vision of Cloud+HPC, so the software can be used without taking care of that. In the case of HPC resources, it must be considered that batch systems are not intended to run 24/7 services but to compute intensive workloads. There is a command line-based access to batch-system front-ends via SSH, while data can also be staged in/out via GridFTP. Usually, workloads are submitted by users via the batch-system’s cli tool, e.g. qsub. HLRS’ system (depicted in Annex 1), for instance, is managed with PBS/Torque and Moab as sophisticated scheduler capable to respect SLAs defined by administrators. Inside the HPC environment, on the compute nodes users cannot run any applications which require root permissions, all is solely executed in the user space, something that represents a limitation to be taken into account.

Page 24: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 24 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

In the second case, as optimizing requires information, EUXDAT will set up a monitoring system that will be retrieving information about the resources available (load, availability, queues status, time to run tasks, etc.) and about how they are used by the software running on top of them (saturation of resources, impact in the network, scalability achieved, etc.). Such information will be used in order to understand the behaviour of applications and their impact in the infrastructure, in such a way it will be possible to categorize which tasks perform better running in Cloud and which tasks perform better running in HPC. Then, an algorithm will take care of those profiles for splitting tasks adequately, without user intervention. In the case of the monitored information, it will be possible for users to access it through graphical user interfaces, since it might be useful for application owners and for resources providers. Depending on the user profile, it will be possible to access certain data or another one (as we do not want end users to be able to see HPC queues, for instance). Finally, we aim at managing SLAs as a way to guarantee certain quality for end users when running data analyses in the EUXDAT e-Infrastructure, even having the possibility to select different levels of quality (with different costs associated). The monitoring information will be crucial in order to manage SLAs and to detect potential SLA breaches, which might require some adaptation from the resources management perspective.

Extreme Data Analytics as a Service 3.2.7

The idea behind Extreme Data Analytics as a Service is that end users will be able to execute complex Big Data analyses without needing to know about the complexities related to data sources connection, data movement and resources management. EUXDAT aims at offering a web portal which will facilitate users preparing their analyses and launching them. Such portal will facilitate the selection of the data to use, the algorithms to run and visualization of results. For doing so, we will identify and define data analytics workflows, implementing templates that end users just need to fill in with the data and tools they want to use. The GUI will show them how the analytics is progressing and the results that are being generated, also linked to the EUXDAT monitoring feature. Additionally, as a way to provide a complete ‘service’ such portal should include tools that can facilitate interaction between the EUXDAT community members, the development team and the people operating the EUXDAT e-Infrastructure. Moreover, such portal may include information about training, tutorials, etc… so interested parties will be able to learn about how to use EUXDAT. Also, it should provide some interfaces that allow contacting EUXDAT providers such as, for instance, entities providing consultancy services (this could be a kind of yellow pages functionality). The main aspect to consider in the case of this feature is the support to big data analytics tools, so it will be possible to use them as needed. EUXDAT will support, at least, two kinds of tools: data in-motion analysis (real-time analytics, with a CEP-like tool) and data at-rest analysis (typical Big Data loads with tools such as Spark). EUXDAT will perform any required adaptation, so they will be compatible with the HPC+Cloud approach, also evolving the capabilities of the tools whenever necessary and/or convenient (i.e. add new analytics operations to a parallelized CEP).

Page 25: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 25 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

In order to enable a service-like exploitation approach, the needed interfaces will be provided, so it will be easy to use the analytics tools, and with enough flexibility so they can be used in a similar way for addressing different problems.

3.3 Mapping Requirements and Features

This section describes how requirements captured in EUXDAT D2.1 Description of Proposed Pilots and Requirements match to Main EUXDAT Features described in chapter 3.2 of this deliverable. The legend is as follows: green colour marks a full match, yellow colour means partial match and white colour goes for no match.

Table 1: Requirements Traceability Matrix

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

EUXDAT-REQ-Pilot1-DATA-001

Level-1C multi-spectral imaging products from the Sentinel-2

EUXDAT-REQ-Pilot1-DATA-002

UAV-enabled hyperspectral imagery

EUXDAT-REQ-Pilot1-DATA-003 Climate data

EUXDAT-REQ-Pilot1-DATA-004

Sentinel-1 GRD data products from the Sentinel-1 available at field scale

EUXDAT-REQ-Pilot1-DATA-005

Dynamic cropland mask, crop type map and LAI from Sen2-Agri system

EUXDAT-REQ-Pilot1-DATA-006

Copernicus European Digital Elevation Model (EU-DEM), version 1.1

EUXDAT-REQ-Pilot1-DATA-007 Land use map

Page 26: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 26 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

EUXDAT-REQ-Pilot1-DATA-008 Soil map

EUXDAT-REQ-Pilot1-001

Atmospheric correction of Multispectral Sentinel bands

EUXDAT-REQ-Pilot1-002

Calculation of NDVI from the 12 Sentinel multispectral bands

EUXDAT-REQ-Pilot1-003

Calculation of Hyperspectral indices relevant for stress and disease

EUXDAT-REQ-Pilot1-004

Availability of Sentinel-2 and Sentinel-1m data at field scale/for a given polygon

EUXDAT-REQ-Pilot1-005

Availability of the time-series of Sentinel-1 and Sentinel-2 data for a given pixel/given polygon.

EUXDAT-REQ-Pilot1-006

Dynamic data analyses of EO data and in-situ data

EUXDAT-REQ-Pilot1-007

2D visualization of time-series over selected pixels, provision of interfaces, toolkits

EUXDAT-REQ-Pilot1-008

Installation of Sen2Agri system and provision of Dynamic cropland mask, crop type map and LAI

EUXDAT-REQ-Pilot2- Open Land Use Map

Page 27: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 27 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

DATA-001

EUXDAT-REQ-Pilot2-DATA-002

Land Parcel Identification System (LPIS)

EUXDAT-REQ-Pilot2-DATA-003 Copernicus Sentinel 2 data

EUXDAT-REQ-Pilot2-DATA-004 EU-DEM

EUXDAT-REQ-Pilot2-001

Collecting machinery tracking data

EUXDAT-REQ-Pilot2-002

Collecting of agro meteorological data

EUXDAT-REQ-Pilot2-003

Calculation of yield productivity zones

EUXDAT-REQ-Pilot3-DATA-001 EU-DEM

EUXDAT-REQ-Pilot3-DATA-002 Hydrology for EU

EUXDAT-REQ-Pilot3-DATA-003 Actual weather

EUXDAT-REQ-Pilot3-DATA-004 Historic weather

EUXDAT-REQ-Pilot3-001

Zone related morphometric statistic

EUXDAT-REQ-Pilot3-002

Water influence to weather conditions

EUXDAT-REQ-Pilot3-003 3D visualization

EUXDAT-REQ-PLATF-001

Support for various HPC and Cloud providers

Page 28: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 28 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

EUXDAT-REQ-PLATF-002

Monitor HPC and Cloud resources

EUXDAT-REQ-PLATF-003

Applications monitoring and profiling

EUXDAT-REQ-PLATF-004

Adequate operation of the platform

EUXDAT-REQ-PLATF-005

Optimize data movement

EUXDAT-REQ-PLATF-006

Support security and privacy in data management

EUXDAT-REQ-PLATF-007

Automated deployment and execution of applications

EUXDAT-REQ-PLATF-008

API access to pilots' data and services

EUXDAT-REQ-PLATF-009

User management

EUXDAT-REQ-PLATF-010

Access sensor observations

EUXDAT-REQ-PLATF-011

Support information modelling

EUXDAT-REQ-PLATF-012

Support integration of meta-information

EUXDAT-REQ-PLATF-013

Compliance with INSPIRE specifications

EUXDAT-REQ-PLATF-014

Compliance with GEO/GEOSS specifications

Page 29: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 29 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

EUXDAT-REQ-PLATF-015

Integrate Web map services

EUXDAT-REQ-PLATF-016

Multiple Data Centers in the Cloud

EUXDAT-REQ-PLATF-017

Cloud Data Storage

EUXDAT-REQ-PLATF-018

Dependability

EUXDAT-REQ-PLATF-0219

Big Data Management

EUXDAT-REQ-PLATF-020

Identity Management & Access control

EUXDAT-REQ-PLATF-021

Scalability – Users growth

EUXDAT-REQ-PLATF-022

Scalability – Data growth and complex analytics

EUXDAT-REQ-PLATF-023

Data decentralization

EUXDAT-REQ-PLATF-024

Parallel data stream processing

EUXDAT-REQ-PLATF-025

Reduction in energy consumption by improved processing algorithms

EUXDAT-REQ-PLATF-026

Use of efficient hybrid architectures

EUXDAT-REQ-PLATF-027

Visualization of large amounts of data

EUXDAT-REQ-PLATF-028

Support of different formats for visualization

Page 30: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 30 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Requirement ID Requirement Name Data

For

mat

s Su

ppor

t

Alg

orith

ms

and

App

licat

ions

Mgt

. Da

ta M

gt. a

nd

Proc

essin

g Se

curit

y an

d Us

ers

Mgt

. Vi

sual

izatio

n an

d In

tera

ctio

n

Man

agem

ent o

f Re

sour

ces

Extre

me

Data

A

naly

tics

as a

Ser

vice

EUXDAT-REQ-PLATF-029

Provide rich user interfaces for the interactive visualization

EUXDAT-REQ-PLATF-030

Render high resolution data in N arbitrary dimensions

EUXDAT-REQ-PLATF-031

Personalised end-user-centric reusable data visualisation

EUXDAT-REQ-PLATF-032

Detection of abnormal sensor measurements

EUXDAT-REQ-PLATF-033

Use of high performance computing techniques to the processing of extremely huge amounts of data

EUXDAT-REQ-PLATF-034

Heterogeneous data aggregation and normalization

EUXDAT-REQ-PLATF-035

Verification of data integrity and veracity

Page 31: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 31 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

4. EUXDAT Architecture This section introduces the high level architecture of the EUXDAT e-Infrastructure, based on the defined features. It also describes the interactions among high level component in order to implement some of the most important features.

4.1 High Level Architecture

Taking into account the identified features, we have defined a high level architecture. These components can be mapped with the features mentioned. Basically, the aim of the proposed architecture is to group together similar features in order to get a modular and extensible architecture.

Figure 3: EUXDAT High Level Architecture

The different colours aim at indicating the nature of the components. The green colour is used for the component related to the web interfaces, while the blue one is for the security-oriented component. On the other hand, the yellow colour is used for those components heavily involved in the data-related features, while the red colour is for the components implementing features related to the resources provision and management.

Page 32: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 32 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

4.2 Main Actors

Taking into account the list of features available and the components already identified, we have defined a list of users for the EUXDAT e-Infrastructure, representing roles to be played when using EUXDAT.

• Administrator: It refers to those people in charge of the administration of the EUXDAT e-Infrastructure. They are able to change configurations, support users and manage their accounts. They also have access to all the features of the e-Infrastructure.

• Developer: This role represents those people who are providing content to the e-Infrastructure, meaning that they will upload and publish applications, algorithms and data. They will be able also to access certain monitoring information related to their creations.

• End User: This role is, basically, for searching data and for running data analyses through the EUXDAT Portal. They will have no access to publishing mechanisms.

4.3 Main Components

The main components identified are the following:

• EUXDAT Portal: It is the entrance point to EUXDAT functionalities. It provides a web GUI which gives access to the workflow execution tool, monitoring of data analytics execution, the catalogue and other useful tools;

• Identity and Authorization Manager: This component is responsible of managing user accounts and managing access to the functionalities and data in EUXDAT, according to security policies and to the rights granted to each user;

• Data & Algorithms Catalogue: It keeps a record of all the algorithms, applications and datasets which are available in EUXDAT;

• Data & Algorithms Repository: This component deals with the storage of datasets, algorithms and images, in general, that will be used for running data analyses;

• Data Manager: It is the component in charge of moving data to the proper location. It will configure and operate extraction APIs for accessing several data sources. For doing so, it also has all the data connectors that are necessary;

• SLA Manager: It agrees on quality attributes to fulfil and the values to be meet for each attribute. It also retrieves information about the monitoring of such attributes in order to detect SLA breaches;

• Orchestrator: It deals with the management of resources, mainly from the functional perspective, deploying the algorithms and the corresponding data in the optimal location. It also deals with the application profiles generation and management;

• Monitoring: It retrieves information about the resources execution and status, as well as about the algorithms execution and datasets status.

Page 33: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 33 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

4.4 High Level Interactions

In this subsection, we have described the interactions between high level components for implementing a few key features of the EUXDAT e-Infrastructure: moving large datasets (from a location to another one), define a new application or algorithm for data analytics and, finally, the execution of such kind of algorithm using input data selected by the user.

Moving large data in EUXDAT 4.4.1

This feature allows a user to move some dataset from a location to a different one. Such movement is considered a copy of the original dataset, which might be done for backup or for analysis purposes. It may happen when a user wants to keep a copy of the data to be used in a certain location. For instance, a user may request time-series of Sentinel data from last year for a given field. First of all, the user will enter the EUXDAT Portal, in order to access the data catalogue interface. In this case, we assume the user is already logged in, so there is no need to describe the steps for login (already included in section 4.4.3).

Figure 4: Moving Large Data Sequence Diagram

Once the user has access to the D&A Catalogue, this component checks the credentials of the user with the I&A Manager, in order to provide access to private datasets, if necessary. Then, the user performs a search of datasets, getting a list. The user will select one of the datasets and access the metadata available in the D&A Catalogue. Such metadata will show the location of the different copies and their format. Then, the user will request to

Page 34: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 34 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

create a copy of the dataset in another location, indicating the target location. It may also request to get part of it, as in the mentioned example about a concrete piece of Sentinel data. The D&A Catalogue will, then, request the Data Manager to create a copy of the data, transferring all the information to the target location. If such location is the EUXDAT local repository, the Data Manager will store it in the target location. Also, considering its utility, the Data Manager may decide to create a local copy in order to keep it as cache, at the D&A Repository. At least, the EUXDAT e-infrastructure will provide one cloud data storage and one HPC data storage but we can imagine other storage solutions according to the needs of the scenarios (local FTP storage for very big datasets, different cloud solutions…). Eventually, the user may provide additional parameters in order to move only a subset of the dataset. The nature of the parameters depends on the dataset (latitude, longitude, depth/altitude, time, variable…). The system automatically validates these parameters according to the dataset metadata (available values, range, type…). Finally, after the operation has succeeded, the Data Manager will inform the D&A Catalogue, which will update its metadata, indicating the existence of the new copy of the data and letting now to the user that the operation was successful.

Defining a new data analysis in EUXDAT 4.4.2

When adding new algorithms and application to EUXDAT, we assume that the user wants to make available new code for the e-Infrastructure. This may happen by uploading source code (so the CI and CD capabilities will be used) or uploading a complete image of the application (generated with executable files). In any case, the user will include the information required by the Orchestrator for enabling its right deployment and execution. In this case, we will assume that the end user is uploading the source code of the application/algorithm. The first step is to interact with the EUXDAT Portal, in order to get access to the Marketplace functionality (by means of the D&A Catalogue). Once the user has access to the right interface, it will request the D&A Catalogue to create a new product. Such action will require the I&A Manager to confirm the user has the required rights to do so. In that moment, the D&A Catalogue will show the interface for providing all related data and for the upload of the code. Once provided by the user, the D&A Catalogue will upload the code to the D&A Repository, where CI/CD will be activated, performing code compilation and the generation of the images. Finally, when this operation is completed, the user will have the opportunity to publish the new application/algorithm, so it will be accessible for the rest of users of EUXDAT, interacting with the D&A Catalogue. The D&A Catalogue will allow to exhibit the kind of offer available for users (i.e. provided by free, require some fee when using it, etc.). From that moment on, other users will have the opportunity to use it.

Page 35: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 35 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Figure 5: Defining New Data Analysis Sequence Diagram

Running a data analysis in EUXDAT 4.4.3

When executing data analyses in EUXDAT, all the main components identified in the high level architecture are involved. It is important to highlight that this case assumes that the input data is already available in the location where the computation will be carried out, so no data movement of input datasets is required (such movement could be done by the user as stated in section 4.4.1). The process begins when the user logs in in the system through the EUXDAT Portal. The credentials provided are checked by the I&A Manager and the user is validated, granting access to the portal features. Once the user is logged in, he/she will select the option to run a data analytics algorithm from the interface, providing the configuration required (i.e. quality constraints such as response time or queuing time, how many nodes and cores to use, security constraints to be applied, required parameters of the selected algorithms, etc…). Once such parametrization is done, it is necessary to select the datasets to be used as inputs. The EUXDAT Portal will access the D&A Catalogue in order to enable data search. The user will use the D&A Catalogue interface (which provides also metadata about the datasets) for finding and selecting the datasets to be used. Once everything is ready, the User may launch the analysis using the EUXDAT Portal interface. When the analysis is launched, the EUXDAT Portal requests the Orchestrator to take the control of the execution. The Orchestrator will check that the user is authorized to perform the operation and it will retrieve the resources usage credentials (Cloud and HPC infrastructures credentials) from the I&A

Page 36: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 36 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Manager. Before allocating the right resources, the Orchestrator will require the Monitoring to provide information about the current status and historical information of the infrastructures to be used. With such information, the Orchestrator will determine which tasks will run in which infrastructure (mixing Cloud and HPC resources). Taking into account the quality constraints provided by the User and the internal information, it will agree with the SLA Manager on the QoS KPIs to fulfil. Once such agreement exists, the SLA Manager will require the Monitoring to activate the measurement of the required metrics, in order to check whether the SLA is met. Just before the execution, the Orchestrator will request the Data Manager component to move the data to be used to the adequate locations, so the computation can start. The Data Manager will retrieve the datasets from the D&A Repository and it will move them to the adequate location. Usually, we can assume that the images with the software to be run are already in the target infrastructure. If this is not the case, the Orchestrator would retrieve the images from the D&A Repository and would deploy them adequately.

Figure 6: Running Data Analysis Sequence Diagram

Page 37: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 37 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Once everything is ready, all the tasks are executed step by step and in the locations required by the Orchestrator. Any additional data movement will be managed by the Orchestrator and the Data Manager. At the end of the execution, the Orchestrator will take the produced results (datasets) and it will require the Data Manager to move the data to the corresponding location (as selected by the User) of the D&A Repository. After such step, the EUXDAT Portal will retrieve feedback from the Orchestrator confirming the status of the execution and the result generated. Additionally, it is possible to retrieve periodic information about the execution. For doing so, the EUXDAT Portal may request information periodically to the Orchestrator and to the Monitoring components. In fact, the EUXDAT Portal will integrate an interface which will show monitoring information, in case Users want to access to the information.

4.5 Development Priorities and Roadmap

The EUXDAT e-Infrastructure will have three releases (M12, M24 and M32), so we have prioritized features implementation according to the importance of the functionalities and the end users’ needs (expressed in the requirements). As a result, there is a roadmap that indicates how development actions should be organized during the following months, depending on the release. In the case of the release v1 (M12), the proposed features for implementation are:

• Initial version of the EUXDAT Portal, with users management able to launch data analyses with simple interfaces;

• Initial version of the Orchestrator, able to run tasks in the HLRS infrastructure; • Availability of a catalogue for datasets, so it will be possible to publish and retrieve metadata; • Initial version of the repository for code, datasets and images; • Set up the Monitoring infrastructure and take some simple metrics from the resource

providers; • Set up the I&A Manager, so it will be possible to manage users.

In the case of the release v2 (M24), the proposed features for implementation are:

• Second version of the EUXDAT Portal, improving the tool for launching data analytics, and including a marketplace and the link with the D&A Catalogue for searching and accessing information;

• Improved version of the Orchestrator, using Cloud + HPC infrastructures and with a simple algorithm for providers selection;

• Enable more monitoring metrics, being able to retrieve information for creating application profiles (i.e. resources used);

• First version of the Data Manager, able to move data using several infrastructures and protocols (at least GridFTP or similar);

• Complete D&A Catalogue, for the datasets and applications/algorithms. Finally, in the case of the release v1 (M32), the proposed features for implementation are:

Page 38: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 38 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

• Final version of the EUXDAT Portal, with the complete version of the tool for launching data analytics and integrating community tools (i.e. forums) and monitoring interfaces;

• Populated D&A Catalogue and Repository; • Improved Orchestrator, able to generate profiles and to use them to allocate resources; • Complete Data Manager, including the datasets evaluation mechanism for improving data

movement; • Complete list of monitoring probes (i.e. add metrics from the applications running).

Page 39: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 39 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

5. Detailed Design of Main Components This section provides a deeper view about the high level components identified, giving an idea of their internal composition. It is not the purpose of this document to enter into the details of the implementation of each high level components, since that will be defined in WP3 and WP4 which, later on, will also implement the components. It is important to highlight that the proposed diagrams include not only the subcomponents that we identify, but also how these are related to other high level components, in order to specify which parts are expected to interact.

5.1 EUXDAT Portal

The EUXDAT Portal is the one-stop-shop of the EUXDAT e-Infrastructure, which provides access to all the functionalities which are available for the end users. It is a web-based graphical user interface which allows navigating through different features, such as finding datasets (and their metadata), launching data analytics, accessing applications and algorithms or accessing a support system and forums for enabling interaction among the members of the EUXDAT community.

Figure 7: EUXDAT Portal High Level Architecture

The EUXDAT Frontend centralizes all the web interfaces. It provides the basic part of the web in which all the rest elements are integrated through the corresponding menus. Therefore, it communicates with other tools which provide additional interfaces and business logic to be used by the e-Infrastructure users.

Page 40: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 40 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

The Users Manager component will provide the interfaces for creating users, modifying their data, accessing current data (including credentials for resources providers) and deleting users. It interacts with the I&A Manager component for managing users data and for performing login operations. The Support forums component refers to tools such as AskBot, a Q&A software, which can be used for enabling interactions between the e-Infrastructure users or users and developers/operators. The Monitoring Interface will be a GUI for the monitoring information (such as Grafana) which will allow to access information from the Monitoring component, so users will know about the status of their applications and other parameters (i.e. infrastructures available). The Data Analytics Launcher component allows configuring the analytics tasks, their parametrization and their execution through the Orchestrator. It will retrieve the input information from the end users and provide this information for the right orchestration of resources and tasks, showing the results to the user when these are available. Finally, the Marketplace and the Data Browser allow to access information about applications/algorithms and datasets, including metadata, also enabling the possibility to ‘buy’ data and applications that can be used later on. The Data Browser is focused on datasets metadata access and search, while the Marketplace is focused on the interface for selling products. Both of them interact with the D&A Catalogue, which manages the information about the applications and datasets.

5.2 Identity and Authorization Manager

The Identity and Authorization Manager component is the one in charge of identifying users and managing their permissions and credentials.

A specific component on the platform backend, the Authentication Service, deals with token encrypting/unencrypting tokens transmitted from other modules (the frontend or another backend module). Once the token is unencrypted, the service checks the credentials validity in the LDAP user base. This means that any other component of the e-Infrastructure will use this component to control the access rights to functionalities and data.

Figure 8: I&A Manager High Level Architecture

The Authentication Service will be invoked by other components in the EUXDAT e-Infrastructure, in order to guarantee that the token used by the user is valid. The Authentication Proxy component will

Page 41: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 41 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

act as interface, managing all the invocations. The Authentication Service will be, in the end, responsible of the Single-Sign-On feature, supporting several standards, in such a way that the complexity in the integration with other components will be as low as possible.

As a way to make the interaction secure, the communication will be done using the HTTPS protocol. This applies to the communication between the Authentication Proxy and the Authentication Service, as well as for the Authentication Service with the LDAP component, taking care of the users database. We will take into account the new European GDPR, since the users database will contain information that is affected by such regulation, and some data (such as credentials for using computational resources) is sensitive and needs to be protected carefully.

Taking into account the criticality of the Authentication Service functionality, we consider it should be deployed in a specific container, so it will be possible to isolate the component and also apply some techniques which guarantee its availability and scalability.

5.3 Data and Algorithms Catalogue

The Data and Algorithms Catalogue takes care of keeping a list of applications/algorithms and datasets available, together with their corresponding metadata.

Figure 9: D&A Catalogue High Level Architecture

It will have, basically, two main sub-components: a Data Catalogue and a Marketplace. While the Data Catalogue is focused on managing the list of datasets, the Marketplace deals with the applications/algorithms which can be used and it also allows for selling datasets. The Data Catalogue can be implemented with CKAN, together with some specialized extensions which may improve the features provided by the catalogue, such as Disqus (for allowing comments), DataStore and DataPusher (for storing certain data), Google Analytics (for information about datasets statistics), etc. Other alternatives that are under study are Open Micka and GeoNetwork, which are focused on managing data catalogues in the context of resources linked to geographical spaces. All the discussions related to this topic happen in the context of WP3.

Page 42: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 42 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

The Marketplace can be implemented by using a tool such as the Fiware Business Framework, which allows organizing elements in categories, accepts any kind of element (i.e. datasets, algorithms, complete applications, etc). It keeps some metadata about the published elements that can be used for understanding the purpose of the elements and the applied business model. It also manages payments, even splitting them whenever necessary. The policies and permissions granted to the users for using the Marketplace will be defined in the context of WP3, which will define roles and configurations. These two components offer interfaces which should be integrated in the EUXDAT Portal, in order to be accessible to end users easily. Of course, they will require an integration with the I&A Manager in order to check users’ credentials. Additionally, they will interact with the D&A Repository, since it is the one storing code, images and datasets and the catalogue needs to keep the right links to the elements. Finally, the Data Manager will, eventually, interact with the Data Catalogue for retrieving information about the datasets location and other metadata that might be useful.

5.4 Data and Algorithms Repository

The Data and Algorithms Repository has two main parts: one dedicated to the storage of datasets and another one dedicated to the storage of code and generation of images.

Figure 10: D&A Repository High Level Architecture

The role of the application management module is to maintain a register from which application can be started on demand, monitored and stopped. It should allow easy retrieving of the data produced by the application it manages (results and logs). It should enable easy access to any service provided by the platform: data access, remote processing capability, utility service… Requests to the application manager could be issued by the platform users through the frontend module, or through a workflow manager. Thus, it should expose a clear and open API.

Page 43: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 43 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

5.5 Data Manager

The Data Manager is the component supporting data movement tasks, trying to optimize the way in which data are transferred to the place where the computation is performed.

Figure 11: Data Manager High Level Architecture

In this component, we will have a Data Mover component, which will be the one managing the process of moving data, by locating the data, deciding the data source to use and requesting the corresponding data connector to move the data from one storage solution to another. It will also take decisions about caching mechanisms to improve the optimization (i.e. maintain several local copies of certain datasets that are used many times). The Datasets Evaluator will analyse the metadata and other information about the datasets available, in order to evaluate which ones are preferred to be used (taking into account the data quality mechanism mentioned in the DoA). It will retrieve information from the D&A Catalogue and the Monitoring components. Finally, the Data Storage Connector will be a component with multiple connectors which implement protocols for connecting to different data sources and transferring data (i.e. GridFTP). Such component will interact with the Data Repository component and with external storage solutions (i.e. from Cloud and HPC).

5.6 SLA Manager

The SLA Manager is in charge of the negotiation of SLAs and, after such negotiation, of the adequate management and fulfilment of the agreements.

Page 44: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 44 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Figure 12: SLA Manager High Level Architecture

Therefore, the SLA Manager is expected to have three main blocks. The first one is devoted to the negotiation itself (the SLA Negotiator) and it will provide the needed APIs for performing the negotiation (in one or more rounds) and for accessing any required information about the agreements. It will interact with the Orchestrator for doing the negotiation, based on some pre-defined levels (i.e. gold, silver, etc.) as a way to simplify the automatic negotiation activity. In principle, we consider the WS-Agreement specification in order to define the negotiation process, interfaces and document formats. Once the SLA negotiation has been finished, the SLA Negotiator will provide the SLAs Repository, so the SLA will be properly stored in case it is necessary. The SLAs Repository will use a database for storing the information, bearing in mind the original format of the SLAs (XML files). The SLA Negotiator will also notify the SLAs Monitor, so it will be aware that another SLA requires the monitoring of certain metrics. Such component will retrieve information from the Monitoring component ad it will check the metrics with the agreed boundary conditions, in order to trigger alerts when a SLA breach is detected.

5.7 Orchestrator

The Orchestrator is the component that is going to decide how to execute data analyses, as configured by the end users. As it is one of the most important components, it has a close relationship with other components of the architecture. The Orchestrator Interface is the component receiving the requests to run applications from the EUXDAT Portal. The Orchestrator Interface receives the request and analyses the configuration

Page 45: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 45 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

parameters in order to trigger SLA negotiations with the SLA Manager. The negotiation will take into account pre-defined levels of Quality of Service and the received parameters. Once the SLA is agreed, the interface requests the Orchestration Engine to run the corresponding application with the selected data.

Figure 13: Orchestrator High Level Architecture

The Orchestration Engine will implement the algorithm which will perform the selection of resources providers and will manage the execution of tasks. In principle, we consider to use an adapted version of the Cloudify orchestrator, with modifications to support HPC systems. It is able to use TOSCA [6] files as input (representing the application workflow) and then to split the execution among different resource providers. Such input files will also indicate constraints, such as licensed software, that the Orchestrator will deal with (guaranteeing that licenses will be used as expected). In order to enable the interaction with different Cloud and HPC providers, the Orchestration Engine retrieves credentials from the I&A Manager and it manages the interaction through a set of Infrastructure Connectors. Such connectors implement concrete mechanisms for supporting several Cloud providers (OpenStack, Open Nebula, AWS, etc.) and HPC workload managers (Slurm, torque, etc.). As a way to support decision making, the Orchestration Engine makes use of the input from the Monitoring Connector (which retrieves information from the Monitoring component) and from the Profiles Manager (which takes monitoring information for generating application profiles that provide a clue about the behaviour of the applications and the kind of resources to allocate).

Page 46: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 46 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Finally, the Orchestration Manager also interacts with the Data Manager, since it will need to move data, as defined in the application workflow, especially for bringing input datasets and for sending result datasets.

5.8 Monitoring

The Monitoring component is a key component which collects and stores information which is useful for several components of the EUXDAT e-Infrastructure and also for end users. It retrieves information about the resources status, as well as about the applications running at the e-Infrastructure.

Figure 14: Monitoring High Level Architecture

The Monitoring component will provide an interface which allows to activate and retrieve concrete metrics, so other components will be able to get the information they need. It will also provide a user friendly interface (i.e. Grafana) in order to visualize the monitored metrics, depending on users’ profiles. The metrics can be collected in two ways: they can be sent directly to the Monitoring Collector (i.e. external probes which send information periodically to be stored) or they can be requested from the Monitoring component, through Monitoring Connectors (they ping the monitored entity in order to retrieve a concrete metric in a synchronous way). In the second case, the Monitoring Connectors component represents a set of custom probes which perform the monitoring activity, providing the information to the Monitoring Collector. The Monitoring Collector (representing solutions such as Prometheus or Snap Telemetry) will retrieve all the information and will provide it to the Monitoring Storage component. The Monitoring Collector is expected to be able to perform some analysis and management, such as defining alerts and detecting when such alerts must be triggered.

Page 47: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 47 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Finally, the Monitoring Storage will represent a time-series database, which will store all the received metrics with the purpose of creating a historic of the monitored information (as it is expected to be requested by other components such as the Orchestrator or the SLA Manager). For the storage, we will also consider any constraint derived from the application of the GDPR, since some information about billing could be sensitive and be linked to personal data that must be protected.

Page 48: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 48 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

6. EUXDAT Deployment This section explains how the EUXDAT e-Infrastructure components will be deployed, so it will be possible to obtain an operative environment.

6.1 Deployment Infrastructure

The deployment infrastructure takes different aspects in consideration, such as the overall distribution of resources and services, described in 6.1.1, the staging concept for development and testing, described in 6.1.2 and the required infrastructure services managing component deployments.

Deployment 6.1.1

The deployment infrastructure is divided into 3 major parts, there is the portal environment where all portal related and global services run, e.g. Monitoring. On this level all components resides that are intended to steer workflows going through the EUXDAT platform. The actual computation is carried out either in a Cloud or HPC environment, or in both, depending on the workload definition and how the orchestrator decides under consideration of current load and SLAs.

Figure 15: EUXDAT Deployment

Page 49: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 49 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Stages 6.1.2

In addition to the general deployment setup, there are platform stages required, in terms of development environment, integration environment and production environment. These deployment stages solely concern the EUXDAT portal and all involved components, however not the computation environments which can be considered to be in production, only.

Figure 16: EUXDAT Deployment Stages

There are three stages, first one is the development environment where developers deploy and test their components during development phases, in this environment simple test data sets will serve as input to process. As soon as a component can be considered stable and working, it is staged to the integration stage where the interaction with other components of the platform is tested with real data. Components that passed QA successfully are staged into the production environment. All three stages use the same Cloud and HPC backend.

Services 6.1.3

For the deployment, there are services in place, taking care of automated deployments, triggered by updates. There is a git review tool (gerrit) and code repository (gitlab), as well as a continuous integration tool (Jenkins). These tools in combination provide the required services for an automated management of commits and their deployment onto the corresponding stage.

Page 50: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 50 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

Figure 17: EUXDAT Deployment Services

In a first step an update is committed to gerrit, where the code review takes place. Upon successful passing, code is merged and stored in the gitlab repository. Jenkins is triggered each time a commit reaches the repository. From there Jenkins checks it out, and deploys it on the corresponding stage. There are 3 repositories intended, one for each stage, where in-between code gets staged upwards, only. Further, Jenkins is capable to execute regression test suites automatically after component deployment and also provides logs for investigation in case of issues.

6.2 Components Deployment

In principle, we propose to deploy every component of the EUXDAT e-Infrastructure in VMs, looking at them as micro services. We also consider using containers with a solution such as Kubernetes for managing them correctly. In any case, it is necessary to think about the best way to do the deployment, taking into account how each component works, as a way to get a good performance. In the ideal case, each component will be deployed in one VM, but we can deploy some components together, in order to save resources and minimize communication during their interaction. For instance, the SLA Manager, Orchestrator and Monitoring could be deployed together or, at least, in the same physical machine, since they will interact closely and the SLA Manager and Orchestrator are not expected to be active continuously. The D&A Repository and D&A Catalogue could be located as well. In this case, it is important that the storage capability will be high, so the repository will have enough capacity. As the D&A Catalogue reflects what is contained in the repository, we can assume they will collaborate closely. The Data Manager could be deployed next to them as well, since it will need to collaborate continuously with the repository.

Page 51: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 51 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

In the case of the I&A Manager, it may be deployed alone or with another component, since it is not expected to require a lot of resources. For instance, it could be deployed together with the EUXDAT Portal, although this component may need scalability capabilities in case a lot of users try to access the Portal at the same time (taking into account that it will have a web server and other tools). Both of them are closely related because of the users management feature. Since all the tools to be used in each component have not been identified yet (as this will be done in the detailed design of WP3 and WP4), it is not possible to determine, yet, the concrete configuration required for the VMs. Therefore, the deployment will be analysed again once there is a first version of the detailed design.

Page 52: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 52 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

7. Conclusions This document presents a list of the features identified for the EUXDAT e-Infrastructure, as well as a high level architecture. Such architecture includes the definition of some high level components and their interactions. Additionally, the document presents a high level design of the mentioned components and a proposal for deploying the whole e-Infrastructure, so it can become operational. When identifying the features, we followed a top down approach, taking into account users’ requirements, but also our previous experiences, which showed to be useful, since we were able to include some additional features and details to already identified ones, completing the picture of what is needed for EUXDAT. The proposed architecture is modular and flexible, in such a way it maps with the current features while it allows for future adaptations without radical changes in the proposed architecture. This also applies to the interactions and the high level designs, which allow using different tools for the same purpose (i.e. Cloudify or other existing solutions could be used as Orchestration Engine). This flexibility is also positive for the evolution and update of the e-Infrastructure in the future, if required. In any case, we are aware that there are still a lot of aspects to be detailed and decided in the context of other WPs (WP3 and WP4 mainly), and the architecture will be adapted as needed once the corresponding decisions are made. This is the reason because the deployment aspects are also open, since we still do not know the system requirements of the software to be used. Therefore, deployment will be re-analysed once WP3 and WP4 have a clear idea of the implementation. Finally, as mentioned before, the current version of the architecture is not written in stone and it will be adapted in the subsequent versions, according to any new requirement identified and the changes required by the way that the components can be implemented. Next versions of the architecture document will reflect any required change.

Page 53: D2.2 EUXDAT e-Infrastructure Definition · 2018-11-12 · Also, the consortium experience will support such analysis, maybe identifying new features not requested yet, or even adapting

Document name: D2.2 EUXDAT e-Infrastructure Definition Page: 53 of 53

Reference: D2.2 Dissemination: PU Version: 1.0 Status: Final

8. References [1] EUXDAT; “D2.1 Description of Proposed Pilots and Requirements”; Jedlička, Karel; 2018.

[2] European e-Infrastructure for Extreme Data Analytics in Sustainable Development (EUXDAT). Grant Agreement. Nieto, Francisco Javier. 2017.

[3] W3C; SOAP Version 1.2 Part 0: Primer (Second Edition); http://www.w3.org/TR/2007/REC-soap12-part0-20070427/ (27 April 2007, retrieved 2018-05-25)

[4] SDI4Apps; Open Land Use Map; http://sdi4apps.eu/open_land_use/; retrieved 2018-05-25

[5] Copernicus; Copernicus Data Access; http://copernicus.eu/data-access; retrieved 2018-05-25

[6] OASIS; TOSCA Simple Profile in YAML Version 1.1; http://docs.oasis-open.org/tosca/TOSCA-Simple-Profile-YAML/v1.1/TOSCA-Simple-Profile-YAML-v1.1.html; 30th January 2018; retrieved 2018-05-25