Big Data Europe Transport Pilot Case, Luigi Selmi

Pilot SC4 L. Selmi - BDE - SC4 Webinar BDE SC4 02.12.2016

Objective of the Pilot SC4

A scalable, fault-tolerant and flexible platform based on open source frameworks that can process unbounded data sets and graphs.

Microservice Architecture

Message Broker

Apache Kafka is a high-throughput, distributed, durable messaging system.

Apache Kafka

Kafka Cluster

Stream and Batch Processor

Apache Flink is an open source platform for distributed stream and batch data processing.

Apache Flink

Flink Cluster

Storage and Indexing

PostGIS is a spatial database that stores the road network data. Elasticsearch is a distributed, open source document store built on top of Apache Lucene; it stores the results of the workflow.

Elasticsearch Cluster

Pilot Architecture

BDE Components

The FCD Pipeline

Visualization

Pilot SC4 can map-match real-time Floating Car Data (FCD) and classify each road segment according to its traffic level.
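To make the term map-matching concrete, here is a minimal sketch that assigns a vehicle position to the nearest road segment. It is a naive nearest-segment baseline, not the pilot's algorithm, which the slides do not describe. The names `point_segment_distance`, `map_match`, `ROADS` and the 50 m cut-off are hypothetical, and coordinates are assumed to be already projected to a planar system in metres rather than raw lat/lon.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b; all points are (x, y) in metres."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def map_match(point, segments, max_distance=50.0):
    """Return the id of the nearest segment within max_distance, or None."""
    best_id, best_dist = None, max_distance
    for seg_id, (a, b) in segments.items():
        d = point_segment_distance(point, a, b)
        if d <= best_dist:
            best_id, best_dist = seg_id, d
    return best_id

# Hypothetical road network: two parallel streets 30 m apart.
ROADS = {
    "seg_a": ((0.0, 0.0), (100.0, 0.0)),
    "seg_b": ((0.0, 30.0), (100.0, 30.0)),
}

print(map_match((50.0, 5.0), ROADS))    # closer to seg_a
print(map_match((50.0, 500.0), ROADS))  # too far from any road
```

A production matcher would typically also use heading, speed and the previous matches of the same vehicle; the sketch only shows the geometric core.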

Distributed computing: the theoretical minimum

Minimum requirement for fault-tolerance and scalability

● Cluster of 3 nodes (Docker Swarm)

● 4 CPU cores per node

● 1 (Flink) worker per node

● 1 (Flink) slot per CPU core

Max parallelism = 3 nodes × 4 slots per node = 12
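The arithmetic behind the maximum parallelism can be written out as a short sketch. The function name `total_slots` is hypothetical; the figures are the ones listed above, with one slot per CPU core giving 4 slots per worker.

```python
def total_slots(nodes, workers_per_node, slots_per_worker):
    """Number of processing slots available across the whole cluster."""
    return nodes * workers_per_node * slots_per_worker

CORES_PER_NODE = 4  # one slot per CPU core, so 4 slots per worker

slots = total_slots(nodes=3, workers_per_node=1, slots_per_worker=CORES_PER_NODE)
print(slots)  # 12

# If one of the three nodes is lost, the remaining two still offer 8 slots.
slots_after_failure = total_slots(nodes=2, workers_per_node=1, slots_per_worker=CORES_PER_NODE)
print(slots_after_failure)  # 8
```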

Parallelization: map-match subtasks

1. source()

2. mapMatch()

3. keyBy()/window()/apply()

4. sink()

The subtasks can be distributed in slots with different parallelism (e.g. from 1 to 12)
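As an illustration of how a keyed step can run at different degrees of parallelism, the sketch below routes records to parallel subtasks by hashing the road segment id, so that all records of one segment reach the same subtask. This is a simplified stand-in, not Flink's actual key-group assignment; `subtask_for`, `partition` and the sample records are hypothetical.

```python
import zlib
from collections import defaultdict

def subtask_for(key: str, parallelism: int) -> int:
    """Map a key to a subtask index in [0, parallelism) using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % parallelism

def partition(records, parallelism):
    """Group records by the subtask that would process their road segment."""
    buckets = defaultdict(list)
    for record in records:
        buckets[subtask_for(record["road_seg_id"], parallelism)].append(record)
    return dict(buckets)

# Hypothetical map-matched records.
records = [
    {"road_seg_id": "seg_1", "speed": 42.0},
    {"road_seg_id": "seg_2", "speed": 18.5},
    {"road_seg_id": "seg_1", "speed": 39.0},
    {"road_seg_id": "seg_3", "speed": 61.0},
]

# The same records distributed over 1, 4 and 12 parallel subtasks.
for parallelism in (1, 4, 12):
    sizes = {idx: len(recs) for idx, recs in partition(records, parallelism).items()}
    print(parallelism, sizes)
```

With a parallelism of 1 every record lands in subtask 0; with higher values the segments spread over more subtasks, while each individual segment stays on one subtask.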

Parallelization: Flink dataflow

A slot can process all the subtasks in a pipeline

Parallelization: input and output data

Input record fields: device_id, timestamp, lat, lon, speed, orientation, transit

The mapMatch subtask preserves the time order of the records, so that the next task, keyBy(road_seg)/window(15 min)/apply(), returns the correct average speed and number of vehicles within each time window for each road segment.

Output record fields: road_seg_id, start_date, num_vehicles, avg_speed
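The keyed window step can be sketched in plain Python as a batch computation over already map-matched records. This is not the pilot's Flink job. It assumes timestamps are epoch seconds, that windows are non-overlapping (tumbling) 15-minute intervals, and that the number of vehicles means the count of distinct device ids; `window_start`, `aggregate` and the sample data are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # 15-minute window

def window_start(ts: int) -> int:
    """Start of the 15-minute window that contains timestamp ts (epoch seconds)."""
    return ts - ts % WINDOW_SECONDS

def aggregate(matched):
    """matched: iterable of (road_seg_id, device_id, timestamp, speed).

    Returns rows of (road_seg_id, start_date, num_vehicles, avg_speed),
    sorted by road segment and window start.
    """
    groups = defaultdict(list)
    for road_seg_id, device_id, ts, speed in matched:
        groups[(road_seg_id, window_start(ts))].append((device_id, speed))
    rows = []
    for (road_seg_id, start), observations in sorted(groups.items()):
        devices = {device for device, _ in observations}
        avg_speed = sum(speed for _, speed in observations) / len(observations)
        rows.append((road_seg_id, start, len(devices), avg_speed))
    return rows

# Hypothetical map-matched records; timestamps are seconds from an arbitrary origin.
matched = [
    ("r1", "car1", 0, 40.0),
    ("r1", "car2", 60, 50.0),
    ("r1", "car1", 899, 60.0),
    ("r1", "car3", 900, 30.0),  # falls into the next window
    ("r2", "car1", 10, 20.0),
]

for row in aggregate(matched):
    print(row)
# ('r1', 0, 2, 50.0)
# ('r1', 900, 1, 30.0)
# ('r2', 0, 1, 20.0)
```

Because the sketch assigns each record to a window from its own timestamp, it does not depend on arrival order; a streaming job that closes windows as time advances is where preserving the time order matters.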

Pilot Cycle 2 Targets

● Extend the functionalities

● Improve the technology

● Lower the boundaries

Cycle 2 - Extend the functionalities

Short-term traffic forecasts

1. Map-match 44 GB of historical Floating Car Data from CERTH (Thessaloniki)

2. Train a model (using ANN)

3. Make predictions using the model and the near real-time data

Cycle 2 - Improve the technology

● Improve the map-matching algorithm

● Parallelize the processing of the historical data

● Finalize the “dockerization” of the components

Cycle 2 - Lower the boundaries

● Set up different visualizations for traffic monitoring and forecasting

● Visualize the traffic pattern in a road segment

● Visualize the location of a vehicle and the matched road segment (for tests)

Thanks

BDE project website: https://www.big-data-europe.eu/

Code repository: https://github.com/big-data-europe

Contact: [email protected]