aflux: big data analytics in iot mashup tools• introduction to new jvm based mashup tool: aflux...

Tanmaya Mahapatra

Technische Universität München

Fakultät für Informatik

Lehrstuhl für Software & Systems Engineering

Garching, 05.07.2018

aFlux: Big Data Analytics in IoT Mashup Tools

• IoT Application Development Landscape & Challenges• IoT Mashup Tools: Examples, Limitations• Introduction to new JVM based Mashup Tool: aFlux• aFlux Internals• Supporting Big Data Analytics• Towards a Unified Approach for querying Big Data• Stream Analytics in aFlux• aFlux: Supporting Spark• aFlux: Supporting Flink• Road ahead

2Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra

Slide Contents

IoT Application Development:Landscape & Challenges

IoT has been defined as the interconnection of ubiquitous computing devices for the realization of value to end users.

• The true potentials of IoT are realized only by its application landscape.• Data collection from devices:

• For analysis to better understand the environmental context.• Task automation for time optimization and enhancing the quality of

human life.


IoT & Application Landscape

Insights from data can allow the creation of sophisticated & high-impact applications. E.g. Traffic congestion can be avoided by using learned traffic

patterns.


Usage of IoT Data

Mining individual and group preferences

Mining patterns of end-users (mobility models, …)

Analyzing the state of engineering structures (structural health monitoring, …)

Predicting the future state of the physical environment (flood prediction in rivers, …)

• Unfortunately, the development of software application for IoT is notsimple.


Difficulties in IoT Application Development

Write complex boiler plate codes to access devices

Perform data mediation

Device identification & Co-ordination

A mashup is a composite application developed starting from reusable data, application logic, and/or user interfaces typically sourced from the web.


Streamlining Application Development for IoT: Mashups

User Interface

Logic

Data

IoT Mashup Tools: Examples & Limitations

• Mashup Tools typically offer graphical interfaces for specifying the data flow between sensors, actuators & services which lowers the barrier of creating IoT applications for end-users.


IoT Mashup Tools

State of the Art of Existing Mashup Tools

Device Management

• Provide simplified mechanisms to register IoT devices to the platform

• Devices can be named, grouped, deleted

Mashup Creation

• Provide a visual programming environmentto combine different services

• Provide support for pseudo codes/code-snippets for enhancing the business logic

Mashup Deployment

• Provide a run-time environment to execute the mashups.• Deployed mashups can be accessed by REST or can be re-

mashed.

Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra 10

• Node-RED is an open-source mashup tool developed by IBM.• Provides a GUI where users drag-and-drop blocks that represent

components of a larger system which can either be devices, software platforms or web services that are to be connected.

• These blocks are called nodes.• A node is a visual representation of a block of JavaScript code designed

to carry out a specific task. • Additional blocks(nodes) can be placed in between these components to

represent software functions that house business logic/data mediation logic.

• With Node-RED the time and effort spent on writing boilerplate code is greatly reduced, and the developer can focus on the business logic.


IBM Node-RED


Flow of a Mashup App in Node-RED

Data from Sensors

Data Mediation

Business Logic Services

Mashup Design of a custom Intrusion Detection System from firewall logs

Strengths of Mashup Tools: At a Glance

Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra 13

Assist the user in application development

Abstracts the low-level complexity

Flexible & Intuitive

Support Integrated administration

Developmental methodology: Mashup + Model-based


Research Direction: Why Integration is Potentially Beneficial? Mashup

ToolsBig Data

Tools

Complex & massive

parallelism

Good for analysing huge data-sets

Operate on more abstract level & heterogeneous

datasets

Simple & single threaded

Good for specifying control

flow

Operate on specific datasets

Full fledged integration would enable mashup developers to specify Big Data analytics jobs and consume their results within a single

application model.


Existing Limitations of Mashup Tools

Blocking execution and synchronous communication in mashups

Single-threaded mashups

Visual Programming Limitations

End-user focus in mashup tools

aFlux:New JVM based Mashup Tool


aFlux

IBM Node-RED aFlux

Asynchronous Execution

Multi-threaded Mashup

Support for Big Data Analytic

Towards Unified Big Data querying

Easy Creation of Domain Specific Apps for end-users

• In the actor model, an actor is the foundation of concurrency or rather like an agent which does the actual work.

• It is analogous to a process or thread. • Actors are very different from objects

• an object can interact directly with another object i.e. changing its values or invoking a method.

• This causes synchronization issues in multi-threaded programs • In an actor system there is no direct way to invoke or interact with an

actor. • They respond to messages.

• An actor may change its internal state, perform some computation, fork new actors or send messages to new actors.


aFlux: Conceptual Approach


How actors behave?

• In designing aFlux, we decided to go with akka, a popular actor system. • The main intuition is that when a user designs a flow we model this flow

internally in terms of actors i.e. an actor is a basic execution unit of the mashup tool for designing a flow.

• we have three actors namely A, B and C where the computation starts with actor A. On completion it sends a message to actor B and so on.

• Since the akka system can be configured in many different ways for parallel and distributed operations and it abstracts away how the actors execute within it, it makes the user created program also parallel and distributed.


aFlux

• In aFlux, a user can create a mashup flow which is called a “flux”. • A flux is analogous to a flow in IBM Node-RED. • The only requirement for designing a flux is that it should have a start

node and an end node.


Execution Pattern


Asynchronous Execution Patterns

• The synchronous components typically block the execution flow, i.e. when they receive a message on their input port they start execution and pass the message through their output ports on their completion.

Synchronous

• On the other hand, asynchronous-capable components have two different types of output ports namely, blocking and non-blocking ports.

Asynchronous


Asynchronous Execution Patterns

• To abstract away independent logic within a main application flow, the system supports logical structuring units called sub-flows.

• A sub-flow encompasses a complete business logic and is independent from other parts of the mashup.

• The sub-flows in the system are like normal asynchronous-capable components. They have input and two sets of output ports (i.e. blocking and non-blocking).

• They encompass within themselves a complete flow of graphical components used for Big Data querying.


Logical Structuring Units in aFlux


A Sub-Flow can house a complete Flux

Supporting Big Data Analytics in aFlux

• aFlux has graphical components specific to Big Data query languages.



• The last component used in a Big Data query is the Big Data executor• This executor component parses all the prior connections and generates

the final query to be passed on to the Big Data system for execution.




A Pig Flux in aFlux

Towards a Unified Approach for querying Big Data

• The graphical components for Big Data queries are tightly coupled to the semantics of the target query language.

• The target languages may evolve and hence we need to create more components• This complicates end-user development

• The idea is to have a set of uniform graphical components for query framing.

• Develop a unified query language to permit end-users to formulate their data queries.

• Internally use translation models to generate queries in different target languages depending on the need as well as context.

• It introduces a layer of abstraction between the language elementsshown on the user interface, i.e. visual elements, and the target querylanguage which is executed on Big Data systems


Towards Unified Analytics


Unified Analytics: Internals

Stream Analytics in aFlux


Data Processing

• Streams are NOT collections• Stream is a sequence of ongoing events ordered in time• Streams have no beginning and no end (unbounded data)• Ever-changing sets of data and values always in motion• Traversable only once


IoT and Stream Analytics

Stream Analytics platforms

36

• Stream analytics is important in the IoT business but− IoT mashups are not optimized for stream analytics− IoT mashup tools (e.g. Node-Red) are single-threaded and

implement a synchronous model


Solution

Benefits:−No complicated set ups−No coding to analyze data streams−Non-experts can use this tool−Short learning curve−Easy to use & user-friendly UI

Goals:1. Implement a stream analytics suite2. Parameterize stream analytics properties3. Evaluate our stream analytics tool


Potential Benefits & Goals

• Akka: Library used for stream analytics (Scala & Java)• Higher-level abstraction over Akka’s existing actor model• Implementation of the Reactive Streams specifications


Solution

3 component types:1.Processing: Filter, Count, Moving Average, Max, Min2.Fan-in: Merge, Zip3.Fan-out: Broadcast


Stream Analytics in aFlux

Stream analytics component architecture

Akka component

Akka Streams component

Akka component

Akka Streams component

41

• Ensure that the receiving side is not forced to buffer arbitrary amounts of data. (overflow strategy defined by user, e.g. back-pressure)

• Non-blocking asynchronous communication between components• Allow the queues which mediate between threads to be bounded.

(queue size defined by user, no failures due to big workload of data)− The benefits of asynchronous processing would be negated if the

communication of back-pressure were synchronous (like the single-threaded Node-Red)

Property ParameterizationBuffer size, Overflow strategy, Window parameters


Benefits & Property Parameterization


Property Parameterization


Property Parameterization • Buffer size of the source queue• Overflow strategies:

• dropHead• dropTail• dropNew• dropBuffer

• Window parameters:• Window type (content/time)• Window method

(tumbling/sliding)• Window size• Sliding step

Scenario:• A9 – highway in Munich • All cars run in the same direction (South to North)• 3 loop detectors on 3 lanes• 4th lane (shoulder lane) is initially closed• 500th tick à accident happens• When average occupancy rate of loop detectors > 30% à open shoulder-

lane


Evaluation

• aFlux flow for stream processing, • SUMO traffic simulator (Traffic simulation A9 project)

− TraCI + Kafka


Experiment Setup

• Testing parameters:1. Loop detectors’ occupancy2. Mean speeds of cars3. Lane state4. Time

• Analytics methods tested:1. One-by-one processing (no window)2. Content-based tumbling window processing of size 503. Content-based tumbling window processing of size 3004. Content-based tumbling window processing of size 5005. Content-based sliding window processing of size 500


Testing Parameters & Analytics Methods Used


Average mean speed of cars: No Window


Average mean speed of cars: Diff. Windows

Method Responsiveness Settling Time ScalabilityNo Window Very Slow Long LowTumbling 50 Very Fast Very Long Very LowTumbling 300 Slow Very Short HighTumbling 500 Slow None Very HighSliding 500 Slow None Very High


Summary of Results

aFlux: Supporting Spark


Enabling Spark Queries in aFlux

Sample Spark Application


Spark Flow in aFlux


Typical Structure of Big Data Jobs


Spark Jobs in aFlux: Concepts


Zooming into the Core Idea

aFlux: Supporting Apache Flink


Apache Flink: Overview

Smart Santander

City-scale experimental research facility• 3000 IEEE 802.15.4 devices• 200 GPRS modules• 2000 joint RFID tag/QR code labels• Static locations (streetlights, façades, bus stops) as well as on-board

of mobile vehicles (buses, taxis)

Integration of Big Data Analytics in IoT Mashup Tools | MahapatraSource: http://smartsantander.eu

59 of 20

Smart Santander

Use cases• Traffic Intensity Monitoring−Around 60 devices located at the main entrances of the city of

Santander • Environmental Monitoring−Around 2000 IoT devices installed mainly at the city center−Sensors installed in around 150 public vehicles, including buses,

taxis and police cars.• Outdoor parking area management• Parks and gardens irrigation

Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra60 of 20

Smart Santander

Enabling Stream Analytics in Smart Santander with FlinkMappings of the API required:• DataStream API for stream processing• CEP Library for event processing

Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra

Source: https://flink.apache.org/

61 of 20

Steps: Overview

Task #1: SmartSantander connector for Flink

Task #2: aFlux actors

Task #3: Flink API mapper

Task #4: Front-end Validation

Integration of Big Data Analytics in IoT Mashup Tools | Mahapatra 62 of 20

Implementation Tasks: aFlux Actors


Actors are in charge of:• Generating Java code from the user-defined properties• Generating final Flink job• Exchanging messages with previous/posterior actors → FlinkFlowMessage

Set of aFlux tools developed:

63 of 20

Implementation Tasks: aFlux Actors


Structure of a Flink job

64 of 20

Implementation Tasks: Front-End Validation

Semantics between nodes:

An array of ToolSemanticsCondition can be defined when creating a new toolErrors are shown to the user when creating the flow


<Tool A>should

come (immediately)before

<Tool B>must after

65 of 20

Evaluation

Goal → prove how easy it is to create Flink jobs from aFluxEvaluation based on two use cases:


UC1: Real Time Data Processing

AggregateFunction, AllWindowedStream, DataStream, FilterFunction, MapFunction, RichSourceFunction, SlidingWindow, StreamExecutionEnvironment, TumblingWindow

Code Description

UC1E1 Temperature vs. air quality in a certain area in relation with the average of the city

UC1E2 Air quality vs. traffic charge in the city center

UC1E3 Noise vs. traffic charge in the city center

UC1E4 Max/min monitor

UC2: Pattern Detection

DataStream, Pattern, PatternSelectFunction, PatternStream

Code Description

UC2E1 Traffic increasing in a certain area

UC2E2 Heatwave in the city

Evaluation

UC1E1: live data from SmartSantander API (6th June 2018 9am-10am @ 5-min Wind.)


Road Ahead …


What’s Next?Current Agenda:

1. Paper in Review: Stream Analytics in IoT Mashup Tools2. 2 Papers in Pipeline: Spark and Flink3. Development of a Uniform Methodology, Thesis Writing4. Thesis Submission Planned in June, 2019.

Thanks -- Questions? Ideas?

aflux: big data analytics in iot mashup tools• introduction to new jvm based mashup tool: aflux...

Documents