big data management, scalable data science, and apache ...€¦ · • first results – alexander...

45
1 © Volker Markl © 2013 Berlin Big Data Center • All Rights Reserved 1 © Volker Markl Big Data Management, Scalable Data Science, and Apache Flink: Challenges and (some) Solutions Prof. Dr. Volker Markl http://www.user.tu-berlin.de/marklv/ http://www.dima.tu-berlin.de http:// www.dfki.de/web/forschung/iam http://bbdc.berlin [email protected]

Upload: others

Post on 20-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

1 © Volker Markl© 2013 Berlin Big Data Center • All Rights Reserved1 © Volker Markl

Big Data Management,

Scalable Data Science, and Apache Flink:

Challenges and (some) SolutionsProf. Dr. Volker Markl

http://www.user.tu-berlin.de/marklv/

http://www.dima.tu-berlin.de

http://www.dfki.de/web/forschung/iam

http://bbdc.berlin

[email protected]

Page 2: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

2 © Volker Markl2 © 2013 Berlin Big Data Center • All Rights Reserved

2 © Volker Markl

ML

ML

alg

orith

ms

alg

orith

ms

data too uncertain Veracity Data Mining MATLAB, R, Python

Predictive/Prescriptive MATLAB, R, Python

Data & Analysis: Increasingly Complex!

data volume too large Volume

data rate too fast Velocity

data too heterogeneous Variability

Data

Reporting aggregation, selection

Ad-Hoc Queries SQL, XQuery

ETL/ELT MapReduce

Analysis

DM

DM

sca

lab

ility

sca

lab

ility

Page 3: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

3 © Volker Markl3 © 2013 Berlin Big Data Center • All Rights Reserved

3 © Volker Markl

Application

Data

Science

Control Flow

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Decoupling

Convergence

Monte Carlo

Mathematical Programming

Linear Algebra

Stochastic Gradient Descent

Regression

Statistics

Hashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra / SQL

Scalability

Data Analysis Language

Compiler

Memory Management

Memory Hierarchy

Data Flow

Hardware Adaptation

Indexing

Resource Management

NF2 /XQuery

Data Warehouse/OLAP

“Data Scientist” – “Jack of All Trades!”Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)

Real-Time

New Technology to the Rescue!

Page 4: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

4 © Volker Markl4

4 © Volker Markl

Big Data Analytics Requires Systems Programming

R/Matlab:

3 million users

Hadoop:

100,000

users

Data Analysis

Statistics

Algebra

Optimization

Machine Learning

NLP

Signal Processing

Image Analysis

Audio-,Video Analysis

Information Integration

Information Extraction

Data Value Chain

Data Analysis Process

Predictive Analytics

Indexing

Parallelization

Communication

Memory Management

Query Optimization

Efficient Algorithms

Resource Management

Fault Tolerance

Numerical StabilityBig Data is now where database systems were in the

70s (prior to relational algebra, query optimization

and a SQL-standard)!

People with Big Data

Analytics Skills

Declarative languages to the rescue!

Page 5: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

5 © Volker Markl5 © 2013 Berlin Big Data Center • All Rights Reserved

5 © Volker Markl

Deep Analysis of „Big Data“ is Key!

Small Data Big Data (3V)

Deep A

naly

tics

Sim

ple

Analy

sis

Many new companies and products are emerging to enable deep big data analysis;strong European contenders include Apache Flink, Parstream, and Exasol.„New companies“ are the (b)leading users of these technologies, e.g., in theinformation economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).„Traditional Big companies“ are following and still determining strategies (Industrie4.0, Logistics, Telco, etc.). Most SMEs are not ready yet to capitalize on Big Data.

Page 6: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

6 © Volker Markl6 © 2013 Berlin Big Data Center • All Rights Reserved

6 © Volker Markl

Challenge: Technologies for Data Science at the Intersection of

Data Management and Machine Learning

aggregation relational algebra UDF iteration/recursionlinear algebragraph algebraeetc.

DM MLFeature Engineering

Representation

Algorithms (SVM, EM, etc.)

Declarative Languages

Automatic Adaption

Scalable processing

Think ML-algorithms

in a scalable way

Process (iterative)

algorithms

in a scalable way

declarative

Page 7: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

7 © Volker Markl7 © Volker Markl

Apache Flink –

Big Data Batch

and Stream Processing

Page 8: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

8 © Volker Markl

• Relational Algebra

• Declarativity

• Query Optimization

• Robust Out-of-core

• Scalability

• User-defined

Functions

• Complex Data Types

• Schema on Read

• Iterations

• Advanced Dataflows

• General APIs

• Native Streaming

8

Draws on

Database Technology

Draws on

MapReduce Technology

Adds

Stratosphere: General Purpose

Programming + Database Execution

Page 9: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

9 © Volker Markl9

9 © Volker Markl

Apache Flink is an open source platform for scalable batch and stream

data processing.

What is Apache Flink?

http://flink.apache.org

• The core of Flink is a distributed

streaming dataflow engine.

• Executing dataflows in parallel on

clusters

• Providing a reliable foundation for

various workloads

• DataSet and DataStream programming

abstractions are the foundation for user

programs and higher layers

Page 10: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

10 © Volker Markl10 © 2013 Berlin Big Data Center • All Rights Reserved

10 © Volker Markl

Technology inside Flink

case class Path (from: Long, to:Long)val tc = edges.iterate(10) {

paths: DataSet[Path] =>val next = paths

.join(edges)

.where("to")

.equalTo("from") {(path, edge) =>

Path(path.from, edge.to)}.union(paths).distinct()

next}

Cost-based

optimizer

Type extraction

stack

Task

scheduling

Recovery

metadata

Pre-flight (Client)

MasterWorkers

DataSourc

eorders.tbl

Filter

MapDataSourc

elineitem.tbl

JoinHybrid Hash

build

HTprobe

hash-part [0] hash-part [0]

GroupRed

sort

forward

Program

Dataflow

Graph

Memory

manager

Out-of-core

algos

Batch &

Streaming

State &

Checkpoints

deploy

operators

track

intermediate

results

Page 11: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

11 © Volker Markl11 © 2013 Berlin Big Data Center • All Rights Reserved

11 © Volker Markl

Effect of optimization

11

Run on a sampleon the laptop

Run a month laterafter the data evolved

Hash vs. SortPartition vs. BroadcastCachingReusing partition/sortExecution

Plan A

ExecutionPlan B

Run on large fileson the cluster

ExecutionPlan C

Page 12: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

12 © Volker Markl12 © 2013 Berlin Big Data Center • All Rights Reserved

12 © Volker Markl

Why optimization ?

Do you want to hand-tune that?

Page 13: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

13 © Volker Markl13 © 2013 Berlin Big Data Center • All Rights Reserved

13 © Volker Markl

DATA STREAMING ANALYSIS

Page 14: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

14 © Volker Markl14

14 © Volker Markl

Life of data streams

• Create: create streams from event sources (machines, databases, logs, sensors, …)

• Collect: collect and make streams available for consumption (e.g., Apache Kafka)

• Process: process streams, possibly generating derived streams (e.g., Apache Flink)

14

Page 15: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

15 © Volker Markl15 © 2013 Berlin Big Data Center • All Rights Reserved

15 © Volker Markl

Stream Analysis in Flink

More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

Page 16: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

16 © Volker Markl16

16 © Volker Markl

Defining windows in Flink

• Trigger policy

– When to trigger the computation on current window

• Eviction policy

– When data points should leave the window

– Defines window width/size

• E.g., count-based policy

– evict when #elements > n

– start a new window every n-th element

• Built-in: Count, Time, Delta policies

Page 17: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

17 © Volker Markl17

17 © Volker Markl

Checkpointing / Recovery

• Flink acknowledges batches of records

– Less overhead in failure-free case

– Currently tied to fault tolerant data sources (e.g., Kafka)

• Flink operators can keep state

– State is checkpointed

– Checkpointing and record acks go together

• Exactly one semantics for state

Page 18: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

18 © Volker Markl18 © 2013 Berlin Big Data Center • All Rights Reserved

18 © Volker Markl

Checkpointing / Recovery

18

Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots

Pushes checkpoint barriersthrough the data flow

Operator checkpointstarting

Checkpoint done

Data Stream

barrier

Before barrier =part of the snapshot

After barrier =Not in snapshot

Checkpoint done

checkpoint in progress

(backup till next snapshot)

Page 19: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

19 © Volker Markl19 © 2013 Berlin Big Data Center • All Rights Reserved

19 © Volker Markl

ITERATIONS IN DATA FLOWS

MACHINE LEARNING

ALGORITHMS

Page 20: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

20 © Volker Markl20

20 © Volker Markl

Iterate by looping

• for/while loop in client submits one job per iteration step

• Data reuse by caching in memory and/or disk

Step Step Step Step Step

Client

Page 21: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

21 © Volker Markl21

21 © Volker Markl

Iterate in the Dataflow

partial

solution partial

solution X

other

datasets

Y initial

solution

iteration

result

Replace

Step function

Page 22: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

22 © Volker Markl22 © 2013 Berlin Big Data Center • All Rights Reserved

22 © Volker Markl

Large-Scale Machine Learning

Factorizing a matrix with28 billion ratings forrecommendations

(Scale of Netflixor Spotify)

More at: http://data-artisans.com/computing-recommendations-with-flink.html

Page 23: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

23 © Volker Markl23

23 © Volker Markl

Optimizing iterative programs

Caching Loop-invariant DataPushing work„out of the loop“

Maintain state as index

Page 24: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

24 © Volker Markl24 © 2013 Berlin Big Data Center • All Rights Reserved

24 © Volker Markl

STATE IN ITERATIONS

GRAPHS AND MACHINE

LEARNING

Page 25: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

25 © Volker Markl25

25 © Volker Markl

Iterate natively with deltas

partial

solution

delta

setX

other

datasets

Y initial

solution

iteration

result

workset A B workset

Merge deltas

Replace

initial

workset

Page 26: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

26 © Volker Markl26

26 © Volker Markl

Effect of delta iterations…

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

40000000

45000000

1 6 11 16 21 26 31 36 41 46 51 56 61

# o

f e

lem

en

ts u

pd

ate

d

iteration

Page 27: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

27 © Volker Markl27 © 2013 Berlin Big Data Center • All Rights Reserved

27 © Volker Markl

… very fast graph analysis

… and mix and matchETL-style and graph analysisin one program

Performance competitivewith dedicated graph

analysis systems

More at: http://data-artisans.com/data-analysis-with-flink.html

Page 28: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

28 © Volker Markl28 © Volker Markl

Current Benchmark Results

Performed by Yahoo! Engineering,

Dec 16, 2015

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and

Flink 0.10.1 show sub- second latencies

at relatively high throughputs[..]. Spark

streaming 1.5.1 supports high

throughputs, but at a relatively higher

latency.

Flink achieves highest throughput

with competitive low latency!

Source: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 29: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

29 © Volker Markl29

29 © Volker Markl

Timeline

Page 30: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

30 © Volker Markl30

30 © Volker Markl

(Strictly) Flink European Meetups with

Member Totals (as of 30.5.16)

Country Total Members

Berlin 758

Paris 500

Madrid 384

Stockholm 313

Brussels 279

London 190

Munich 98

Istanbul 56

Page 31: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

31 © Volker Markl31

31 © Volker Markl

Meetups By Country Concerning Flink

Apache Flink Meetups Worldwide (Data accurate as of 30.5.16) 6326 members strictly focused on Apache Flink (comprising 57%)4771 members broader in scope, including Flink (comprising 43%)

Page 32: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

32 © Volker Markl32

32 © Volker Markl

Distribution of (Strictly) Flink Meetup Group

Members by Country (as of 30.5.16)

Country Total Members

USA 3184

Germany 856

France 500

Spain 384

Sweden 313

Belgium 279

Brazil 233

UK 190

Taiwan 142

India 139

Turkey 56

Mexico 50

Page 33: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

33 © Volker Markl33

33 © Volker Markl

> 13 Companies Using Flink

Page 34: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

34 © Volker Markl34

34 © Volker Markl

> 6 Software Projects Using Flink

Page 35: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

35 © Volker Markl35

35 © Volker Markl

> 10 Research Institutions

Using Flink

Page 36: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

36 © Volker Markl36

36 © Volker Markl

Fault tolerance

Pessimistic Recovery:

• Write intermediate state to stable storage

• Restart from such a checkpoint in case of a failure

Problematic:

• High overhead, checkpoint must

be replicated to other machines

• Overhead always incurred, even if no

failures happen!

How can we avoid this overhead in failure-free casesru

nti

me

per

iter

atio

n (

sec)

} actual work

} checkpointingoverhead

20

120

Page 37: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

37 © Volker Markl37

37 © Volker Markl

Optimistic recovery

• Many data mining algorithms are fixpoint algorithms

• Optimistic Recovery: jump to a different state in case of a failure,

still converge to solution

• No checkpoints No overhead in absense of failures!

• algorithm-specific compensation function must restore state

pessimistic recovery optimistic recoveryfailure-free

Page 38: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

38 © Volker Markl38

38 © Volker Markl

All Roads lead to Rome

If you are interested more, read our CIKM 2013 paper:

Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, Volker

Markl: "All roads lead to Rome": optimistic recovery for

distributed iterative data processing. CIKM 2013: 1919-1928

Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios

Katsifodimos, Stephan Ewen, Kostas Tzoumas, Volker Markl:

Optimistic Recovery for Iterative Dataflows in Action.

To appear in SIGMOD 2015

Page 39: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

39 © Volker Markl39 © Volker Markl

Declarative Data

Processing

and Big Data

Page 40: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

40 © Volker Markl40

40 © Volker Markl

A Billion $$$ Mantra...

Declarative Data Processing

SQL Relations RDBMS

A simple, high-level language for querying data (Chamberlin ’74).

An effective, formal foundation based on relational algebra and calculus (Codd ’71).

An efficient, low-level execution environment tailored towards the data (Selinger ’79).

Page 41: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

41 © Volker Markl41

41 © Volker Markl

With 40+ years of success...

Declarative Data Processing

SQL Relations RDBMS

Page 42: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

42 © Volker Markl42

42 © Volker Markl

Is Being Revised

Declarative Data Processing

SQL Relations RDBMS

DistributedCollections

Parallel DataflowEngines

Second-OrderFunctions

Page 43: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

43 © Volker Markl43

43 © Volker Markl

Overall Vision & Next Steps

• First results

– Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016

– Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)

• Next Steps (Fall 2016)– Open-Source Release

• Vision (Frontend): Multi-model DSL based on type contracts– Collection Processing DataBag[A]

– Linear Algebra Matrix[A], Vector[A]

– Stream Processing Stream[A]

• Vision (Backend): Target more execution engines– Column Stores

– GPUs

Page 44: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

44 © Volker Markl44

44 © Volker Markl

Thanks to my team members and students

• Dr. Stephan Ewen

• Sebastian Schelter

• Dr. Kostas Tzoumas

• Dr. Asterios Katsifodimos

• Fabian Hüske

• Alexander Alexandrov

• Max Heimel

and many more members of the Stratosphere Project, the

Berlin Big Data Center, and the Apache Flink community

Page 45: Big Data Management, Scalable Data Science, and Apache ...€¦ · • First results – Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl:

45 © Volker Markl

Evolution of Big Data Platforms

4G

3G

2G

1G

Relational Databases

Hadoop

Flink

Scale-out, Map/Reduce, UDFs

Spark

In-memory Performance and Improved Programming Model

In-memory + Out of Core Performance, Declarativity, Optimisation of iterative Algorithms, True Streaming/Lambda