dataflows: the abstraction that powers big data by raul castro fernandez at big data spain 2014
TRANSCRIPT
![Page 1: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/1.jpg)
THE ABSTRACTION THAT POWERS THE BIG DATA
RAÚL CASTRO FERNÁNDEZ, COMPUTER SCIENCE PHD STUDENT, IMPERIAL COLLEGE
![Page 2: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/2.jpg)
Dataflows: The Abstraction that Powers Big Data
Raul Castro Fernandez Imperial College London
[email protected] @raulcfernandez
![Page 3: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/3.jpg)
“Big Data needs Democratization”
![Page 4: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/4.jpg)
3
Developers and DBAs are no longer the only ones generating, processing and analyzing data.
Democratization of Data
![Page 5: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/5.jpg)
4
Decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…
Democratization of Data
Developers and DBAs are no longer the only ones generating, processing and analyzing data.
![Page 6: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/6.jpg)
5
+ Everyone has data
![Page 7: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/7.jpg)
6
+ Everyone has data
+ Many have interesting questions
![Page 8: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/8.jpg)
7
+ Everyone has data
+ Many have interesting questions
- Not everyone knows how to analyze it
![Page 9: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/9.jpg)
8
+ Everyone has data
+ Many have interesting questions
- Not everyone knows how to analyze it
![Page 10: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/10.jpg)
9
Bob Local Expert
![Page 11: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/11.jpg)
10
Bob Local Expert
![Page 12: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/12.jpg)
11
Bob Local Expert
- Barrier of human communication
- Barrier of professional relations
![Page 13: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/13.jpg)
12
Bob Local Expert
- Barrier of human communication
- Barrier of professional relations
The limits of my language mean the limits of my world.
Ludwig Wittgenstein, “Tractatus Logico-Philosophicus”, 1922
![Page 14: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/14.jpg)
13
First step to democratize Big Data: to offer a familiar programming interface
![Page 15: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/15.jpg)
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
14
Outline
![Page 16: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/16.jpg)
Mutable State in a Recommender System
15

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

User-Item matrix (UI)

|        | Item-A | Item-B |
|--------|--------|--------|
| User-A | 4      | 5      |
| User-B | 0      | 5      |

Co-Occurrence matrix (CO)

|        | Item-A | Item-B |
|--------|--------|--------|
| Item-A | 1      | 1      |
| Item-B | 1      | 2      |

![Page 17: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/17.jpg)
Mutable State in a Recommender System
16

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

User-Item matrix (UI)

|        | Item-A | Item-B |
|--------|--------|--------|
| User-A | 4      | 5      |
| User-B | 0      | 5      |

Co-Occurrence matrix (CO)

|        | Item-A | Item-B |
|--------|--------|--------|
| Item-A | 1      | 1      |
| Item-B | 1      | 2      |

Update with new ratings
![Page 18: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/18.jpg)
Mutable State in a Recommender System
17

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

User-Item matrix (UI)

|        | Item-A | Item-B |
|--------|--------|--------|
| User-A | 4      | 5      |
| User-B | 0      | 5      |

Co-Occurrence matrix (CO)

|        | Item-A | Item-B |
|--------|--------|--------|
| Item-A | 1      | 1      |
| Item-B | 1      | 2      |

Update with new ratings
Multiply for recommendation
User-B 1 2 x
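
The recommender on this slide can be sketched as a small runnable Java class. This is only an illustration under stated assumptions: the class name `Recommender`, the dense `int[][]` storage, and the choice to recompute the co-occurrence counts from scratch on every rating are not from the talk, which only shows the `Matrix`/`Vector` calls. Co-occurrence is taken here to mean "number of users who rated both items", which reproduces the CO matrix shown above.

```java
import java.util.Arrays;

// Hypothetical, single-node version of the slide's recommender.
class Recommender {
    private final int[][] userItem; // rows: users, columns: items
    private final int[][] coOcc;    // items x items co-occurrence counts

    Recommender(int users, int items) {
        userItem = new int[users][items];
        coOcc = new int[items][items];
    }

    // Update mutable state with a new rating.
    void addRating(int user, int item, int rating) {
        userItem[user][item] = rating;
        updateCoOccurrence();
    }

    // Assumption: recompute all counts; a real system would update incrementally.
    private void updateCoOccurrence() {
        for (int[] row : coOcc) {
            Arrays.fill(row, 0);
        }
        for (int[] ratings : userItem) {
            for (int i = 0; i < ratings.length; i++) {
                if (ratings[i] == 0) continue;
                for (int j = 0; j < ratings.length; j++) {
                    if (ratings[j] != 0) coOcc[i][j]++;
                }
            }
        }
    }

    // Multiply the co-occurrence matrix by the user's rating row.
    int[] getRec(int user) {
        int items = coOcc.length;
        int[] rec = new int[items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                rec[i] += coOcc[i][j] * userItem[user][j];
            }
        }
        return rec;
    }

    int[][] coOccurrence() {
        return coOcc;
    }
}
```

With the ratings from the slide (User-A: 4 and 5, User-B: 0 and 5), the co-occurrence matrix becomes `[[1, 1], [1, 2]]` and `getRec` for User-B returns `[5, 10]`.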
![Page 19: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/19.jpg)
18
Challenges When Executing with Big Data
Big Data Problem: Matrices become large
> Mutable state leads to concise algorithms but complicates parallelism and fault tolerance

    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

> Cannot lose state after failure
> Need to manage state to support data-parallelism
![Page 20: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/20.jpg)
19
Using Current Distributed Dataflow Frameworks
Input data
Output data
> No mutable state simplifies fault tolerance
> MapReduce: Map and Reduce tasks
> Storm: No support for state
> Spark: Immutable RDDs
![Page 21: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/21.jpg)
20
> Programming distributed dataflow graphs requires learning new programming models
Imperative Big Data Processing
![Page 22: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/22.jpg)
21
Our Goal: Run Java programs with mutable state but with the performance and fault tolerance of distributed dataflow systems
> Programming distributed dataflow graphs requires learning new programming models
Imperative Big Data Processing
![Page 23: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/23.jpg)
22
> @Annotations help with translation from Java to SDGs
> Mutable distributed state in dataflow graphs
Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows
Program.java
SDGs: Stateful Dataflow Graphs
> Checkpoint-based fault tolerance recovers mutable state after failure
![Page 24: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/24.jpg)
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
23
Outline
Program.java
![Page 25: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/25.jpg)
SDG: Data, State and Computation
> SDGs separate data and state to allow data and pipeline parallelism
24
Task Elements (TEs) process data
State Elements (SEs) represent state
Dataflows represent data
> Task Elements have local access to State Elements
![Page 26: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/26.jpg)
State Elements support two abstractions for distributed mutable state:
– Partitioned SEs: task elements always access state by key
– Partial SEs: task elements can access complete state
25
Distributed Mutable State
![Page 27: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/27.jpg)
26
Distributed Mutable State: Partitioned SEs
Dataflow routed according to hash function
User-Item matrix (UI)

|        | Item-A | Item-B |
|--------|--------|--------|
| User-A | 4      | 5      |
| User-B | 0      | 5      |

Access by key
State partitioned according to partitioning key
> Partitioned SEs split into disjoint partitions
hash(msg.id)
Key space: [0-N], split into [0-k] and [(k+1)-N]
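
A partitioned SE can be pictured with a short Java sketch. Everything named here is hypothetical (`PartitionedState`, `partitionOf`, `putRow`, `getRow`), and routing by hash modulo the partition count is an assumption made for brevity; the slide itself shows a split of the key space into ranges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical partitioned state element: rows of the user-item matrix,
// split into disjoint partitions keyed by user id.
class PartitionedState {
    private final List<Map<Integer, int[]>> partitions = new ArrayList<>();

    PartitionedState(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Route a key to exactly one partition (assumed: hash modulo partition count).
    int partitionOf(int key) {
        return Math.floorMod(Integer.hashCode(key), partitions.size());
    }

    // Every access goes through the key, so it touches a single partition.
    void putRow(int user, int[] ratings) {
        partitions.get(partitionOf(user)).put(user, ratings);
    }

    int[] getRow(int user) {
        return partitions.get(partitionOf(user)).get(user);
    }

    int partitionSize(int partition) {
        return partitions.get(partition).size();
    }
}
```

Because each key maps to one partition, the task element that receives a message for a given user always finds that user's row in its local partition.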
![Page 28: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/28.jpg)
27
Distributed Mutable State: Partial SEs
Local access: Data sent to one
Global access: Data sent to all
> Partial SE gives nodes local state instances
> Partial SE access by TEs can be local or global
![Page 29: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/29.jpg)
28
Merging Distributed Mutable State
Merge logic
> Requires application-specific merge logic
> Reading all partial SE instances results in set of partial values
![Page 30: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/30.jpg)
29
Merging Distributed Mutable State
Multiple partial values
Merge logic
> Requires application-specific merge logic
> Reading all partial SE instances results in set of partial values
![Page 31: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/31.jpg)
30
Merging Distributed Mutable State
Multiple partial values
Collect partial values
Merge logic
> Requires application-specific merge logic
> Reading all partial SE instances results in set of partial values
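
As a rough illustration of global access followed by a merge, the sketch below reads every partial instance of a co-occurrence matrix, multiplies each by the user's row, and merges the partial vectors by element-wise sum. The names `PartialMerge`, `multiply`, `merge` and `globalRead` are hypothetical, and summation is only one possible application-specific merge; it fits here because co-occurrence counts are additive across instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: global read over partial state instances plus merge logic.
class PartialMerge {
    // One partial value: a local co-occurrence instance times the user's row.
    static int[] multiply(int[][] partialCoOcc, int[] userRow) {
        int[] out = new int[partialCoOcc.length];
        for (int i = 0; i < partialCoOcc.length; i++) {
            for (int j = 0; j < userRow.length; j++) {
                out[i] += partialCoOcc[i][j] * userRow[j];
            }
        }
        return out;
    }

    // Application-specific merge logic (assumed here: element-wise sum).
    static int[] merge(List<int[]> partialValues) {
        if (partialValues.isEmpty()) {
            return new int[0];
        }
        int[] merged = new int[partialValues.get(0).length];
        for (int[] partial : partialValues) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];
            }
        }
        return merged;
    }

    // Global access: send the request to all instances, collect, then merge.
    static int[] globalRead(List<int[][]> instances, int[] userRow) {
        List<int[]> partialValues = new ArrayList<>();
        for (int[][] instance : instances) {
            partialValues.add(multiply(instance, userRow));
        }
        return merge(partialValues);
    }
}
```

For two instances `[[1, 1], [1, 1]]` and `[[0, 0], [0, 1]]` and the user row `[0, 5]`, the partial values are `[5, 5]` and `[0, 5]`, which merge to `[5, 10]`.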
![Page 32: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/32.jpg)
31
Outline
> @Annotations
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
Program.java
![Page 33: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/33.jpg)
32
From Imperative Code to Execution
SEEP
Annotated program
> SEEP: data-parallel processing platform
• Translation occurs in two stages:
– Static code analysis: From Java to SDG
– Bytecode rewriting: From SDG to SEEP [SIGMOD’13]
Program.java
![Page 34: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/34.jpg)
Program.java
33
Extract TEs, SEs and accesses
Live variable analysis
TE and SE access code assembly
SEEP runnable
SOOT Framework
Javassist
> Extract state and state access patterns through static code analysis
> Generation of runnable code using TE and SE connections
Translation Process
![Page 35: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/35.jpg)
Program.java
34
Extract TEs, SEs and accesses
Live variable analysis
TE and SE access code assembly
SEEP runnable
SOOT Framework
Javassist
> Extract state and state access patterns through static code analysis
> Generation of runnable code using TE and SE connections
Translation Process
Annotated Program.java
![Page 36: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/36.jpg)
35

    @Partitioned Matrix userItem = new SeepMatrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

Partitioned State Annotation
> @Partitioned field annotation indicates partitioned state
hash(msg.id)
![Page 37: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/37.jpg)
36

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(@Global coOcc, userItem);
    }

Partial State and Global Annotations
> @Global annotates variable to indicate access to all partial instances
> @Partial field annotation indicates partial state
![Page 38: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/38.jpg)
37

    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        @Partial Vector puRec = @Global coOcc.multiply(userRow);
        Vector userRec = merge(puRec);
        return userRec;
    }

    Vector merge(@Collection Vector[] v) { /*…*/ }

Partial and Collection Annotations
> @Collection annotation indicates merge logic
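
To make the annotation idea concrete, here is a minimal sketch of how such markers could be declared and inspected in plain Java. The declarations below are assumptions for illustration, not the SEEP definitions: the retention policy, the targets, the class `AnnotatedRecommender` and the helper `stateKind` are all made up. Standard Java only accepts annotations on declarations and type uses, so the sketch covers fields, local variables and parameters and does not reproduce the slide's `@Global` on an expression.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical marker annotations mirroring the names on the slides.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Partitioned {}

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.LOCAL_VARIABLE})
@interface Partial {}

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.LOCAL_VARIABLE, ElementType.PARAMETER})
@interface Global {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface Collection {}

class AnnotatedRecommender {
    @Partitioned int[][] userItem = new int[2][2];
    @Partial int[][] coOcc = new int[2][2];

    // Merge logic receives the collected partial values (assumed: element-wise sum).
    static int[] merge(@Collection int[][] partialValues) {
        if (partialValues.length == 0) {
            return new int[0];
        }
        int[] merged = new int[partialValues[0].length];
        for (int[] partial : partialValues) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];
            }
        }
        return merged;
    }

    // Stand-in for a translator step: classify a field by its annotation.
    static String stateKind(String fieldName) {
        try {
            Field field = AnnotatedRecommender.class.getDeclaredField(fieldName);
            if (field.isAnnotationPresent(Partitioned.class)) return "partitioned";
            if (field.isAnnotationPresent(Partial.class)) return "partial";
            return "local";
        } catch (NoSuchFieldException e) {
            return "unknown";
        }
    }
}
```

The talk describes static analysis with Soot rather than runtime reflection; reflection is used here only because it keeps the example short and self-contained.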
![Page 39: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/39.jpg)
38
Outline
> Failures
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
Program.java
![Page 40: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/40.jpg)
39
Challenges of Making SDGs Fault Tolerant
Physical deployment of SDG
> Node failures may lead to state loss
> Task elements access local in-memory state
![Page 41: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/41.jpg)
40
Challenges of Making SDGs Fault Tolerant
RAM RAM
Physical deployment of SDG
> Node failures may lead to state loss
> Task elements access local in-memory state
Physical nodes
![Page 42: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/42.jpg)
41
RAM RAM
Physical deployment of SDG
> Node failures may lead to state loss
Checkpointing State
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact data processing path
> Task elements access local in-memory state
Physical nodes
Challenges of Making SDGs Fault Tolerant
![Page 43: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/43.jpg)
42
RAM RAM
Physical deployment of SDG
• Backups large and cannot be stored in memory
• Large writes to disk through network have high cost
State Backup
> Node failures may lead to state loss
Checkpointing State
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact data processing path
> Task elements access local in-memory state
Physical nodes
Challenges of Making SDGs Fault Tolerant
![Page 44: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/44.jpg)
43
Checkpoint Mechanism for Fault Tolerance
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile dirty state
Asynchronous, lock-free checkpointing
Dirty state
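
The three steps above can be sketched in Java as follows. This is a single-threaded illustration of the logic only: the class `CheckpointableState` and its methods are hypothetical, and the concurrency control that a genuinely asynchronous, lock-free implementation needs is deliberately left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of freeze / dirty state / reconcile.
class CheckpointableState {
    private final Map<String, Integer> state = new HashMap<>();
    private final Map<String, Integer> dirty = new HashMap<>();
    private boolean checkpointing = false;

    // Step 2: while a checkpoint is running, updates go to the dirty state.
    void put(String key, int value) {
        if (checkpointing) {
            dirty.put(key, value);
        } else {
            state.put(key, value);
        }
    }

    // Reads prefer the dirty state so that they see the freshest value.
    Integer get(String key) {
        if (checkpointing && dirty.containsKey(key)) {
            return dirty.get(key);
        }
        return state.get(key);
    }

    // Step 1: freeze the main state and hand out a read-only view to back up.
    Map<String, Integer> beginCheckpoint() {
        checkpointing = true;
        return Collections.unmodifiableMap(state);
    }

    // Step 3: reconcile the dirty state into the main state.
    void endCheckpoint() {
        state.putAll(dirty);
        dirty.clear();
        checkpointing = false;
    }
}
```

The frozen view never changes while the checkpoint is being written, yet processing continues because writes are absorbed by the dirty map. Deletions are not handled in this sketch.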
![Page 45: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/45.jpg)
44
Distributed M to N Checkpoint Backup
M to N distributed backup and parallel recovery
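
One way to picture distributed backup and parallel recovery is to split a checkpoint into chunks, store each chunk on a different backup node, and later hand the chunks to several recovering nodes. The sketch below is an assumption-laden illustration, not the SEEP mechanism: `ChunkedBackup`, `split` and `recover` are hypothetical names, and the round-robin assignment is chosen only for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of chunked backup and parallel recovery.
class ChunkedBackup {
    // Split one checkpoint into n chunks, one per backup node (round-robin by sorted key).
    static List<Map<String, Integer>> split(Map<String, Integer> checkpoint, int n) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            chunks.add(new HashMap<>());
        }
        int index = 0;
        for (Map.Entry<String, Integer> entry : new TreeMap<>(checkpoint).entrySet()) {
            chunks.get(index % n).put(entry.getKey(), entry.getValue());
            index++;
        }
        return chunks;
    }

    // Restore the chunks onto m recovering nodes; each node loads its chunks independently.
    static List<Map<String, Integer>> recover(List<Map<String, Integer>> chunks, int m) {
        List<Map<String, Integer>> nodes = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            nodes.add(new HashMap<>());
        }
        for (int c = 0; c < chunks.size(); c++) {
            nodes.get(c % m).putAll(chunks.get(c));
        }
        return nodes;
    }
}
```

Because no single node has to write or read the whole checkpoint, both backup and recovery can proceed in parallel across nodes.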
![Page 46: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/46.jpg)
45
Distributed M to N Checkpoint Backup
M to N distributed backup and parallel recovery
![Page 47: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/47.jpg)
46
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 48: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/48.jpg)
47
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 49: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/49.jpg)
48
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 50: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/50.jpg)
49
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 51: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/51.jpg)
50
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 52: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/52.jpg)
51
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 53: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/53.jpg)
52
M to N distributed backup and parallel recovery
Distributed M to N Checkpoint Backup
![Page 54: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/54.jpg)
How does mutable state impact performance?
How efficient are translated SDGs?
What is the throughput/latency trade-off?
Experimental set-up:
– Amazon EC2 (c1 and m1 xlarge instances)
– Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
– Sun Java 7, Ubuntu 12.04, Linux kernel 3.10
53
Evaluation of SDG Performance
![Page 55: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/55.jpg)
54
[Figure: throughput (1000 requests/s) and latency (ms) against workload (state read/write ratio: 1:5, 1:2, 1:1, 2:1, 5:1)]
Combines batch and online processing to serve fresh results over large mutable state
Processing with Large Mutable State
> addRating and getRec functions from recommender algorithm, while changing read/write ratio
![Page 56: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/56.jpg)
55
[Figure: throughput (GB/s) against number of nodes (25 to 100); series: SDG, Spark]
Translated SDG achieves performance similar to non-mutable dataflow
> Batch-oriented, iterative logistic regression
Efficiency of Translated SDG
![Page 57: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/57.jpg)
56
SDGs achieve high throughput while maintaining low latency
Latency/Throughput Tradeoff
> Streaming word count query, reporting counts over windows
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: SDG, Naiad-LowLatency]
![Page 58: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/58.jpg)
57
SDGs achieve high throughput while maintaining low latency
Latency/Throughput Tradeoff
> Streaming word count query, reporting counts over windows
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: SDG, Naiad-LowLatency]
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: Naiad-HighThroughput, SDG, Streaming Spark]
![Page 59: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/59.jpg)
58
SDGs achieve high throughput while maintaining low latency
Latency/Throughput Tradeoff
> Streaming word count query, reporting counts over windows
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: SDG, Naiad-LowLatency]
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: Naiad-HighThroughput, SDG, Streaming Spark]
[Figure: throughput (1000 requests/s) against window size (ms, 10 to 10000); series: Naiad-HighThroughput, SDG, Streaming Spark, Naiad-LowLatency]
![Page 60: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/60.jpg)
Running Java programs with the performance of current distributed dataflow frameworks
SDG: Stateful Dataflow Graphs
– Abstractions for distributed mutable state
– Annotations to disambiguate types of distributed state and state access
– Checkpoint-based fault tolerance mechanism
59
Summary
![Page 61: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/61.jpg)
Running Java programs with the performance of current distributed dataflow frameworks
SDG: Stateful Dataflow Graphs
– Abstractions for distributed mutable state
– Annotations to disambiguate types of distributed state and state access
– Checkpoint-based fault tolerance mechanism
60
Summary
Thank you! Any Questions?
@raulcfernandez [email protected]
https://github.com/lsds/Seep/
https://github.com/raulcf/SEEPng/
![Page 62: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/62.jpg)
BACKUP SLIDES
61
![Page 63: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/63.jpg)
62
[Figure: throughput (million requests/s) and latency (ms) against aggregated memory (GB, 50 to 200)]
Support large state without compromising throughput or latency while staying fault tolerant
Scalability on State Size and Throughput
> Increase state size in a mutated KV store
![Page 64: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/64.jpg)
63
Iteration in SDG
> Local iteration supported by one node
> Iteration across TEs requires cycle in the dataflow
![Page 65: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/65.jpg)
• Partition
• Partial
• Global
• Partial
• Collection
• Data annotations
– Batch
– Stream
64
Types of Annotations
![Page 66: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/66.jpg)
Overhead of SDG Fault Tolerance
65
[Figure: latency (ms) against state size (No FT, 1 to 5 GB)]
[Figure: latency (ms) against checkpoint frequency (2 to 10 s, No FT)]
Fault tolerance mechanism impact on performance and latency is small.
State size and checkpointing frequency do not affect the performance
![Page 67: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/67.jpg)
66
[Figure: throughput (10,000 requests/s) and latency (ms) against aggregated memory (MB, 10 to 2000); series: SDG, Naiad-NoDisk, Naiad-Disk, SDG (latency), Naiad-NoDisk (latency)]
Fault Tolerance Overhead
![Page 68: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/68.jpg)
[Figure: recovery time (s) against state size (1, 2, 4 GB); series: 1-to-1 recovery, 2-to-1 recovery, 1-to-2 recovery, 2-to-2 recovery]
67
Recovery Times
![Page 69: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/69.jpg)
68
[Figure: throughput (1000 requests/s) and number of nodes against time (s, 0 to 60); series: Throughput, Nodes]
Stragglers
![Page 70: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/70.jpg)
69
[Figure: throughput (1000 requests/s) and latency (s) against state size (1 to 4 GB); series: T'put (Sync), Latency (Sync), T'put (Async)]
Fault Tolerance: Sync. vs. Async.
![Page 71: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/71.jpg)
| System    | Large State | Mutable State | Low Latency | Iteration |
|-----------|-------------|---------------|-------------|-----------|
| MapReduce | n/a         | n/a           | No          | No        |
| Spark     | n/a         | n/a           | No          | Yes       |
| Storm     | n/a         | n/a           | Yes         | No        |
| Naiad     | No          | Yes           | Yes         | Yes       |
| SDG       | Yes         | Yes           | Yes         | Yes       |

70
Comparison to State-of-the-Art
SDGs are the first stateful fault-tolerant model, enabling execution of imperative code with explicit state
![Page 72: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/72.jpg)
71
Characteristics of SDGs
> Runtime Data Parallelism (elasticity): adaptation to varying workloads and mechanism against stragglers
> Support for Cyclic Graphs: efficiently represent iterative algorithms
> Low Latency: pipelining tasks decreases latency
![Page 73: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/73.jpg)
72
Bob Local Expert
Hi, I have a query to run on “Big Data”
Ok, cool, tell me about it
I want to know sales per employee on Saturdays
… well … ok, come in 3 days
Well, this is actually pretty urgent…
… 2 days, I’m pretty busy
2 Days After
Hi! You have the results?
Yes, here you have your sales last Saturday
My sales? I meant all employee sales, and not only last Saturday
Oops, sorry for that, give me 2 days…
![Page 74: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014](https://reader030.vdocuments.net/reader030/viewer/2022020307/55a1e1ac1a28ab21778b48ba/html5/thumbnails/74.jpg)
17TH ~ 18TH NOV 2014, MADRID (SPAIN)