gelly-stream: single-pass graph streaming analytics with apache flink

71
@GraphDevroom Single-pass Graph Stream Analytics with Apache Flink Vasia Kalavri <[email protected]> Paris Carbone <[email protected]> 1

Upload: vasia-kalavri

Post on 16-Apr-2017

1.603 views

Category:

Data & Analytics


4 download

TRANSCRIPT

Page 1: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Single-pass Graph Stream Analytics with Apache Flink

Vasia Kalavri <[email protected]> Paris Carbone <[email protected]>

1

Page 2: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Outline

• Why Graph Streaming?

• Single-Pass Algorithms Examples

• Apache Flink Streaming API

• The GellyStream API

Page 3: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Real Graphs are dynamic

Graphs are created from events happening in real-time

3

Page 4: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

4

Page 5: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Batch Graph Processing

5

We create and analyze a snapshot of the real graph

• the Facebook social network on January 30 2016

• user web logs gathered between March 1st 12:00 and 16:00

• retweets and replies for 24h after the announcement of the death of David Bowie

Page 6: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Streaming Graph Processing

We consume events in real-time

• Get results faster • No need to wait for the job to finish

• Sometimes, early approximations are better than late exact answers

• Get results continuously • Process unbounded number of events

6

Page 7: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Challenges

• Maintain the graph structure • How to apply state updates efficiently?

• Result updates • Re-run the analysis for each event? • Design an incremental algorithm? • Run separate instances on multiple snapshots?

• Computation on most recent events only

7

Page 8: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Single-Pass Graph Streaming

• Each event is an edge addition

• Maintains only a graph summary

• Recent events are grouped in graph windows

8

Page 9: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

Streaming Degrees Distribution#v

ertic

es

degree

9

Page 10: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

10

Page 11: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

11

Page 12: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

12

Page 13: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

13

Page 14: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

14

Page 15: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

15

Page 16: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

16

Page 17: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

17

Page 18: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

18

Page 19: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Graph Summaries

• spanners for distance estimation • sparsifiers for cut estimation • sketches for homomorphic properties

graph summary

algorithm algorithm~R1 R2

19

Page 20: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Window Aggregations

Neighborhood aggregations on windows

20

Page 21: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Examples

21

Page 22: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Batch Connected Components

• State: the graph and a component ID per vertex (initially equal to vertex ID)

• Iterative Computation: For each vertex:

• choose the min of neighbors’ component IDs and own component ID as new ID

• if component ID changed since last iteration, notify neighbors

22

Page 23: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

8

i=0

Batch Connected Components

23

Page 24: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

11

2

2

6

6

6

i=1

Batch Connected Components

24

Page 25: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

11

1

5

6

6

6

i=2

Batch Connected Components

25

1

Page 26: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

11

1

1

6

6

6

i=3

Batch Connected Components

26

Page 27: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Stream Connected Components

• State: a disjoint set data structure for the components

• Computation: For each edge

• if seen for the 1st time, create a component with ID the min of the vertex IDs

• if in different components, merge them and update the component ID to the min of the component IDs

• if only one of the endpoints belongs to a component, add the other one to the same component

27

Page 28: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

31

52

54

76

86

ComponentID Vertices

1

43

2

5

6

7

8

28

Page 29: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

31

52

54

76

86

42

ComponentID Vertices

1 1, 3

1

43

2

5

6

7

8

29

Page 30: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

31

52

54

76

86

42

ComponentID Vertices

43

2 2, 5

1 1, 3

1

43

2

5

6

7

8

30

Page 31: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

31

52

54

76

86

42

43

87

ComponentID Vertices

2 2, 4, 5

1 1, 3

1

43

2

5

6

7

8

31

Page 32: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

31

52

54

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7

1

43

2

5

6

7

8

32

Page 33: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

52

54

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

33

Page 34: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

54

76

86

42

43

87

41 ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

34

Page 35: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

35

Page 36: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

76

86

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

36

Page 37: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

86

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

37

Page 38: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

38

Page 39: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom Distributed Stream Connected

Components

39

Page 40: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Stream Bipartite Detection

Similar to connected components, but

• Each vertex is also assigned a sign, (+) or (-)

• Edge endpoints must have different signs

• When merging components, if flipping all signs doesn’t work => the graph is not bipartite

40

Page 41: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

41

Stream Bipartite Detection

Page 42: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

3 5

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

42

Stream Bipartite Detection

Page 43: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

3 5

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

43

Stream Bipartite Detection

Page 44: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Cid=1

1

43

2

5

6

7

(+) (-)

(-)(+)

(+) (-)

(-)

3 5

44

Stream Bipartite Detection

Page 45: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

3 7

Cid=1

1

43

2

5

6

7

(+) (-)

(-)(+)

(+) (-)

(-)Can’t flip signs and stay consistent

=> not bipartite!

45

Stream Bipartite Detection

Page 46: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

API Requirements

• Continuous aggregations on edge streams

• Global graph aggregations

• Support for windowing

Page 47: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

47

The Apache Flink Stack

APIs

Execution

DataStreamDataSet

Distributed Dataflow

Deployment

• Bounded Data Sources • Blocking Operations • Structured Iterations

• Unbounded Data Sources • Continuous Operations • Asynchronous Iterations

Page 48: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Unifying Data Processing

Job Manager• scheduling tasks • monitoring/recovery

Client

• task pipelining • blocking

• execution plan building • optimisation

48

DataStreamDataSet

Distributed Dataflow

Deployment

HDFS

Kafka

DataSet<String> text = env.readTextFile(“hdfs://…”); text.map(…).groupReduce(…)…

DataStream<String> events = env.addSource(new KafkaConsumer(…)); events.map(…).filter(…).window(…).fold(…)…

Page 49: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Data Streams as ADTs

• Tasks are long running in a pipelined execution.

• State is kept within tasks.

• Transformations are applied per-record or window.

• Transformations: map, flatmap, filter, union…

• Aggregations: reduce, fold, sum

• Partitioning: forward, broadcast, shuffle, keyBy

• Sources/Sinks: custom or Kafka, Twitter, Collections…

49

DataStream

Page 50: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Working with Windows

50

Why windows? We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

#sec40 80

SUM #2

0

SUM #1

20 60 100

#sec40 80

SUM #3

SUM #2

0

SUM #1

20 60 100

120

15 38 65 88

15 38

38 65

65 88

15 38 65 88

110 120

myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS));

1) Sliding windows

2) Tumbling windowsmyKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS));

window buckets/panes

Page 51: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Example

51

myTextStream .flatMap(new Splitter()) //transformation .keyBy(0) //partitioning

.window(Time.of(5, TimeUnit.MINUTES)) .sum(1) //rolling aggregation

.setParallelism(4); counts.print();

10:48 - “cool, gelly is cool”

printwindow sumflatMap

11:01 - “dataflow is cool too”

<“gelly”,1>… <“cool”,2>

<“dataflow”,1>… <“cool”,1>

Page 52: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Gelly on Streams

52

DataStreamDataSet

Distributed Dataflow

Deployment

Gelly Gelly-Stream

• Static Graphs • Multi-Pass Algorithms • Full Computations

• Dynamic Graphs • Single-Pass Algorithms • Approximate Computations

DataStream

Page 53: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Introducing Gelly-Stream

53

Gelly-Stream enriches the DataStream API with two new additional ADTs:

• GraphStream: • A representation of a data stream of edges.

• Edges can have state (e.g. weights).

• Supports property streams, transformations and aggregations.

• GraphWindow: • A “time-slice” of a graph stream.

• It enables neighbourhood aggregations

Page 54: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

GraphStream Operations

54

.getEdges()

.getVertices()

.numberOfVertices()

.numberOfEdges()

.getDegrees()

.inDegrees()

.outDegrees()

GraphStream -> DataStream

.mapEdges();

.distinct();

.filterVertices();

.filterEdges();

.reverse();

.undirected();

.union();

GraphStream -> GraphStream

Property Streams Transformations

Page 55: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Graph Stream Aggregations

55

result aggregate

property streamgraph stream

(window) fold

combine

fold

reduce

local summaries

global summary

edges

agg

global aggregates can be persistent or transient

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))

Page 56: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Graph Stream Aggregations

56

result aggregate

property streamgraph stream

(window) fold

combine transform

fold

reduce map

local summaries

global summary

edges

agg

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))

Page 57: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

Connected Components

57

graph stream

31

52

1

43

2

5

6

7

8

#components

Page 58: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

58

graph stream

{1,3}

{2,5}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 59: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

59

graph stream

{1,3}

{2,5}

54

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 60: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

60

graph stream

{1,3}

{2,5}

{4,5}76

86

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 61: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

61

graph stream

{1,3}

{2,5}

{4,5}

{6,7}

{6,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

windowtriggers

Page 62: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

62

graph stream

{2,5}{6,8}

{1,3}{4,5}

{6,7}

3

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 63: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

63

graph stream

{1,3}{2,4,5}

{6,7,8}

3

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 64: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

64

graph stream

{1,3}{2,4,5}

{6,7,8}42

43

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 65: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

65

graph stream

{1,3}{2,4,5}

{6,7,8}{2,4}

{3,4}

41

87

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 66: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

66

graph stream

{1,3}{2,4,5}

{6,7,8}{1,2,4}

{3,4}{7,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

windowtriggers

Page 67: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

67

graph stream

{1,2,4,5}{6,7,8}

2

{3,4}{7,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 68: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Connected Components

68

graph stream

{1,2,3,4,5}{6,7,8}

2

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

Page 69: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Aggregating Slices

69

graphStream.slice(Time.of(1, MINUTE), direction)

.reduceOnEdges();

.foldNeighbors();

.applyOnNeighbors();

• Slicing collocates edges by vertex information

• Neighbourhood aggregations are now enabled on sliced graphs

source

target

Aggregations

Page 70: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Finding Matches Nearby

70

graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs())

slice

GraphStream :: graph geek check-ins

wendy checked_in soap_bar steve checked_in soap_bar

tom checked_in joe’s_grill sandra checked_in soap_bar

rafa checked_in joe’s_grill

wendy

steve

sandra

soapbar

tom

rafa

joe’sgrill

FindPairs

{wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}

GraphWindow :: user-place

Page 71: Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink

@GraphDevroom

Feeling Gelly?• Gelly-Stream: https://github.com/vasia/gelly-streaming

• Apache Flink: https://github.com/apache/flink

• An interesting read: http://users.dcc.uchile.cl/~pbarcelo/mcg.pdf

• A cool thesis: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170425

• Twitter: @vkalavri , @senorcarbone