Giraph at Hadoop Summit 2014

Apache Giraph: Large-scale Graph Processing on Hadoop. Claudio Martella, @claudiomartella. Hadoop Summit @ Amsterdam, 3 April 2014

Post on 22-Sep-2014

DESCRIPTION

Presentation of Apache Giraph at Hadoop Summit 2014 Amsterdam

TRANSCRIPT

Page 1: Giraph at Hadoop Summit 2014

Apache Giraph: Large-scale Graph Processing on Hadoop

Claudio Martella, @claudiomartella

Hadoop Summit @ Amsterdam - 3 April 2014

Page 2: Giraph at Hadoop Summit 2014

2

Page 3: Giraph at Hadoop Summit 2014

Graphs are simple

3

Page 4: Giraph at Hadoop Summit 2014

A computer network

4

Page 5: Giraph at Hadoop Summit 2014

A social network

5

Page 6: Giraph at Hadoop Summit 2014

A semantic network

6

Page 7: Giraph at Hadoop Summit 2014

A map

7

Page 8: Giraph at Hadoop Summit 2014

Graphs are huge

•Google’s index contains 50B pages

•Facebook has around 1.1B users

•Google+ has around 570M users

•Twitter has around 530M users

VERY rough estimates!

8

Page 9: Giraph at Hadoop Summit 2014

9

Page 10: Giraph at Hadoop Summit 2014

Graphs aren’t easy

10

Page 11: Giraph at Hadoop Summit 2014

Graphs are nasty.

11

Page 12: Giraph at Hadoop Summit 2014

Each vertex depends on its neighbours, recursively.

12

Page 13: Giraph at Hadoop Summit 2014

Recursive problems are nicely solved iteratively.

13

Page 14: Giraph at Hadoop Summit 2014

PageRank in MapReduce

•Record: < v_i, pr, [ v_j, ..., v_k ] >

•Mapper: emits < v_j, pr / #neighbours >

•Reducer: sums the partial values
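
The three bullets above can be sketched in plain Python. This is a minimal simulation of one map, shuffle, and reduce pass, not Hadoop code: the function names (`mapper`, `reducer`, `pagerank_iteration`) and the toy graph are assumptions made for illustration, and, like the slide, it sums the partial values without a damping factor.

```python
from collections import defaultdict


def mapper(vertex, rank, neighbours):
    """Emit the adjacency list plus one rank share per out-neighbour."""
    # The structure has to be re-emitted, or the next iteration loses the graph.
    yield vertex, ("structure", neighbours)
    for neighbour in neighbours:
        yield neighbour, ("rank", rank / len(neighbours))


def reducer(vertex, values):
    """Sum the partial ranks and rebuild the record < v_i, pr, [v_j, ...] >."""
    neighbours, rank = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            rank += payload
    return vertex, rank, neighbours


def pagerank_iteration(records):
    """Run one map, shuffle, and reduce pass over all records."""
    shuffled = defaultdict(list)
    for vertex, rank, neighbours in records:
        for key, value in mapper(vertex, rank, neighbours):
            shuffled[key].append(value)
    return [reducer(key, values) for key, values in sorted(shuffled.items())]


# Hypothetical three-vertex graph with uniform initial ranks.
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):  # on Hadoop, each pass would be a separate job
    records = pagerank_iteration(records)
```

Note that every pass ships both the rank shares and the full adjacency lists through the shuffle.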

14

Page 15: Giraph at Hadoop Summit 2014

MapReduce dataflow

15

Page 16: Giraph at Hadoop Summit 2014

Drawbacks

•Each job is executed N times

•Job bootstrap

•Mappers send PR values and structure

•Extensive IO at input, shuffle & sort, output

16

Page 17: Giraph at Hadoop Summit 2014

17

Page 18: Giraph at Hadoop Summit 2014

Timeline

•Inspired by Google Pregel (2010)

•Donated to ASF by Yahoo! in 2011

•Top-level project in 2012

•1.0 release in January 2013

•1.1 release due within days (2014)

18

Page 19: Giraph at Hadoop Summit 2014

Plays well with Hadoop

19

Page 20: Giraph at Hadoop Summit 2014

Vertex-centric API

20

Page 21: Giraph at Hadoop Summit 2014

BSP machine

21

Page 22: Giraph at Hadoop Summit 2014

BSP & Giraph

22

Page 23: Giraph at Hadoop Summit 2014

Advantages

•No locks: message-based communication

•No semaphores: global synchronization

•Iteration isolation: massively parallelizable
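
The BSP model behind these advantages can be sketched as a small single-process loop. This is an illustrative simulation under stated assumptions, not Giraph's implementation: `run_bsp`, `propagate_max`, and the `compute` signature are hypothetical names, and the barrier is simply the point where the outbox becomes the next inbox. The example propagates the maximum value through a small graph.

```python
def run_bsp(values, edges, compute, max_supersteps=50):
    """Run supersteps until all vertices have halted and no messages remain."""
    inbox = {v: [] for v in values}
    active = set(values)
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue  # a halted vertex sleeps until a message arrives
            active.add(v)
            # compute() sees only its own value and last superstep's messages,
            # so vertices never need locks on each other's state.
            new_value, outgoing, halt = compute(superstep, values[v], messages, edges[v])
            values[v] = new_value
            for target, message in outgoing:
                outbox[target].append(message)
            if halt:
                active.discard(v)
        inbox = outbox  # global barrier: messages become visible next superstep
        superstep += 1
    return values, superstep


def propagate_max(superstep, value, messages, neighbours):
    """Adopt the largest value seen so far and forward it when it changes."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, [(n, best) for n in neighbours], True
    return best, [], True  # nothing new: vote to halt without sending


# Hypothetical four-vertex chain 1 - 2 - 3 - 4.
values = {1: 3, 2: 6, 3: 2, 4: 1}
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
final_values, supersteps = run_bsp(values, edges, propagate_max)
```

Because messages are only delivered at the barrier, the per-vertex calls inside one superstep are independent and could run in parallel across workers.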

23

Page 24: Giraph at Hadoop Summit 2014

Architecture

24

Page 25: Giraph at Hadoop Summit 2014

Giraph job lifetime

25

Page 26: Giraph at Hadoop Summit 2014

Designed for iterations

•Stateful (in-memory)

•Only intermediate values (messages) sent

•Hits the disk at input, output, checkpoint

•Can go out-of-core
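
To contrast with the MapReduce PageRank from earlier in the deck, here is a minimal sketch of the same computation in an iteration-friendly style. It is plain Python, not the Giraph API; `pagerank_supersteps` and the toy graph are assumptions for illustration. The adjacency lists stay in memory for the whole run, and only the rank shares (the messages) move between supersteps.

```python
def pagerank_supersteps(edges, supersteps):
    """Iterate PageRank with in-memory state; only rank shares are exchanged."""
    ranks = {v: 1 / len(edges) for v in edges}  # loaded once, kept in memory
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in edges}
        for vertex, neighbours in edges.items():
            for neighbour in neighbours:
                # The message carries a rank share, never the graph structure.
                inbox[neighbour] += ranks[vertex] / len(neighbours)
        ranks = inbox
    return ranks


# Hypothetical three-vertex graph, as adjacency lists.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(edges, 20)
```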

26

Page 27: Giraph at Hadoop Summit 2014

A bunch of other things

•Combiners (minimise messages)

•Aggregators (global aggregations)

•MasterCompute (executed on master)

•WorkerContext (executed per worker)

•PartitionContext (executed per partition)
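
The first two items can be illustrated with a small Python sketch. These are not Giraph's classes: `combine_min` and `SumAggregator` are hypothetical names that only show the idea, namely that a combiner merges messages bound for the same vertex before they are sent, and an aggregator folds per-worker partial values into one global value.

```python
def combine_min(outgoing):
    """Collapse all messages bound for the same vertex into their minimum."""
    combined = {}
    for target, message in outgoing:
        combined[target] = min(message, combined.get(target, message))
    return combined


class SumAggregator:
    """Fold partial values from many workers into one global sum."""

    def __init__(self):
        self.value = 0

    def aggregate(self, partial):
        self.value += partial


# Four outgoing messages shrink to one per target vertex.
outgoing = [("c", 7), ("c", 3), ("d", 5), ("c", 9)]
combined = combine_min(outgoing)

# Hypothetical per-worker counts merged into a global total.
total = SumAggregator()
for partial in [4, 2, 5]:
    total.aggregate(partial)
```

A minimum combiner is only safe when the receiving vertex cares about nothing but the smallest message, as in shortest paths.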

27

Page 28: Giraph at Hadoop Summit 2014

Shortest Paths

28

Page 29: Giraph at Hadoop Summit 2014

Shortest Paths

29

Page 30: Giraph at Hadoop Summit 2014

Shortest Paths

30

Page 31: Giraph at Hadoop Summit 2014

Shortest Paths

31

Page 32: Giraph at Hadoop Summit 2014

Shortest Paths
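
The figures for these slides did not survive in the transcript, so here is a minimal sketch of single-source shortest paths in the vertex-centric style. It is a plain Python simulation, not the Giraph API: `shortest_paths` and the toy graph are assumptions, and seeding the source with a message of 0 is a simplification chosen for brevity. Each vertex keeps the smallest distance it has seen and, when that improves, sends the new distance plus the edge weight to its out-neighbours.

```python
import math


def shortest_paths(edges, source):
    """Superstep loop: a vertex relaxes its distance from incoming messages."""
    distance = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]  # assumption: the source is seeded with a zero message
    while any(inbox.values()):
        outbox = {v: [] for v in edges}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            candidate = min(messages)
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                for target, weight in edges[vertex]:
                    outbox[target].append(candidate + weight)
        inbox = outbox  # barrier between supersteps
    return distance


# Hypothetical weighted, directed graph: vertex -> [(target, weight), ...].
edges = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
distances = shortest_paths(edges, "a")
```

The loop stops once a superstep produces no messages, which assumes non-negative edge weights.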

32

Page 33: Giraph at Hadoop Summit 2014

Composable API

33

Page 34: Giraph at Hadoop Summit 2014

Checkpointing

34

Page 35: Giraph at Hadoop Summit 2014

No SPoFs

35

Page 37: Giraph at Hadoop Summit 2014

Giraph is fast

• 100x over MapReduce (PageRank)

• jobs run within minutes

• given you have resources ;-)

37

Page 38: Giraph at Hadoop Summit 2014

Serialised objects

38

Page 39: Giraph at Hadoop Summit 2014

Primitive types

•Autoboxing is expensive

•Objects overhead (JVM)

•Use primitive types on your own

•Use primitive-type-based libraries (e.g. fastutil)

39

Page 40: Giraph at Hadoop Summit 2014

Sharded aggregators

40

Page 41: Giraph at Hadoop Summit 2014

Many stores with Gora

41

Page 42: Giraph at Hadoop Summit 2014

And graph databases

42

Page 43: Giraph at Hadoop Summit 2014

Current and next steps

•Out-of-core graph and messages

•Jython interface

•Remove Writable from < I V E M >

•Partitioned supernodes

•More documentation

43

Page 44: Giraph at Hadoop Summit 2014

Giraph in Action

• Published by Manning

• MEAP now

• Complete Q3 2014 (well...)

• Part 1: Graphs and Algorithms

• Part 2: Giraph Basic Topics

• Part 3: Giraph Advanced Topics

• http://www.manning.com/martella

44

Page 45: Giraph at Hadoop Summit 2014

Okapi

• Apache Mahout for graphs

•Graph-based recommenders: ALS, SGD, SVD++, etc.

•Graph analytics: Graph partitioning, Community Detection, K-Core, etc.

45

Page 46: Giraph at Hadoop Summit 2014

Thank you

Claudio Martella, @claudiomartella

http://giraph.apache.org

Some figures gently borrowed from Nitay Joffe: http://www.slideshare.net/nitayj/20130910-giraph-at-london-hadoop-users-group