GRAPH-BASED DATA INTEGRATION AND ANALYSIS FOR BIG DATA

ERHARD RAHM

www.scads.de

Posted on 21-Feb-2019

TRANSCRIPT

Page 1: Title slide (title, author and URL as above)

Page 2: HOW BIG IS BIG DATA?

"Big" is changing quickly: Gigabytes, Terabytes (10^12), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21), Yottabytes (10^24), Brontobytes (10^27), …

By 2020, about 40 ZB of data will be generated.

[Figure: data growth (in exabytes). Source: IDC]

Page 3: DATA CENTER

[Image. Source: Google Inc.]

Page 4: BIG DATA CHALLENGES

• Volume: petabytes / exabytes of data
• Velocity: fast analysis of data streams
• Variety: heterogeneous data of different kinds
• Veracity: high data quality
• Value: useful analysis results

Page 5: BIG DATA ANALYSIS PIPELINE

Pipeline stages:
1. Data acquisition
2. Data extraction / cleaning
3. Data integration / annotation
4. Data analysis and visualization
5. Interpretation

Challenges shown alongside the pipeline: Variety, Volume, Velocity, Veracity, Privacy

Page 6: "GRAPHS ARE EVERYWHERE"

• Facebook: ca. 1.3 billion users, ca. 340 friends per user
• Twitter: ca. 300 million users, ca. 500 million tweets per day
• Internet: ca. 2.9 billion users
• Genes (human): 20,000-25,000, ca. 4 million individuals
• Patients: > 18 million (Germany)
• Illnesses: > 30,000
• World Wide Web: ca. 1 billion websites
• LOD Cloud: ca. 90 billion triples

Domains: social science, engineering, life science, information science

Page 7: "GRAPHS ARE EVERYWHERE"

[Formula fragment: a pair ( … , … ); notation lost in extraction]

Page 8: "GRAPHS ARE EVERYWHERE"

[Figure: example graph with vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent]

Page 9: "GRAPHS ARE EVERYWHERE"

[Figure: example graph with vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent]

Page 10: "GRAPHS ARE HETEROGENEOUS"

[Figure: example graph with vertices Alice, Bob, Dave, Carol, Mallory, Peggy and the bands AC/DC and Metallica]

[Formula fragment: ( … ∪ … , … ∪ … ); operands lost in extraction]

Page 11: "GRAPHS CAN BE ANALYZED"

[Figure: the heterogeneous example graph (Alice, Bob, Dave, Carol, Mallory, Peggy, AC/DC, Metallica) annotated with numeric values 0.2, 0.28, 0.26, 0.33, 0.25, 0.26; further values are garbled in extraction ("3.62.82")]

[Formula fragment: ( … ∪ … , … ∪ … ); operands lost in extraction]

Page 12: "GRAPHS CAN BE ANALYZED"

Assuming a social network:
1. Determine subgraph
2. Find communities
3. Filter communities
4. Find common subgraph

Page 13: GRAPH DATA ANALYTICS: REQUIREMENTS

• all "V" challenges (volume, variety, velocity, veracity)
• ease of use
• high cost-effectiveness
• powerful but easy-to-use graph data model
  - support for heterogeneous, schema-flexible vertices and edges
  - support for collections of graphs (not only one graph)
  - powerful graph operators
• graph-based integration of many data sources
• versioning and evolution (dynamic / temporal graphs)
• interactive, declarative graph queries
• scalable graph mining
• comprehensive visualization support

Page 14: COMPARISON

Criterion | Graph database systems (Neo4j, OrientDB)
data model | rich graph models (PGM)
focus | queries
query language | yes
graph analytics | no
scalability | vertical
workflows | no
persistency | yes
dynamic graphs / versioning | no
data integration | no
visualization | (yes)

Page 15: COMPARISON (2)

Criterion | Graph database systems (Neo4j, OrientDB) | Graph processing systems (Pregel, Giraph)
data model | rich graph models (PGM) | generic graph models
focus | queries | analytic
query language | yes | no
graph analytics | no | yes
scalability | vertical | horizontal
workflows | no | no
persistency | yes | no
dynamic graphs / versioning | no | no
data integration | no | no
visualization | (yes) | no

Page 16: COMPARISON (3)

Criterion | Graph database systems (Neo4j, OrientDB) | Graph processing systems (Pregel, Giraph) | Distributed dataflow systems (Flink Gelly, Spark GraphX)
data model | rich graph models (PGM) | generic graph models | generic graph models
focus | queries | analytic | analytic
query language | yes | no | no
graph analytics | no | yes | yes
scalability | vertical | horizontal | horizontal
workflows | no | no | yes
persistency | yes | no | no
dynamic graphs / versioning | no | no | no
data integration | no | no | no
visualization | (yes) | no | no

Page 17: WHAT'S MISSING?

An end-to-end framework and research platform for efficient, distributed and domain-independent graph data management and analytics.

Page 18:

[Figure: systems positioned by ease of use against data volume and problem complexity: graph databases, graph processing systems, graph dataflow systems (Gelly)]

Page 19: AGENDA

• Intro Graph Analytics
  - Graph data
  - Requirements
  - Graph database vs. graph processing systems
• Gradoop architecture and data integration
• Extended Property Graph Model (EPGM)
  - Data organization and operators
  - Implementation
• Performance Evaluation
• Summary/Outlook

Page 20: GRADOOP CHARACTERISTICS

• Hadoop-based framework for graph data management and analysis
  - persistent graph storage in a scalable distributed store (HBase)
  - utilization of a powerful dataflow system (Apache Flink) for parallel, in-memory processing
• Extended Property Graph Model (EPGM)
  - operators on graphs and sets of (sub)graphs
  - support for semantic graph queries and mining
• Declarative specification of graph analysis workflows
  - Graph Analytical Language (GrALa)
• End-to-end functionality
  - graph-based data integration, data analysis and visualization
• Open-source implementation: www.gradoop.org

Page 21: END-TO-END GRAPH ANALYTICS

• Data Integration: integrate data from one or more sources into a dedicated graph store with a common graph data model
• Graph Analytics: definition of analytical workflows from operator algebra
• Visualization: result representation in a meaningful way

Page 22: HIGH LEVEL ARCHITECTURE

[Figure: layered architecture with data flow and control flow arrows. Components: HDFS/YARN cluster; HBase distributed graph store; Extended Property Graph Model; Flink operator implementations; Flink operator execution; workflow declaration (visual, GrALa DSL); data integration; graph analytics; representation]

Page 23: GRAPH-BASED BUSINESS INTELLIGENCE

BIIIG: Business Intelligence on Integrated Instance Graphs

• heterogeneous data sources are integrated within an instance graph by preserving original relationships between data objects (transactional and master data)
• largely automated extraction of metadata and instance data and transformation into graphs
• fusion of matching entities and relations
• extraction of subgraphs (business transaction graphs) related to interrelated business activities
• analysis of graphs/subgraphs with aggregation queries, pattern mining, etc.

Page 24: DATA INTEGRATION AND ANALYSIS WORKFLOW

"Business Intelligence on Integrated Instance Graphs (BIIIG)" (PVLDB 2014)

[Figure: workflow from data sources via (1) graph transformation, (2) graph integration and (3) subgraph isolation. Labeled artifacts: metadata, Integrated Instance Graph, Business Transaction Graphs. Labeled actor: domain expert]

Page 25: SAMPLE INSTANCE GRAPH

[Figure]

Page 26: SCREENSHOT OF NEO4J IMPLEMENTATION

[Figure]


Page 28: EXTENDED PROPERTY GRAPH MODEL (EPGM)

• includes PGM as a special case
• support for collections of logical graphs / subgraphs
  - can be defined explicitly
  - can be the result of graph algorithms / operators
• support for graph properties
• powerful operators on both graphs and graph collections
• Graph Analytical Language (GrALa)
  - domain-specific language (DSL) for EPGM
  - flexible use of operators with application-specific UDFs
  - plugin concept for graph mining algorithms

Page 29:

• Vertices and directed edges

Page 30:

• Vertices and directed edges
• Logical graphs

Page 31:

• Vertices and directed edges
• Logical graphs
• Identifiers

[Figure: example graph with vertex identifiers 1-5, edge identifiers 1-5 and logical graph identifiers 1 and 2]

Page 32:

• Vertices and directed edges
• Logical graphs
• Identifiers
• Type labels

[Figure: example graph with vertex labels Person (3x) and Band (2x), edge labels likes (4x) and knows (1x), and logical graphs 1|Community and 2|Community]

Page 33:

• Vertices and directed edges
• Logical graphs
• Identifiers
• Type labels
• Properties

[Figure: example graph]
Vertices: Person {name: Alice, born: 1984}; Band {name: Metallica, founded: 1981}; Person {name: Bob}; Person {name: Eve}; Band {name: AC/DC, founded: 1973}
Edges: likes {since: 2014}; likes {since: 2013}; likes {since: 2015}; knows; likes {since: 2014}
Logical graphs: 1 | Community | interest: Heavy Metal; 2 | Community | interest: Hard Rock

Page 34: Operators

Page 35: Operators

Operand | Unary operators | Binary operators
Logical graph | Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call* | Combination, Overlap, Exclusion, Equality
Graph collection | Selection, Distinct, Sort, Limit, Apply*, Reduce*, Call* | Union, Intersection, Difference, Equality

Algorithms: Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning

* auxiliary

Page 36: BASIC BINARY OPERATORS

Combination, Overlap, Exclusion

LogicalGraph graph3 = graph1.combine(graph2);
LogicalGraph graph4 = graph1.overlap(graph2);
LogicalGraph graph5 = graph1.exclude(graph2);

[Figure: input logical graphs 1 and 2 over vertices 1-5 and the result graphs 3, 4 and 5]
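The three binary operators can be sketched on a single machine with plain Java sets. This is only an illustration under stated assumptions, not the Gradoop API: the class BinaryOperatorsSketch and its records are hypothetical, graphs are reduced to vertex ids and edges, and the sample data is the two-community example graph shown later in the slides (page 48). Exclusion is assumed to drop edges that lose an endpoint, which is what the Flink implementation on page 49 suggests.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not the Gradoop API.
final class BinaryOperatorsSketch {

    record Edge(int id, int source, int target) {}

    record Graph(Set<Integer> vertices, Set<Edge> edges) {}

    /** Combination: union of the vertex sets and of the edge sets. */
    static Graph combine(Graph a, Graph b) {
        Set<Integer> vertices = new LinkedHashSet<>(a.vertices());
        vertices.addAll(b.vertices());
        Set<Edge> edges = new LinkedHashSet<>(a.edges());
        edges.addAll(b.edges());
        return new Graph(vertices, edges);
    }

    /** Overlap: vertices and edges contained in both graphs. */
    static Graph overlap(Graph a, Graph b) {
        Set<Integer> vertices = new LinkedHashSet<>(a.vertices());
        vertices.retainAll(b.vertices());
        Set<Edge> edges = new LinkedHashSet<>(a.edges());
        edges.retainAll(b.edges());
        return new Graph(vertices, edges);
    }

    /** Exclusion: elements of a that are not in b; edges losing an endpoint are dropped. */
    static Graph exclude(Graph a, Graph b) {
        Set<Integer> vertices = new LinkedHashSet<>(a.vertices());
        vertices.removeAll(b.vertices());
        Set<Edge> edges = new LinkedHashSet<>();
        for (Edge e : a.edges()) {
            boolean endpointsRemain =
                vertices.contains(e.source()) && vertices.contains(e.target());
            if (!b.edges().contains(e) && endpointsRemain) {
                edges.add(e);
            }
        }
        return new Graph(vertices, edges);
    }

    /** Logical graph 1 of the example: Alice, Metallica, Bob. */
    static Graph community1() {
        return new Graph(Set.of(1, 2, 3),
            Set.of(new Edge(1, 1, 2), new Edge(2, 3, 2)));
    }

    /** Logical graph 2 of the example: Bob, AC/DC, Eve. */
    static Graph community2() {
        return new Graph(Set.of(3, 4, 5),
            Set.of(new Edge(3, 3, 4), new Edge(4, 3, 5), new Edge(5, 5, 4)));
    }

    public static void main(String[] args) {
        Graph g1 = community1();
        Graph g2 = community2();
        System.out.println("combine: " + combine(g1, g2));
        System.out.println("overlap: " + overlap(g1, g2));
        System.out.println("exclude: " + exclude(g1, g2));
    }
}
```

On this data the overlap contains only vertex 3 (Bob), and the exclusion keeps vertices 1 and 2 with edge 1; edge 2 is dropped because its source, vertex 3, also belongs to the second graph.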

Page 37: AGGREGATION

udf = (graph => graph['vertexCount'] = graph.vertices.size())
graph3 = graph3.aggregate(udf)

[Figure: graph 3 before and after applying the UDF; the result carries the graph property vertexCount: 5]

Page 38: SUBGRAPH

LogicalGraph graph4 = graph3.subgraph((vertex => vertex[:label] == 'green'))
LogicalGraph graph5 = graph3.subgraph((edge => edge[:label] == 'blue'))
LogicalGraph graph6 = graph3.subgraph(
  (vertex => vertex[:label] == 'green'),
  (edge => edge[:label] == 'orange'))

[Figure: graph 3 and the subgraphs 4, 5 and 6 produced by the three UDF variants]
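A minimal sketch of the two-predicate subgraph variant, assuming (this is an assumption of the sketch, not a statement about Gradoop's semantics) that an edge is kept only if it passes the edge predicate and both of its endpoints pass the vertex predicate. All names (SubgraphSketch, Vertex, Edge, Graph) are hypothetical; the data is the Person/Band example graph from the slides.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Gradoop API.
final class SubgraphSketch {

    record Vertex(int id, String label) {}

    record Edge(int id, String label, int source, int target) {}

    record Graph(List<Vertex> vertices, List<Edge> edges) {}

    /** Keeps vertices passing vertexFilter and edges passing edgeFilter whose endpoints both remain. */
    static Graph subgraph(Graph g, Predicate<Vertex> vertexFilter, Predicate<Edge> edgeFilter) {
        List<Vertex> keptVertices = new ArrayList<>();
        Set<Integer> keptIds = new HashSet<>();
        for (Vertex v : g.vertices()) {
            if (vertexFilter.test(v)) {
                keptVertices.add(v);
                keptIds.add(v.id());
            }
        }
        List<Edge> keptEdges = new ArrayList<>();
        for (Edge e : g.edges()) {
            if (edgeFilter.test(e)
                    && keptIds.contains(e.source())
                    && keptIds.contains(e.target())) {
                keptEdges.add(e);
            }
        }
        return new Graph(keptVertices, keptEdges);
    }

    /** The Person/Band example graph from the slides. */
    static Graph example() {
        return new Graph(
            List.of(new Vertex(1, "Person"), new Vertex(2, "Band"), new Vertex(3, "Person"),
                    new Vertex(4, "Band"), new Vertex(5, "Person")),
            List.of(new Edge(1, "likes", 1, 2), new Edge(2, "likes", 3, 2),
                    new Edge(3, "likes", 3, 4), new Edge(4, "knows", 3, 5),
                    new Edge(5, "likes", 5, 4)));
    }

    public static void main(String[] args) {
        Graph persons = subgraph(example(),
            v -> v.label().equals("Person"),
            e -> e.label().equals("knows"));
        System.out.println(persons);
    }
}
```

The call in main mirrors the personGraph extraction used later in the deck: it keeps the three Person vertices and the single knows edge between Bob and Eve.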

Page 39: PATTERN MATCHING

GraphCollection collection = graph3.match("(:Green)-[:orange]->(:Orange)");

[Figure: graph 3, the pattern, and the resulting graph collection with graphs 4 and 5]

New: support of the Cypher query language for pattern matching*

q = "MATCH (p1:Person)-[e:knows*1..3]->(p2:Person) WHERE p1.gender <> p2.gender RETURN *"

GraphCollection matches = g.cypher(q)

* Junghanns et al.: Cypher-based Graph Pattern Matching in Gradoop. Proc. GRADES 2017
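To make the idea of a pattern concrete, the following sketch matches only the simplest case, a single-edge pattern of the form (:SourceLabel)-[:edgeLabel]->(:TargetLabel), by scanning the edge list. It is not Gradoop's matching engine and does not cover Cypher features such as variable-length paths or WHERE predicates; the class EdgePatternSketch and its methods are hypothetical, and the data is the Person/Band example graph from the slides. Each returned edge stands for one match, which the match operator would return as one graph of the result collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; covers single-edge patterns, not full Cypher.
final class EdgePatternSketch {

    record Edge(int id, String label, int source, int target) {}

    /** Returns every edge matching (:sourceLabel)-[:edgeLabel]->(:targetLabel). */
    static List<Edge> match(Map<Integer, String> vertexLabels, List<Edge> edges,
                            String sourceLabel, String edgeLabel, String targetLabel) {
        List<Edge> matches = new ArrayList<>();
        for (Edge e : edges) {
            if (e.label().equals(edgeLabel)
                    && sourceLabel.equals(vertexLabels.get(e.source()))
                    && targetLabel.equals(vertexLabels.get(e.target()))) {
                matches.add(e);
            }
        }
        return matches;
    }

    /** Vertex id to type label for the example graph. */
    static Map<Integer, String> exampleVertexLabels() {
        return Map.of(1, "Person", 2, "Band", 3, "Person", 4, "Band", 5, "Person");
    }

    static List<Edge> exampleEdges() {
        return List.of(new Edge(1, "likes", 1, 2), new Edge(2, "likes", 3, 2),
                       new Edge(3, "likes", 3, 4), new Edge(4, "knows", 3, 5),
                       new Edge(5, "likes", 5, 4));
    }

    public static void main(String[] args) {
        System.out.println(
            match(exampleVertexLabels(), exampleEdges(), "Person", "likes", "Band"));
    }
}
```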

Page 40: GROUPING

LogicalGraph grouped = graph3.groupBy(
  [:label],  // vertex keys
  [:label])  // edge keys

LogicalGraph grouped = graph3.groupBy([:label], [COUNT()], [:label], [MAX('a')])

[Figure: grouping of graph 3 by keys, and by keys plus aggregates. Input values of property a: 23, 84, 42, 12, 13, 21. The grouped result shows count: 2 and count: 3 as well as max(a): 42, 84, 13 and 21]

Page 41: SAMPLE GRAPH

Vertices:
[0] Tag {name: Databases}
[1] Tag {name: Graphs}
[2] Tag {name: Hadoop}
[3] Forum {title: Graph Databases}
[4] Forum {title: Graph Processing}
[5] Person {name: Alice, gender: f, city: Leipzig, age: 23}
[6] Person {name: Bob, gender: m, city: Leipzig, age: 30}
[7] Person {name: Carol, gender: f, city: Dresden, age: 30}
[8] Person {name: Dave, gender: m, city: Dresden, age: 42}
[9] Person {name: Eve, gender: f, city: Dresden, age: 35, speaks: en}
[10] Person {name: Frank, gender: m, city: Berlin, age: 23, IP: 169.32.1.3}

Edges (ids 0-23; endpoints are shown only in the figure):
• knows: 10 edges, with since: 2013 (3x), since: 2014 (4x), since: 2015 (3x)
• hasInterest: 4 edges
• hasModerator: 2 edges
• hasMember: 4 edges
• hasTag: 4 edges

Page 42: GROUPING: TYPE LEVEL (SCHEMA GRAPH)

vertexGrKeys = [:label]
edgeGrKeys = [:label]
sumGraph = databaseGraph.groupBy(vertexGrKeys, [COUNT()], edgeGrKeys, [COUNT()])

Super vertices:
[11] Person {count: 6}
[12] Forum {count: 2}
[13] Tag {count: 3}

Super edges (ids 24-28):
hasMember {count: 4}
knows {count: 10}
hasInterest {count: 4}
hasTag {count: 4}
hasModerator {count: 2}

Page 43: GROUPING: PROPERTY-SPECIFIC

personGraph = databaseGraph.subgraph(
  (vertex => vertex[:label] == 'Person'),
  (edge => edge[:label] == 'knows'))

vertexGrKeys = [:label, "city"]
edgeGrKeys = [:label]
sumGraph = personGraph.groupBy(vertexGrKeys, [COUNT()], edgeGrKeys, [COUNT()])

Super vertices:
[11] Person {city: Leipzig, count: 2}
[12] Person {city: Dresden, count: 3}
[13] Person {city: Berlin, count: 1}

Super edges (ids 24-28; endpoints are shown only in the figure):
knows {count: 3}
knows {count: 1}
knows {count: 2}
knows {count: 2}
knows {count: 2}

Page 44: SELECTION

GraphCollection filtered = collection.select((graph => graph['vertexCount'] > 4));

[Figure: input collection with graph 1 (vertexCount: 5, vertices 0-4) and graph 2 (vertexCount: 4, vertices 5-8); the UDF vertexCount > 4 keeps only graph 1]
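Aggregation and selection are typically chained: first a property such as vertexCount is computed per graph, then the collection is filtered on it. The sketch below shows that chain with plain Java collections, assuming a graph is reduced to an id, a vertex-id set and an integer property map. The names AggregateSelectSketch, aggregate and select are hypothetical stand-ins and not the Gradoop API; the data follows the figure on this slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.ToIntFunction;

// Hypothetical illustration only; not the Gradoop API.
final class AggregateSelectSketch {

    record Graph(int id, Set<Integer> vertices, Map<String, Integer> properties) {}

    /** Returns a copy of g with the UDF result stored as graph property `key`. */
    static Graph aggregate(Graph g, String key, ToIntFunction<Graph> udf) {
        Map<String, Integer> properties = new HashMap<>(g.properties());
        properties.put(key, udf.applyAsInt(g));
        return new Graph(g.id(), g.vertices(), properties);
    }

    /** Keeps the graphs of the collection that satisfy the predicate. */
    static List<Graph> select(List<Graph> collection, Predicate<Graph> predicate) {
        List<Graph> result = new ArrayList<>();
        for (Graph g : collection) {
            if (predicate.test(g)) {
                result.add(g);
            }
        }
        return result;
    }

    /** The two graphs from the figure, each annotated with its vertexCount. */
    static List<Graph> example() {
        List<Graph> input = List.of(
            new Graph(1, Set.of(0, 1, 2, 3, 4), Map.of()),
            new Graph(2, Set.of(5, 6, 7, 8), Map.of()));
        List<Graph> annotated = new ArrayList<>();
        for (Graph g : input) {
            annotated.add(aggregate(g, "vertexCount", x -> x.vertices().size()));
        }
        return annotated;
    }

    public static void main(String[] args) {
        List<Graph> filtered =
            select(example(), g -> g.properties().get("vertexCount") > 4);
        System.out.println(filtered);
    }
}
```

Returning a new Graph from aggregate, rather than mutating the input, matches the assignment style of the slide code (graph3 = graph3.aggregate(udf)); that reading is an assumption of this sketch.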

Page 45: CALL (E.G. FREQUENT SUBGRAPHS)

GraphCollection frequentPatterns = collection.callForCollection(new TransactionalFSM(0.5))

[Figure: input graph collection and the frequent subgraphs returned by FSM with threshold 50%]

Page 46: Implementation

Page 47: GRAPH REPRESENTATION

POJO | Fields | Flink dataset
EPGMGraphHead | Id, Label, Properties | DataSet<EPGMGraphHead>
EPGMVertex | Id, Label, Properties, Graphs | DataSet<EPGMVertex>
EPGMEdge | Id, Label, Properties, SourceId, TargetId, Graphs | DataSet<EPGMEdge>

Field types (shown for EPGMVertex):
Id: GradoopId := UUID (128-bit)
Label: String
Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
Graphs: GradoopIdSet := Set<GradoopId>

Page 48: GRAPH REPRESENTATION: EXAMPLE

DataSet<EPGMGraphHead>
Id | Label | Properties
1 | Community | {interest: Heavy Metal}
2 | Community | {interest: Hard Rock}

DataSet<EPGMVertex>
Id | Label | Properties | Graphs
1 | Person | {name: Alice, born: 1984} | {1}
2 | Band | {name: Metallica, founded: 1981} | {1}
3 | Person | {name: Bob} | {1,2}
4 | Band | {name: AC/DC, founded: 1973} | {2}
5 | Person | {name: Eve} | {2}

DataSet<EPGMEdge>
Id | Label | Source | Target | Properties | Graphs
1 | likes | 1 | 2 | {since: 2014} | {1}
2 | likes | 3 | 2 | {since: 2013} | {1}
3 | likes | 3 | 4 | {since: 2015} | {2}
4 | knows | 3 | 5 | {} | {2}
5 | likes | 5 | 4 | {since: 2014} | {2}

[Figure: the example graph with logical graphs 1 | Community | interest: Heavy Metal and 2 | Community | interest: Hard Rock]
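The three tables can be mirrored by three plain Java record types. This is a simplified sketch under stated assumptions: ids are ints instead of 128-bit GradoopIds, properties are a Map<String, Object> instead of a byte-encoded PropertyList, and plain lists stand in for Flink DataSets. The names EpgmLayoutSketch, GraphHead, Vertex and Edge are hypothetical and are not the EPGM POJO classes. The point of the sketch is the Graphs column: membership in a logical graph is stored on each vertex and edge, so a logical graph is recovered by filtering.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified mirror of the three element tables; not the EPGM POJOs.
final class EpgmLayoutSketch {

    record GraphHead(int id, String label, Map<String, Object> properties) {}

    record Vertex(int id, String label, Map<String, Object> properties, Set<Integer> graphs) {}

    record Edge(int id, String label, int source, int target,
                Map<String, Object> properties, Set<Integer> graphs) {}

    static final List<GraphHead> GRAPH_HEADS = List.of(
        new GraphHead(1, "Community", Map.of("interest", "Heavy Metal")),
        new GraphHead(2, "Community", Map.of("interest", "Hard Rock")));

    static final List<Vertex> VERTICES = List.of(
        new Vertex(1, "Person", Map.of("name", "Alice", "born", 1984), Set.of(1)),
        new Vertex(2, "Band", Map.of("name", "Metallica", "founded", 1981), Set.of(1)),
        new Vertex(3, "Person", Map.of("name", "Bob"), Set.of(1, 2)),
        new Vertex(4, "Band", Map.of("name", "AC/DC", "founded", 1973), Set.of(2)),
        new Vertex(5, "Person", Map.of("name", "Eve"), Set.of(2)));

    static final List<Edge> EDGES = List.of(
        new Edge(1, "likes", 1, 2, Map.of("since", 2014), Set.of(1)),
        new Edge(2, "likes", 3, 2, Map.of("since", 2013), Set.of(1)),
        new Edge(3, "likes", 3, 4, Map.of("since", 2015), Set.of(2)),
        new Edge(4, "knows", 3, 5, Map.of(), Set.of(2)),
        new Edge(5, "likes", 5, 4, Map.of("since", 2014), Set.of(2)));

    /** Ids of all vertices that belong to the given logical graph. */
    static List<Integer> vertexIdsOf(int graphId) {
        return VERTICES.stream()
            .filter(v -> v.graphs().contains(graphId))
            .map(Vertex::id)
            .collect(Collectors.toList());
    }

    /** Ids of all edges that belong to the given logical graph. */
    static List<Integer> edgeIdsOf(int graphId) {
        return EDGES.stream()
            .filter(e -> e.graphs().contains(graphId))
            .map(Edge::id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (GraphHead head : GRAPH_HEADS) {
            System.out.println(head + " vertices=" + vertexIdsOf(head.id())
                + " edges=" + edgeIdsOf(head.id()));
        }
    }
}
```

Vertex 3 (Bob) carries the graph set {1,2}, so it shows up in both logical graphs without being stored twice.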

Page 49: OPERATOR IMPLEMENTATION

Exclusion

// input: firstGraph (G[1]), secondGraph (G[2])

// id of the second graph, broadcast to the filter functions below
DataSet<GradoopId> graphId = secondGraph.getGraphHead()
  .map(new Id<G>());

// keep vertices of the first graph that are not contained in the second graph
DataSet<V> newVertices = firstGraph.getVertices()
  .filter(new NotInGraphBroadCast<V>())
  .withBroadcastSet(graphId, GRAPH_ID);

// keep edges not contained in the second graph whose source and target vertices both remain
DataSet<E> newEdges = firstGraph.getEdges()
  .filter(new NotInGraphBroadCast<E>())
  .withBroadcastSet(graphId, GRAPH_ID)
  .join(newVertices)
  .where(new SourceId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>())
  .join(newVertices)
  .where(new TargetId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>());

[Figure: the example graph with logical graphs 1 | Community | interest: Heavy Metal and 2 | Community | interest: Hard Rock]

Page 50: IMPLEMENTATION OF GRAPH GROUPING (PROC. BTW 2017)

Vertex dataflow (input V, output V'):
• Map: extract attributes
• GroupBy(1) + GroupReduce*: assign vertices to groups, compute aggregates, create super vertex tuples, forward updated group members
• Filter + Map: extract super vertex tuples, build super vertices
• Filter + Map: extract group members, reduce memory footprint

Edge dataflow (input E, output E'):
• Map: extract attributes
• Join*: replace SourceId/TargetId with the corresponding super vertex id
• GroupBy(1,2,3) + GC + GR* + Map: assign edges to groups, compute aggregates, build super edges

* requires worker communication

[Figure: dataflow with intermediate datasets V1, V2, V3, E1 and E2]
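The grouping steps listed above can be followed on one machine without any of the distribution concerns the slide is about. The sketch below groups vertices by type label with a count aggregate, rewrites each edge's endpoints to their vertex groups, and then groups edges by (source group, label, target group), again with a count. It is a simplified illustration, not the Flink dataflow from the BTW 2017 paper: GroupingSketch and its methods are hypothetical, and super edges are represented only as string keys.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine illustration of grouping by label with COUNT.
final class GroupingSketch {

    record Vertex(int id, String label) {}

    record Edge(String label, int source, int target) {}

    /** Super vertices: one group per vertex label, aggregated with a count. */
    static Map<String, Integer> groupVertices(List<Vertex> vertices) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Vertex v : vertices) {
            counts.merge(v.label(), 1, Integer::sum);
        }
        return counts;
    }

    /** Super edges: endpoints replaced by their vertex group, then counted per key. */
    static Map<String, Integer> groupEdges(List<Vertex> vertices, List<Edge> edges) {
        // Step 1: assign every vertex id to its group (here: its label).
        Map<Integer, String> groupOf = new HashMap<>();
        for (Vertex v : vertices) {
            groupOf.put(v.id(), v.label());
        }
        // Step 2: replace source/target by the group and count edges per (source, label, target).
        Map<String, Integer> counts = new TreeMap<>();
        for (Edge e : edges) {
            String key = groupOf.get(e.source()) + "-[" + e.label() + "]->" + groupOf.get(e.target());
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    static List<Vertex> exampleVertices() {
        return List.of(new Vertex(1, "Person"), new Vertex(2, "Band"), new Vertex(3, "Person"),
                       new Vertex(4, "Band"), new Vertex(5, "Person"));
    }

    static List<Edge> exampleEdges() {
        return List.of(new Edge("likes", 1, 2), new Edge("likes", 3, 2), new Edge("likes", 3, 4),
                       new Edge("knows", 3, 5), new Edge("likes", 5, 4));
    }

    public static void main(String[] args) {
        System.out.println(groupVertices(exampleVertices()));
        System.out.println(groupEdges(exampleVertices(), exampleEdges()));
    }
}
```

On the Person/Band example graph this yields two super vertices (Person with count 3, Band with count 2) and two super edges (Person likes Band with count 4, Person knows Person with count 1). The join that replaces endpoint ids is the step the slide marks as requiring worker communication; here it is a local map lookup.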

Page 51: ITERATIVE COMPUTATION OF FREQUENT SUBGRAPHS

Iterations over the search space, collecting intermediate iteration results into the result:
• 1-edge: R C F
• 2-edge: G R C F
• 3-edge: G R C F
• n-edge: G R C F

G: grow frequent patterns
R: report supported patterns
C: count global frequency
F: filter by min frequency


Page 53: TEST WORKFLOW: SUMMARIZED COMMUNITIES

http://ldbcouncil.org/

1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons' location and gender
8. Aggregate vertex and edge count of grouped graph

Page 54: TEST WORKFLOW: SUMMARIZED COMMUNITIES

https://git.io/vgozj

[Same eight workflow steps as on page 53]

Page 55: BENCHMARK RESULTS

Dataset | # Vertices | # Edges
Graphalytics.1 | 61,613 | 2,026,082
Graphalytics.10 | 260,613 | 16,600,778
Graphalytics.100 | 1,695,613 | 147,437,275
Graphalytics.1000 | 12,775,613 | 1,363,747,260
Graphalytics.10000 | 90,025,613 | 10,872,109,028

Cluster: 16x Intel(R) Xeon(R) 2.50GHz (6 cores), 16x 48 GB RAM, 1 Gigabit Ethernet, Hadoop 2.6.0, Flink 1.0-SNAPSHOT

[Figure: runtime in seconds (axis 0-1200) vs. number of workers (1, 2, 4, 8, 16) for Graphalytics.100]

[Figure: speedup vs. number of workers (1, 2, 4, 8, 16) for Graphalytics.100, compared with linear speedup]

Page 56: BENCHMARK RESULTS 2

Dataset | # Vertices | # Edges
Graphalytics.1 | 61,613 | 2,026,082
Graphalytics.10 | 260,613 | 16,600,778
Graphalytics.100 | 1,695,613 | 147,437,275
Graphalytics.1000 | 12,775,613 | 1,363,747,260
Graphalytics.10000 | 90,025,613 | 10,872,109,028

Cluster: 16x Intel(R) Xeon(R) 2.50GHz (6 cores), 16x 48 GB RAM, 1 Gigabit Ethernet, Hadoop 2.6.0, Flink 1.0-SNAPSHOT

[Figure: runtime in seconds per dataset; tick labels 1, 10, 100, 1000, 10000]

Page 57: EVALUATION OF GROUPING: SCALABILITY

[Figures: speedup for grouping on type; runtime for grouping on type]


Page 59: SUMMARY

• Big Graph Analytics
  - Hadoop-based graph processing frameworks based on generic graphs
  - Spark/Flink: batch/streaming-oriented workflows (rather than interactive OLAP)
  - graph collections not generally supported
  - generally missing: graph-based data integration, built-in support for dynamic graph data
• GraDoop (www.gradoop.org)
  - open-source infrastructure for the entire processing pipeline: graph acquisition, storage, integration, transformation, analysis (queries + graph mining), visualization
  - extended property graph model (EPGM) with powerful operators (e.g., grouping, pattern matching) and support for graph collections
  - leverages the Hadoop ecosystem: Apache HBase for permanent graph storage, Apache Flink to implement operators
  - ongoing implementation

Page 60: COMPARISON

Criterion | Graph database systems (Neo4j, OrientDB) | Graph processing systems (Pregel, Giraph) | Distributed dataflow systems (Flink Gelly, Spark GraphX) | Gradoop
data model | rich graph models (PGM) | generic graph models | generic graph models | Extended PGM
focus | queries | analytic | analytic | analytic
query language | yes | no | no | (yes)
graph analytics | no | yes | yes | yes
scalability | vertical | horizontal | horizontal | horizontal
workflows | no | no | yes | yes
persistency | yes | no | no | yes
dynamic graphs / versioning | no | no | no | no
data integration | no | no | no | (yes)
visualization | (yes) | no | no | limited

Page 61: OUTLOOK / CHALLENGES

• Graph-based data integration
  - refined operators for data import, matching and fusion
  - holistic data integration for many sources (clusters of matching entities instead of binary "sameAs" links)
• Graph analytics
  - optimized graph partitioning approaches
  - automatic load balancing techniques
  - visualization of graphs and analysis results
  - interactive graph analytics
  - dynamic graph data

Page 62: REFERENCES

M. Junghanns, M. Kießling, A. Averbuch, A. Petermann, E. Rahm: Cypher-based Graph Pattern Matching in Gradoop. Proc. ACM SIGMOD workshop on Graph Data Management Experiences and Systems (GRADES), 2017

M. Junghanns, A. Petermann, K. Gomez, E. Rahm: GRADOOP - Scalable Graph Data Management and Analytics with Hadoop. Tech. report (arXiv), Univ. of Leipzig, 2015

M. Junghanns, A. Petermann, M. Neumann, E. Rahm: Management and Analysis of Big Graph Data: Current Systems and Open Challenges. In: Big Data Handbook (eds.: S. Sakr, A. Zomaya), Springer, 2017

M. Junghanns, A. Petermann, N. Teichmann, K. Gomez, E. Rahm: Analyzing Extended Property Graphs with Apache Flink. Proc. ACM SIGMOD workshop on Network Data Analytics (NDA), 2016

M. Junghanns, A. Petermann, E. Rahm: Distributed Grouping of Property Graphs with GRADOOP. In: Proc. BTW, March 2017

A. Petermann, M. Junghanns: Scalable Business Intelligence with Graph Collections. it - Information Technology, Special Issue: Big Data Analytics, 2016

A. Petermann, M. Junghanns, S. Kemper, K. Gomez, N. Teichmann, E. Rahm: Graph Mining for Complex Data Analytics. Proc. ICDM 2016 (Demo paper)

A. Petermann, M. Junghanns, R. Müller, E. Rahm: BIIIG: Enabling Business Intelligence with Integrated Instance Graphs. Proc. 5th Int. Workshop on Graph Data Management (GDM 2014)

A. Petermann, M. Junghanns, R. Müller, E. Rahm: Graph-based Data Integration and Business Intelligence with BIIIG. Proc. VLDB Conf., 2014

A. Petermann, M. Junghanns, R. Müller, E. Rahm: FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics. Proc. 5th Int. Workshop on Big Data Benchmarking (WBDB), 2014

A. Petermann, M. Junghanns, E. Rahm: DIMSpan - Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems. arXiv 2017