b 4 gravty

1. What Is Gravty?
2. The Internals of Gravty
3. Fine-Tuning Gravty
4. Future Plans

A Graph Database Is

"A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data." (Wikipedia)

Stores objects (vertices) and relationships (edges)

Provides graph search capabilities

Vertices and Edges in a Graph Database

(Figure: vertices connected by edges labeled "Friends" and "Likes".)
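To make the vertex and edge vocabulary concrete, here is a minimal in-memory sketch of a property graph. The class name `TinyGraph`, its methods, and the sample edges are hypothetical and only illustrate the idea; they are not Gravty's API.

```python
from collections import defaultdict


class TinyGraph:
    """Toy property graph: vertices plus labeled, directed edges."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> properties
        self.out_edges = defaultdict(list)  # (source id, label) -> target ids

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, tgt):
        self.out_edges[(src, label)].append(tgt)

    def out(self, src, label):
        """Graph search: follow outgoing edges that carry the given label."""
        return list(self.out_edges.get((src, label), []))


graph = TinyGraph()
for vid in ("brown", "cony", "sally"):
    graph.add_vertex(vid, name=vid.capitalize())

# Hypothetical sample edges, chosen only for illustration.
graph.add_edge("brown", "friends", "cony")
graph.add_edge("cony", "friends", "brown")
graph.add_edge("brown", "likes", "sally")

print(graph.out("brown", "friends"))  # ['cony']
```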

Use Cases of a Graph Database

• Facebook Social Graph: social networks
• Google PageRank: ranking websites
• Walmart and eBay: product recommendation

Need for a Large Graph Database System

(Figure: Gravty backs social graph, ranking, and recommendation features for LINE Timeline, LINE Talk, LINE Friends Shop, and LINE News.)

• 7 billion vertices
• 100 billion edges
• 200 billion indexes
• 5 billion writes a day (create / update / delete)

Gravty Is

A scalable graph database that efficiently finds relational information in a large pool of data using graph search techniques.

Requirements for Gravty

Easy to scale out

• To support ever-increasing data

Easy to develop

• Add, modify, and remove features as necessary

• Tailored to the LINE development environment

• Not dependent on LINE-specific components

Full control over everything!

Easy to use

• Graph query language
• REST API


The Internals of Gravty

• Technology Stack and Architecture
• Data Model

Technology Stack and Architecture

(Figure: layered architecture, top to bottom.)

• Application
• TinkerPop3 Gremlin-Console and TinkerPop 3.2.0 Graph API
• Gravty
  - Graph Processing Layer (OLTP only)
  - Storage Layer (Abstract Interface)
    Phoenix Repository (Default)
    Memory Repository (Standalone)
• Backing systems: HBase 1.1.x, Phoenix 4.8.0, Kafka 0.10.0.0, MySQL (config, meta), Local Memory
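One way to picture a storage layer hidden behind an abstract interface is the sketch below. The names `EdgeRepository`, `MemoryRepository`, `put_edge`, `scan_edges`, and `friends_of` are assumptions made for illustration; the slides do not show Gravty's real interface.

```python
from abc import ABC, abstractmethod


class EdgeRepository(ABC):
    """Hypothetical storage-layer contract (not Gravty's actual interface)."""

    @abstractmethod
    def put_edge(self, src, label, tgt, props):
        ...

    @abstractmethod
    def scan_edges(self, src, label, limit):
        ...


class MemoryRepository(EdgeRepository):
    """Standalone backend that keeps every edge in local memory."""

    def __init__(self):
        self._rows = {}  # (src, label, tgt) -> edge properties

    def put_edge(self, src, label, tgt, props):
        self._rows[(src, label, tgt)] = dict(props)

    def scan_edges(self, src, label, limit):
        keys = sorted(k for k in self._rows if k[:2] == (src, label))
        return [k[2] for k in keys[:limit]]


def friends_of(repo, vertex_id, limit=3):
    """Graph-processing code depends only on the abstract interface."""
    return repo.scan_edges(vertex_id, "friends", limit)


repo = MemoryRepository()
for target in ("sally", "cony", "moon"):
    repo.put_edge("brown", "friends", target, {})
print(friends_of(repo, "brown"))  # ['cony', 'moon', 'sally']
```

A Phoenix-backed repository would implement the same two methods against HBase, so the processing layer would not need to change when the backend does.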

Data Model: Flat-Wide Table

• Row key: vertex-id
• Edges are stored in columns
• Disadvantages
  - Column scan is slow
  - Columns cannot be split

Row          Column
vertex-id1   property property edge edge edge edge edge edge
vertex-id2   …
vertex-id3   …

Data Model: Tall-Narrow Table (Gravty)

• Row key: edge-id (SrcVertexId-Label-TgtVertexId)
• Edges are stored in rows
• Advantages
  - More effective edge scan
  - Parallel execution

Row                     Column
svtxid1-label-tvtxid2   edge property, edge property
svtxid1-label-tvtxid3   …
…
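The tall-narrow layout can be sketched with a sorted list of row keys, which loosely mimics how HBase orders rows. The class `TallNarrowTable` and its methods are hypothetical, and the sketch assumes vertex ids and labels contain no "-" character.

```python
import bisect


def edge_row_key(src, label, tgt):
    """Row key in the SrcVertexId-Label-TgtVertexId form."""
    return f"{src}-{label}-{tgt}"


class TallNarrowTable:
    """Toy table: one row per edge, rows kept in sorted key order."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> edge properties

    def put(self, src, label, tgt, props=None):
        key = edge_row_key(src, label, tgt)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = props or {}

    def prefix_scan(self, prefix, limit):
        """Seek to the prefix, then read consecutive rows up to the limit."""
        start = bisect.bisect_left(self._keys, prefix)
        rows = []
        for key in self._keys[start:]:
            if len(rows) == limit or not key.startswith(prefix):
                break
            rows.append(key)
        return rows


table = TallNarrowTable()
table.put("brown", "friends", "sally")
table.put("brown", "likes", "james")
table.put("brown", "friends", "cony")
table.put("brown", "friends", "moon")

rows = table.prefix_scan("brown-friends-", limit=3)
print([key.rsplit("-", 1)[1] for key in rows])  # ['cony', 'moon', 'sally']
```

Because all "friends" edges of one source share a key prefix, a single contiguous row scan returns them without touching the "likes" rows.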

Flat-Wide vs Tall-Narrow

g.V("brown").out("friends").id().limit(3)

(Figure: Brown is connected by "friends" edges to Cony, Moon, and Sally.)

Result: [cony, moon, sally]

Flat-Wide vs Tall-Narrow: Flat-Wide Model

(Figure: the row "Brown" holds all of its edges as columns, under the labels 'likes' and 'friends'.)

(1) Row scan
(2) Column scan

Result: [cony, moon, sally]

2 operations

Flat-Wide vs Tall-Narrow: Tall-Narrow Model (Gravty)

Rows:
brown-friends-cony
brown-friends-moon
brown-friends-sally

(1) Row scan

Result: [cony, moon, sally]

1 operation

• Can split by rows (region)
• Can isolate hotspot rows
• Can scan in parallel

Flat-Wide vs Tall-Narrow

g.V("brown").out("friends").out("friends").id().limit(10)

4 searches in total
• Flat-Wide = 8 operations
• Tall-Narrow (Gravty) = 4 operations
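The operation counts above follow from simple arithmetic: each vertex search costs two operations (row scan plus column scan) in the flat-wide model and one (row scan) in the tall-narrow model. The sketch below reproduces the numbers; the friend lists of Cony, Moon, and Sally are hypothetical sample data.

```python
# Operations needed per vertex search, as stated on the slides.
OPS_PER_SEARCH = {"flat-wide": 2, "tall-narrow": 1}

# Hypothetical friend lists; only Brown's list comes from the slides.
FRIENDS = {
    "brown": ["cony", "moon", "sally"],
    "cony": ["brown", "james"],
    "moon": ["brown", "boss"],
    "sally": ["brown", "edward"],
}


def count_operations(friends, start, hops, model):
    """Return (searches, storage operations) for a multi-hop traversal."""
    frontier, searches = [start], 0
    for _ in range(hops):
        next_frontier = []
        for vertex in frontier:
            searches += 1  # one edge lookup per vertex in the frontier
            next_frontier.extend(friends.get(vertex, []))
        frontier = next_frontier
    return searches, searches * OPS_PER_SEARCH[model]


print(count_operations(FRIENDS, "brown", 2, "flat-wide"))    # (4, 8)
print(count_operations(FRIENDS, "brown", 2, "tall-narrow"))  # (4, 4)
```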


Fine-Tuning Gravty

• Faster, Compact Querying
• Avoiding Hot-Spotting
• Efficient Secondary Indexing

Faster, Compact Querying

Reducing graph traversal steps

g.V(brown).hasLabel("user").out("friends").order().by("name", Order.incr).limit(5)

• TinkerPop: GraphStep, VertexStep, FilterStep, RangeStep, FilterStep
• Gravty: GGraphStep, GVertexStep
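The slides do not say how the five steps collapse into two. One plausible reading, sketched here purely as an assumption, is that filter and range steps are folded into the preceding data-fetching step so the storage layer can apply them during the scan. The function `fold_steps` and the `pushed_down` field are hypothetical.

```python
FETCH_STEPS = {"GraphStep", "VertexStep"}       # steps that read from storage
FOLDABLE_STEPS = {"FilterStep", "RangeStep"}    # steps assumed to be foldable


def fold_steps(steps):
    """Fold filter-like steps into the fetch step that precedes them."""
    folded = []
    for name in steps:
        if name in FETCH_STEPS:
            folded.append({"step": "G" + name, "fetch": True, "pushed_down": []})
        elif name in FOLDABLE_STEPS and folded and folded[-1]["fetch"]:
            folded[-1]["pushed_down"].append(name)
        else:
            folded.append({"step": name, "fetch": False, "pushed_down": []})
    return folded


plan = fold_steps(["GraphStep", "VertexStep", "FilterStep", "RangeStep", "FilterStep"])
print([step["step"] for step in plan])  # ['GGraphStep', 'GVertexStep']
```

Fewer steps means fewer intermediate iterators between the query and the storage scan, which is the effect the slide title describes.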

Faster, Compact Querying

Querying in parallel and pre-loading vertex properties

g.V(brown).outE("friends").limit(5).inV().order().by("name", Order.incr).properties("name")

inV(): Pipelined iterator from outE()
• TinkerPop: Sequential consuming
• Gravty: Parallel querying + pre-loading vertex property

(Figure: outE("friends") with limit 5 feeds inV(); the vertices carry "name" properties such as "Boss", "Edward", "Moon", "James", "Jessica", "Cony", and "Sally".)
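The difference between sequential consumption and parallel querying with pre-loaded properties can be sketched with a thread pool. `VERTEX_STORE`, `load_vertex`, and the worker count are hypothetical stand-ins for storage lookups; this is an illustration, not Gravty's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vertex store; a real lookup would be a storage round trip.
VERTEX_STORE = {
    "edward": {"name": "Edward"},
    "moon": {"name": "Moon"},
    "james": {"name": "James"},
    "jessica": {"name": "Jessica"},
    "cony": {"name": "Cony"},
}


def load_vertex(vertex_id):
    """Fetch a vertex together with its properties in one call (pre-loading)."""
    return {"id": vertex_id, **VERTEX_STORE[vertex_id]}


def in_vertices_sequential(edge_targets):
    """Consume edge targets one at a time, one lookup after another."""
    return [load_vertex(v) for v in edge_targets]


def in_vertices_parallel(edge_targets, workers=4):
    """Issue the lookups concurrently; map() preserves the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_vertex, edge_targets))


targets = ["edward", "moon", "james", "jessica", "cony"]  # outE("friends").limit(5)
vertices = in_vertices_parallel(targets)
names = sorted(v["name"] for v in vertices)  # order().by("name"), ascending
print(names)  # ['Cony', 'Edward', 'James', 'Jessica', 'Moon']
```

Since the properties arrive with the vertices, the later sort on "name" needs no further storage round trips.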

Hot-spotting problem with HBase RegionServer

Row keys in sequential order may cause RegionServers to suffer from:
• Heavy loads of writes or reads
• Inefficient region splitting

EDGE TABLE

SrcVertexId   Label   TgtVertexId
u000001       1       u000002
u000001       1       u000003
u000002       1       u000001
u000003       1       u000001
u000004       2       u000009

Avoiding Hot-Spotting

Solutions to the hot-spotting problem
- Pre-splitting regions
- Salting row keys with a hashed prefix (Salting tables by Apache Phoenix)

But there is a scan performance issue with the LIMIT clause:
SELECT * FROM index … LIMIT 100;

Avoiding Hot-Spotting: Phoenix Salted Table

(Figure: the Phoenix client scans 100 rows from each of four salted regions, a maximum of 400 rows, and then merge-sorts them on the client side to produce the result.)
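Why a LIMIT scan on a salted table reads more rows than it returns can be sketched as follows. The bucket count of 4, the CRC32-based salt, and the function names are assumptions for illustration; this is not how Phoenix computes its salt byte.

```python
import heapq
import zlib
from itertools import islice

NUM_BUCKETS = 4  # assumed number of salt buckets


def salt(key):
    """Hypothetical salt function that spreads keys across buckets."""
    return zlib.crc32(key.encode()) % NUM_BUCKETS


def build_salted_table(keys):
    """Group keys by salt bucket; each bucket is sorted, like a region."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key in keys:
        buckets[salt(key)].append(key)
    return [sorted(bucket) for bucket in buckets]


def salted_limit_scan(buckets, limit):
    """Scan up to `limit` rows per bucket, then merge-sort on the client."""
    partial = [bucket[:limit] for bucket in buckets]
    scanned = sum(len(rows) for rows in partial)
    merged = list(islice(heapq.merge(*partial), limit))
    return merged, scanned


keys = [f"row{i:04d}" for i in range(1000)]
buckets = build_salted_table(keys)
result, scanned = salted_limit_scan(buckets, limit=100)
print(len(result), scanned)  # 100 rows returned; up to 400 rows scanned
```

The client cannot know in advance which bucket holds the globally smallest keys, so every bucket must contribute up to `limit` rows before the merge.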

Avoiding Hot-Spotting: Custom Salting + Pre-splitting

(Figure: the row key prefix is hash(source-vertex-id); the Phoenix client scans 100 rows sequentially to produce the result.)
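With a prefix derived only from the source vertex id, every edge of one source lands in the same bucket, so a LIMIT scan can read rows sequentially from a single place. The sketch below assumes four buckets, a CRC32-based hash, and a `NN|` prefix format; all of these are illustrative choices, not Gravty's actual key layout.

```python
import bisect
import zlib

NUM_BUCKETS = 4  # assumed; regions would be pre-split on these prefixes


def bucket_of(src):
    """Hypothetical hash(source-vertex-id) reduced to a bucket number."""
    return zlib.crc32(src.encode()) % NUM_BUCKETS


def row_key(src, label, tgt):
    """Prefix the edge key with the bucket of its source vertex."""
    return f"{bucket_of(src):02d}|{src}-{label}-{tgt}"


def scan_edges(sorted_keys, src, label, limit):
    """Seek to the source's key range and read at most `limit` rows."""
    start_key = f"{bucket_of(src):02d}|{src}-{label}-"
    start = bisect.bisect_left(sorted_keys, start_key)
    rows = []
    for key in sorted_keys[start:start + limit]:
        if not key.startswith(start_key):
            break
        rows.append(key)
    return rows


all_keys = sorted(
    row_key(f"u{i:06d}", "friends", f"t{j:04d}")
    for i in range(1, 51)
    for j in range(200)
)
rows = scan_edges(all_keys, "u000007", "friends", limit=100)
print(len(rows))  # 100 rows read, all from one bucket
```

Different sources still spread across buckets, which keeps the write load distributed while avoiding the per-bucket fan-out of a generic salted table.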

Efficient Secondary Indexing

• Indexed graph view for faster graph search
• Asynchronous index processing using Kafka
• Tools for failure recovery

Default Phoenix IndexCommitter

(Figure: Put and Delete mutations on each HRegion pass through the Indexer Coprocessor, which uses the Phoenix Driver to send index updates to other HRegions.)

• Synchronous processing of index update requests
• Too many connections on each RegionServer (network is heavily congested)

numConnections = regionServers * regionServers * needConnections

Gravty IndexCommitter

(Figure: the Indexer Coprocessor on each HRegion sends Put and Delete mutations to Kafka; separate Indexers consume them and apply the index updates through the Phoenix Driver.)

• Asynchronous processing using Kafka

numConnections = indexers * regionServers * needConnections
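The two connection formulas can be compared with a few lines of arithmetic. The cluster sizes below (20 RegionServers, 4 indexers, 2 connections each) are hypothetical numbers picked for illustration, not figures from the slides.

```python
def default_connections(region_servers, need_connections):
    """numConnections = regionServers * regionServers * needConnections"""
    return region_servers * region_servers * need_connections


def gravty_connections(indexers, region_servers, need_connections):
    """numConnections = indexers * regionServers * needConnections"""
    return indexers * region_servers * need_connections


# Hypothetical cluster sizes, for illustration only.
region_servers, indexers, need_connections = 20, 4, 2

before = default_connections(region_servers, need_connections)
after = gravty_connections(indexers, region_servers, need_connections)
print(before, after, after / before)  # 800 160 0.2
```

The ratio reduces to indexers / regionServers, so the saving grows as the HBase cluster grows while the number of indexers stays small.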

Default Phoenix IndexCommitter

(Figure: the Primary Table, INDEX 1, and INDEX 2 each sit on a Region Server with a Phoenix Coprocessor.)

1. Phoenix client UPSERT (PUT to the Primary Table)
2. Request HBase mutations for indexes in parallel (PUT / DELETE to INDEX 1 and INDEX 2)
3. Phoenix client returns (RETURN)

Gravty IndexCommitter

(Figure: the Primary Table, INDEX 1, and INDEX 2 each sit on a Region Server with a Phoenix Coprocessor; Kafka and an Index Consumer sit between the Primary Table and the indexes.)

1. PUT (to the Primary Table)
2. HBase mutations for INDEX 1, 2 (sent to Kafka)
3. RETURN
4. Consume (Kafka Index Consumer)
5. PUT / DELETE (to INDEX 1 and INDEX 2)
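The five steps above can be sketched with an in-process queue standing in for the Kafka topic. The class `AsyncIndexCommitter`, its fields, and the single "name" index are hypothetical simplifications; a real deployment would use Kafka producers and consumers rather than `queue.Queue`.

```python
import queue
import threading


class AsyncIndexCommitter:
    """Toy model: primary write returns before the index is updated."""

    def __init__(self):
        self.primary = {}         # row id -> name
        self.index = {}           # name -> set of row ids
        self.log = queue.Queue()  # stands in for a Kafka topic

    def put(self, row_id, name):
        old = self.primary.get(row_id)
        self.primary[row_id] = name        # 1. PUT to the primary table
        self.log.put((row_id, old, name))  # 2. enqueue the index mutations
        return "RETURN"                    # 3. return without waiting

    def consume_forever(self):
        while True:
            item = self.log.get()          # 4. consume
            if item is None:               # sentinel: stop the consumer
                self.log.task_done()
                break
            row_id, old, new = item
            if old is not None:
                self.index.get(old, set()).discard(row_id)  # 5. DELETE
            self.index.setdefault(new, set()).add(row_id)   # 5. PUT
            self.log.task_done()


committer = AsyncIndexCommitter()
consumer = threading.Thread(target=committer.consume_forever, daemon=True)
consumer.start()

committer.put("u1", "Brown")
committer.put("u1", "Cony")
committer.log.join()      # wait until the consumer has caught up
committer.log.put(None)   # stop the consumer thread
consumer.join()
print(committer.index)    # {'Brown': set(), 'Cony': {'u1'}}
```

The writer never waits on index tables, so the index is only eventually consistent with the primary table; that trade-off is what the failure-recovery tooling has to cover.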

Secondary Indexing Metrics

• Server TPS: 3x
• RegionServer number of connections: 1/8

Tools for Failure Recovery

Best-Effort Failover: fail fast, fix later

• Reentrant event processing: every row is versioned in HBase (timestamp)
• Logging failures and replaying failed requests
• Time machine to resume at a certain runtime: resetting the runtime offset of Kafka consumers
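Reentrant processing means that replaying events after an offset reset must not corrupt the index. A minimal sketch, assuming each event carries its own timestamp and that the newest timestamp wins, looks like this. `VersionedIndex`, `replay`, and the event layout are hypothetical.

```python
class VersionedIndex:
    """Toy store where each cell keeps the newest (timestamp, value) pair."""

    def __init__(self):
        self.cells = {}  # key -> (timestamp, value)

    def apply(self, event):
        timestamp, key, value = event
        current = self.cells.get(key)
        # Older or duplicate events cannot overwrite newer data.
        if current is None or timestamp >= current[0]:
            self.cells[key] = (timestamp, value)


def replay(index, log, from_offset):
    """Re-consume the log from an offset, like a reset Kafka consumer."""
    for event in log[from_offset:]:
        index.apply(event)


log = [(1, "u1", "Brown"), (2, "u2", "Cony"), (3, "u1", "Boss")]

index = VersionedIndex()
replay(index, log, from_offset=0)
first_pass = dict(index.cells)

replay(index, log, from_offset=0)  # rewind and process everything again
print(index.cells == first_pass)   # True
```

Because applying an event twice leaves the same state, operators can rewind generously after a failure instead of locating the exact failed request.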

Tools for Failure Recovery

Monitoring: setting alerts and displaying metrics

• Prometheus
• Dropwizard metrics
• jvm_exporter
• Grafana
• Ambari

Multiple Graph Clusters

(Figure, before: Client, Graph API, Gravty, one HBase Cluster.)
(Figure, after: Client, Graph API, Gravty, multiple HBase Clusters.)

HBase Repository

Storage Layer (Abstract Interface)
• Phoenix Repository (Default)
• HBase Repository
• Memory Repository (Standalone)

Backends: HBase, Phoenix, Region Coprocessor, Local Memory

OLAP Functionality

• Graph analytics system, graph computation
• TinkerPop Graph Computing API

We will open source Gravty