Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016


Upload: datastax

Post on 12-Apr-2017


Page 1: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

JON HADDAD PRINCIPAL CONSULTANT, TLP

CONNECTING C* DATA WITH GRAPHFRAMES

loves

Page 2: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WHAT’S THE LAST PICKLE DO?

Page 3: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016
Page 4: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WE HELP MAKE YOU A GROUP OF EXPERTS

Page 5: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WHO IS THIS GUY?

Page 6: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

15 YEARS EXPERIENCE

Page 7: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

4 YEARS WITH CASSANDRA

Page 8: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CASSANDRA DATA MODELING

Page 9: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

BIG DATA PROBLEMS

Page 10: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

DATA MODELING RULES

Page 11: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

DENORMALIZE

Page 12: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

I MISS MY JOINS

Page 13: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016
Page 14: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

PERFORMANCE & RELIABILITY

Page 15: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

LOW FLEXIBILITY

Page 16: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WHAT IS A GRAPH?

Page 17: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

[Diagram: vertices JON, TLP and NATE, connected by edges labeled "cofounded", "knows" and "works at"]

Page 18: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

RELATIONSHIPS ARE ARBITRARY

Page 19: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

A GRAPH IS TRAVERSED

[Diagram: JON -[works at]-> TLP, annotated as start -> follow -> end]
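To make "traversed" concrete, here is a minimal plain-Python sketch of starting at a vertex and following edges to their end vertices. It is not GraphFrames code; the edge directions for "cofounded" and "knows" and the helper names `traverse` and `neighbors` are assumptions made for illustration.

```python
# Toy graph based on the slide's vertices and edge labels.
# Edge directions other than JON -[works at]-> TLP are assumed.
edges = [
    ("JON", "works at", "TLP"),
    ("NATE", "cofounded", "TLP"),
    ("JON", "knows", "NATE"),
]


def traverse(edges, start, label):
    """Follow every edge with the given label out of `start`; return the end vertices."""
    return [dst for src, lbl, dst in edges if src == start and lbl == label]


def neighbors(edges, start):
    """Follow all outgoing edges of `start`, whatever their label."""
    return [(lbl, dst) for src, lbl, dst in edges if src == start]


print(traverse(edges, "JON", "works at"))
print(neighbors(edges, "JON"))
```

Because the edges are plain (source, label, destination) records rather than typed tables, adding a new kind of relationship only means adding rows.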

Page 20: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

ELEMENTS ARE NOT TYPED

Page 21: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

FOLLOW ALL EDGES IN ANY QUERY

Page 22: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

[Diagram: vertices JON, TLP and NATE, connected by edges labeled "cofounded", "knows" and "works at"]

Page 23: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

TRAVERSALS ARE MORE FLEXIBLE THAN JOINS

Page 24: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

THE WORLD IS A GRAPH

Page 25: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

epic chart of flexibility

Apache Cassandra

Graph databases

Page 26: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GRAPH IS COOL

RIGHT?

Page 27: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

NEO4J

TITAN

DSE GRAPH

cookie monster photo?

Page 28: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GRAPH ALL THE THINGS!

Page 29: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

TRADEOFFS

Page 30: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

PERFORMANCE

graph

Page 31: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

REMEMBER WHY WE DON’T DO JOINS?

Page 32: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

DISTRIBUTED JOINS ARE HARD WORK

Page 33: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

MORE WORK =

SLOWER DATABASE

Page 34: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

APPLICATION COMPLEXITY

Page 35: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

DO I NEED GRAPH ALL THE TIME?

Page 36: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GRAPH QUERIES ON CASSANDRA?

Page 37: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GRAPH IS COOL FOR ANALYTICS

Page 38: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

LET’S USE SPARK

Page 39: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

cdm install movielens

Page 40: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CREATE TABLE movies (
    id uuid PRIMARY KEY,
    avg_rating float,
    genres set<text>,
    name text,
    release_date date,
    url text,
    video_release_date date
);

Page 41: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CREATE TABLE users (
    id uuid PRIMARY KEY,
    address text,
    age int,
    city text,
    gender text,
    name text,
    occupation text,
    zip text
);

Page 42: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CREATE TABLE ratings_by_movie (
    movie_id uuid,
    user_id uuid,
    rating int,
    ts int,
    PRIMARY KEY (movie_id, user_id)
);

Page 43: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

RECOMMENDATION ENGINE

Page 44: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

AWESOME

Page 45: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GET DATA INTO A GRAPH

Page 46: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

[Diagram: JON -[label: rated, rating: 5]-> TOP GUN]

Page 47: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

DATAFRAMES

id                                    genres                            name
ae4f9269-5d62-4ad1-b87c-1b23962bb224  {'Drama'}                         Prefontaine (1997)
de9a14a9-6d6d-4573-b415-c8555e85d391  {'Drama'}                         Raging Bull (1980)
0b67d4e7-ee2b-47ab-9437-df0c793ea72a  {'Action', 'Sci-Fi', 'Thriller'}  Face/Off (1997)

Page 48: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

BOILERPLATE

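from functools import partial
from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark shell
sql = SQLContext(sc)
connector = "org.apache.spark.sql.cassandra"
# load(table=...) reads one table of the movielens keyspace as a DataFrame
load = partial(sql.read.format(connector).load, keyspace="movielens")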

Page 49: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

LOAD THE DATA FRAMES

movies = load(table="movies")
ratings = load(table="ratings_by_movie")
users = load(table="users")

Page 50: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WHAT’S A GRAPHFRAME?

GraphFrame(v, e)

Page 51: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

VERTEX LIST

movies +

users

Page 52: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

MOVIE DATAFRAME

DataFrame[id: string, avg_rating: float, genres: array<string>, name: string, release_date: date, url: string, video_release_date: date]

Page 53: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

MOVIES AS LIST OF VERTICES

from pyspark.sql import functions as F

movies_v = movies.select("id", "name").\
    withColumn("label", F.lit("movie"))

graph elements have no type

Page 54: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

MOVIE VERTEX

[Row(id=u'6d318848…', name=u'Anna (1996)', label=u'movie')]

Page 55: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

USERS AS VERTICES

users_v = users.select("id", "name").\
    withColumn("label", F.lit("user"))

[Row(id=u'b52fcdfc…', name=u'Harrold Hills', label=u'user'),

Page 56: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CREATE THE FULL VERTEX LIST

vertices = movies_v.unionAll(users_v)

Page 57: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GET THE EDGES

edges = \
    ratings.select(ratings.movie_id.alias("dst"),
                   ratings.user_id.alias("src"),
                   "rating")

Page 58: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

CREATE THE GRAPH

from graphframes import GraphFrame

g = GraphFrame(vertices, edges)
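GraphFrames expects the vertex DataFrame to have an `id` column and the edge DataFrame to have `src` and `dst` columns, which is why the slides alias `user_id` and `movie_id`. The plain-Python sketch below shows that contract on small hand-made rows and checks for edges that point at missing vertices. The ids `u1`, `u2`, `m1` and the helper `dangling_edges` are hypothetical, and nothing here calls Spark.

```python
def dangling_edges(vertices, edges):
    """Return the edges whose src or dst does not appear in the vertex list."""
    ids = {v["id"] for v in vertices}
    return [e for e in edges if e["src"] not in ids or e["dst"] not in ids]


# Hypothetical rows shaped like the vertex and edge DataFrames in the slides.
vertices = [
    {"id": "u1", "name": "Jon", "label": "user"},
    {"id": "m1", "name": "Top Gun", "label": "movie"},
]
edges = [
    {"src": "u1", "dst": "m1", "rating": 5},
    {"src": "u2", "dst": "m1", "rating": 4},  # u2 is not in the vertex list
]

print(dangling_edges(vertices, edges))
```

A check like this can be worth running on a sample before building the graph, since ratings may reference users or movies that are absent from the other tables.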

Page 59: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

PATTERN MATCHING AKA MOTIFS

Page 60: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

(a)-[r]->(c)

MOTIFS

Page 61: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

(a)-[r]->(c); (b)-[s]->(c)

MOTIFS

Page 62: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

(a)-[r]->(c); (b)-[s]->(c)

[Diagram: a (name: jon) -[r]-> c (name: top gun) <-[s]- b (name: dani)]

Page 63: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

QUERY THE GRAPH

corated = g.find("(a)-[r]->(c); (b)-[s]->(c)").\
    filter("a.label = 'user'").\
    filter("b.label = 'user'").\
    filter("r.rating >= 4").\
    filter("s.rating >= 4")
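To show what the motif `(a)-[r]->(c); (b)-[s]->(c)` selects, here is a rough plain-Python equivalent on a two-edge toy graph. It is a sketch of the idea, not how GraphFrames evaluates motifs; the function name `corated_pairs` and the sample data are assumptions.

```python
from itertools import product

# Toy data: vertex name -> label, and (src, dst, rating) edges.
vertices = {"jon": "user", "dani": "user", "top gun": "movie"}
edges = [
    ("jon", "top gun", 5),
    ("dani", "top gun", 4),
]


def corated_pairs(vertices, edges, min_rating=4):
    """Match every pair of edges r = (a -> c) and s = (b -> c) that share c."""
    matches = []
    for (a, c1, r), (b, c2, s) in product(edges, repeat=2):
        if c1 != c2:
            continue  # the two edges must end at the same vertex c
        if vertices[a] != "user" or vertices[b] != "user":
            continue  # a.label = 'user' and b.label = 'user'
        if r >= min_rating and s >= min_rating:
            matches.append((a, b, c1))
    return matches


for match in corated_pairs(vertices, edges):
    print(match)
```

Note that this naive pairing also matches a user with themselves and returns each pair in both orders. If the real query behaves the same way, an extra filter such as `a.id != b.id` may be worth adding.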

Page 64: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WORKING WITH RESULTS

# count how many movies each pair of users both rated 4 or higher
user_movie_rating_freq = \
    corated.select(corated.a.id.alias("user"),
                   corated.b.id.alias("user2")).\
    groupBy("user", "user2").count()

[Row(user=u'87281e3a-3ca5-438b-917d-fb8d3d96da35', user2=u'e9e24ad2-457a-488c-bdd1-3cb0ea82a470', count=7)]
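The grouping step amounts to counting how often each (user, user2) pair appears among the motif matches. A minimal plain-Python sketch with `collections.Counter` is below; the match rows are made up, and skipping self-pairs is an addition that is not in the slides.

```python
from collections import Counter

# Hypothetical motif matches as (user, user2, movie) tuples.
matches = [
    ("u1", "u2", "m1"),
    ("u2", "u1", "m1"),
    ("u1", "u2", "m2"),
    ("u2", "u1", "m2"),
    ("u1", "u1", "m1"),  # a user matched with themselves
]


def corating_counts(matches):
    """Count shared highly rated movies per (user, user2) pair, ignoring self-pairs."""
    counts = Counter()
    for a, b, _movie in matches:
        if a != b:
            counts[(a, b)] += 1
    return counts


print(corating_counts(matches))
```

Both orderings of a pair are kept here, which would suit a Cassandra table partitioned by the first user: each user's partition then lists everyone they co-rated with.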

Page 65: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WRITE YOUR DATA FRAMES BACK TO CASSANDRA

Page 66: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

create table corated (
    user1 uuid,
    user2 uuid,
    count int,
    primary key (user1, user2)
);
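One way the write-back could look in PySpark is sketched below, using the same `org.apache.spark.sql.cassandra` data source the slides read from. The function name `write_corated`, the choice of the `movielens` keyspace for the `corated` table, and the rename of `user` to `user1` to match the table's columns are all assumptions, and this has not been run against a cluster.

```python
def write_corated(df, keyspace="movielens", table="corated"):
    """Append a (user, user2, count) DataFrame to a Cassandra table.

    `df` is expected to be a Spark DataFrame; keyspace and table names
    are assumptions for illustration.
    """
    (
        df.withColumnRenamed("user", "user1")  # table column is user1, not user
        .write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

With the names used in the slides, the call would be `write_corated(user_movie_rating_freq)`.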

Page 67: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

SHORTEST PATH

Page 68: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

WHAT IS THE CONNECTION FROM A TO B?

Page 69: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

[Diagram: @RUSTYRAZORBLADE -[COUSIN]-> @oscar_the_grouch -[ATTENDED]-> CAL POLY <-[ATTENDED]- @PATRICKMCFADIN]

MY COUSIN WENT TO SCHOOL WITH PATRICK
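The connection in this example can be found with a breadth-first search. The sketch below is plain Python on the slide's four vertices and treats every edge as undirected, which is an assumption; GraphFrames has its own `bfs` and `shortestPaths` methods for doing this on a real graph.

```python
from collections import deque

# Edges from the slide as (src, label, dst).
edges = [
    ("@rustyrazorblade", "cousin", "@oscar_the_grouch"),
    ("@oscar_the_grouch", "attended", "Cal Poly"),
    ("@patrickmcfadin", "attended", "Cal Poly"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search over the edges, ignoring direction and labels."""
    adjacency = {}
    for src, _label, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between start and goal


print(shortest_path(edges, "@rustyrazorblade", "@patrickmcfadin"))
```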

Page 70: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

GRAPH PROBLEMS ARE USUALLY JUST FEATURES

Page 71: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

USE CASSANDRA PLUS SPARK

Page 72: Connecting Cassandra Data with GraphFrames (Jon Haddad, The Last Pickle) | C* Summit 2016

@RUSTYRAZORBLADE