connecting cassandra data with graphframes (jon haddad, the last pickle) | c* summit 2016
TRANSCRIPT
JON HADDAD PRINCIPAL CONSULTANT, TLP
CONNECTING C* DATA WITH GRAPHFRAMES
loves
WHAT’S THE LAST PICKLE DO?
WE HELP MAKE YOU A GROUP OF EXPERTS
WHO IS THIS GUY?
15 YEARS EXPERIENCE
4 YEARS WITH CASSANDRA
CASSANDRA DATA MODELING
BIG DATA PROBLEMS
DATA MODELING RULES
DENORMALIZE
I MISS MY JOINS
PERFORMANCE & RELIABILITY
LOW FLEXIBILITY
WHAT IS A GRAPH?
[Diagram: vertices JON, TLP and NATE, joined by edges labeled "cofounded", "knows" and "works at"]
RELATIONSHIPS ARE ARBITRARY
A GRAPH IS TRAVERSED
[Diagram: JON -[works at]-> TLP, read as start -[follow]-> end]
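To make "a graph is traversed" concrete, here is a minimal sketch in plain Python rather than any graph database API. Only the edge "JON works at TLP" is stated in the slides; the pairings for "knows" and "cofounded" are assumptions, and the helper names `follow` and `traverse` are hypothetical.

```python
# Edges as (source, label, destination) triples.
EDGES = [
    ("JON", "works at", "TLP"),    # stated on the slide
    ("JON", "knows", "NATE"),      # assumed pairing
    ("NATE", "cofounded", "TLP"),  # assumed pairing
]


def follow(start, label=None):
    """Return vertices one hop from `start`, optionally limited to one edge label."""
    return [dst for src, lbl, dst in EDGES
            if src == start and (label is None or lbl == label)]


def traverse(start, hops):
    """Follow every outgoing edge, whatever its label, for `hops` steps."""
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for vertex in frontier for dst in follow(vertex)}
    return frontier


one_hop = follow("JON")
two_hops = traverse("JON", 2)
```

Because nothing is typed, `follow` needs no schema: leaving out the label follows every edge, which is the flexibility the deck contrasts with joins.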
ELEMENTS ARE NOT TYPED
FOLLOW ALL EDGES IN ANY QUERY
[Diagram: vertices JON, TLP and NATE, joined by edges labeled "cofounded", "knows" and "works at"]
TRAVERSALS ARE MORE FLEXIBLE THAN JOINS
THE WORLD IS A GRAPH
[Chart: "epic chart of flexibility", comparing Apache Cassandra and graph databases]
GRAPH IS COOL
RIGHT?
NEO4J, TITAN, DSE GRAPH
cookie monster photo?
GRAPH ALL THE THINGS!
TRADEOFFS
PERFORMANCE
graph
REMEMBER WHY WE DON’T DO JOINS?
DISTRIBUTED JOINS ARE HARD WORK
MORE WORK = SLOWER DATABASE
APPLICATION COMPLEXITY
DO I NEED GRAPH ALL THE TIME?
GRAPH QUERIES ON CASSANDRA?
GRAPH IS COOL FOR ANALYTICS
LET’S USE SPARK
cdm install movielens
CREATE TABLE movies (
    id uuid PRIMARY KEY,
    avg_rating float,
    genres set<text>,
    name text,
    release_date date,
    url text,
    video_release_date date
);
CREATE TABLE users (
    id uuid PRIMARY KEY,
    address text,
    age int,
    city text,
    gender text,
    name text,
    occupation text,
    zip text
);
CREATE TABLE ratings_by_movie (
    movie_id uuid,
    user_id uuid,
    rating int,
    ts int,
    PRIMARY KEY (movie_id, user_id)
);
RECOMMENDATION ENGINE
AWESOME
GET DATA INTO A GRAPH
[Diagram: JON -> TOP GUN, edge with label: rated, rating: 5]
DATAFRAMES
id                                    genres                            name
ae4f9269-5d62-4ad1-b87c-1b23962bb224  {'Drama'}                         Prefontaine (1997)
de9a14a9-6d6d-4573-b415-c8555e85d391  {'Drama'}                         Raging Bull (1980)
0b67d4e7-ee2b-47ab-9437-df0c793ea72a  {'Action', 'Sci-Fi', 'Thriller'}  Face/Off (1997)
BOILERPLATE
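from functools import partial
from pyspark.sql import SQLContext, functions as F
from graphframes import GraphFrame

# sc is the SparkContext provided by the pyspark shell
sql = SQLContext(sc)
connector = "org.apache.spark.sql.cassandra"
load = partial(sql.read.format(connector).load, keyspace="movielens")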
LOAD THE DATA FRAMES
movies = load(table="movies")
ratings = load(table="ratings_by_movie")
users = load(table="users")
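The boilerplate relies on `functools.partial` to pre-bind the keyspace, so each `load(...)` call only has to name a table. A small sketch of that mechanism, using a hypothetical stand-in function `fake_load` instead of a real Spark reader so it runs without a cluster:

```python
from functools import partial


def fake_load(**options):
    """Stand-in for a reader's load(): returns the options it was called with."""
    return options


# Pre-bind keyspace once; later calls supply only the table name.
load = partial(fake_load, keyspace="movielens")

movies_options = load(table="movies")
ratings_options = load(table="ratings_by_movie")
```

Each call ends up with both `keyspace` and `table` set, which is the pair of options the connector needs to locate a Cassandra table.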
WHAT’S A GRAPHFRAME?
GraphFrame(v, e)
VERTEX LIST
movies + users
MOVIE DATAFRAME
DataFrame[id: string, avg_rating: float, genres: array<string>, name: string, release_date: date, url: string, video_release_date: date]
MOVIES AS LIST OF VERTICES
movies_v = movies.select("id", "name").\
    withColumn("label", F.lit("movie"))
graph elements have no type
MOVIE VERTEX
[Row(id=u'6d318848…', name=u'Anna (1996)', label=u'movie')]
USERS AS VERTICES
users_v = users.select("id", "name").\
    withColumn("label", F.lit("user"))
[Row(id=u'b52fcdfc…', name=u'Harrold Hills', label=u'user'),
CREATE THE FULL VERTEX LIST
vertices = movies_v.unionAll(users_v)
GET THE EDGES
edges =\
    ratings.select(ratings.movie_id.alias("dst"),
                   ratings.user_id.alias("src"),
                   "rating")
CREATE THE GRAPH
g = GraphFrame(vertices, edges)
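GraphFrames expects the vertex DataFrame to carry an `id` column and the edge DataFrame to carry `src` and `dst` columns, which is why the ratings columns are aliased above. As a rough illustration of that contract, the sketch below checks plain Python dicts standing in for rows. The function name `dangling_edges` and the sample rows are hypothetical, and the check for unknown vertex ids is an extra sanity check added for illustration, not something the slides describe.

```python
def dangling_edges(vertices, edges):
    """Return edges whose src or dst is not a known vertex id.

    `vertices` and `edges` are lists of dicts standing in for DataFrame rows.
    """
    for vertex in vertices:
        if "id" not in vertex:
            raise ValueError("every vertex needs an 'id' field")
    for edge in edges:
        if "src" not in edge or "dst" not in edge:
            raise ValueError("every edge needs 'src' and 'dst' fields")
    ids = {vertex["id"] for vertex in vertices}
    return [e for e in edges if e["src"] not in ids or e["dst"] not in ids]


# Hypothetical rows shaped like the vertex and edge lists in the slides.
vertices = [
    {"id": "u1", "name": "jon", "label": "user"},
    {"id": "m1", "name": "top gun", "label": "movie"},
]
edges = [
    {"src": "u1", "dst": "m1", "rating": 5},
    {"src": "u2", "dst": "m1", "rating": 4},  # u2 is missing from vertices
]
bad = dangling_edges(vertices, edges)
```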
PATTERN MATCHING AKA MOTIFS
(a)-[r]->(c)
MOTIFS
(a)-[r]->(c); (b)-[s]->(c)
MOTIFS
[Diagram: a -[r]-> c <-[s]- b]
(a)-[r]->(c); (b)-[s]->(c)
[Diagram: vertices labeled name: jon, name: dani and name: top gun]
QUERY THE GRAPH
corated = g.find("(a)-[r]->(c); (b)-[s]->(c)").\
    filter("a.label = 'user'").\
    filter("b.label = 'user'").\
    filter("r.rating >= 4").\
    filter("s.rating >= 4")
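One way to read the motif `(a)-[r]->(c); (b)-[s]->(c)` is as a self-join of the edge list on the shared destination `c`. The sketch below mimics that in plain Python; it is an illustration of the idea, not how GraphFrames executes the query. The sample ratings and the name `corated_pairs` are hypothetical. Every source here is a user, so the label filters are implicit, and dropping rows where `a` and `b` are the same user is a choice of this sketch.

```python
from collections import defaultdict

# Hypothetical (src, dst, rating) edges, each pointing from a user to a movie.
edges = [
    ("jon", "top gun", 5),
    ("dani", "top gun", 4),
    ("jon", "face/off", 2),
    ("dani", "face/off", 5),
]


def corated_pairs(edges, min_rating=4):
    """Return (a, b, c) where users a and b both rated movie c >= min_rating."""
    users_by_movie = defaultdict(list)
    for user, movie, rating in edges:
        if rating >= min_rating:
            users_by_movie[movie].append(user)
    matches = []
    for movie, users in users_by_movie.items():
        for a in users:
            for b in users:
                if a != b:  # skip a user paired with themselves
                    matches.append((a, b, movie))
    return matches


matches = corated_pairs(edges)
```

Each pair shows up twice, once in each order, because the pattern is symmetric in `a` and `b`.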
WORKING WITH RESULTS
# a and b are the two users in the motif; c is the movie they both rated
user_movie_rating_freq = \
    corated.select(corated.a.id.alias("user"),
                   corated.b.id.alias("user2")).\
    groupBy("user", "user2").count()
[Row(user=u'87281e3a-3ca5-438b-917d-fb8d3d96da35', user2=u'e9e24ad2-457a-488c-bdd1-3cb0ea82a470', count=7)]
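The `groupBy(...).count()` step amounts to counting how many movies each pair of users has in common. A plain-Python sketch of the same aggregation, using hypothetical match tuples in place of the DataFrame:

```python
from collections import Counter

# Hypothetical motif matches as (user, user2, movie) tuples.
matches = [
    ("jon", "dani", "top gun"),
    ("dani", "jon", "top gun"),
    ("jon", "dani", "raging bull"),
    ("dani", "jon", "raging bull"),
]

# Equivalent of groupBy("user", "user2").count()
counts = Counter((user, user2) for user, user2, _movie in matches)

# Optional: keep one row per unordered pair by requiring user < user2.
unique_pairs = {pair: n for pair, n in counts.items() if pair[0] < pair[1]}
```

Whether to keep both orderings depends on how the result is read back: with `user1` as the partition key, keeping both lets you look up the neighbours of either user directly.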
WRITE YOUR DATA FRAMES BACK TO CASSANDRA
create table corated (
    user1 uuid,
    user2 uuid,
    count int,
    primary key (user1, user2)
);
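The slides show the target table but not the write itself. A minimal sketch of how the counts might be appended with the Spark Cassandra Connector's DataFrame writer is below. The function name `save_corated` is hypothetical, and the rename from `user` to `user1` is an assumption made so the DataFrame columns line up with the table definition.

```python
def save_corated(df, keyspace="movielens", table="corated"):
    """Append a DataFrame with columns (user, user2, count) to Cassandra."""
    # Assumption: rename "user" so the columns match the table's user1/user2.
    renamed = df.withColumnRenamed("user", "user1")
    (renamed.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace=keyspace, table=table)
        .save())
    return renamed
```

In the deck's session this would be called as `save_corated(user_movie_rating_freq)`.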
SHORTEST PATH
WHAT IS THE CONNECTION FROM A TO B?
[Diagram: @RUSTYRAZORBLADE -[COUSIN]-> @oscar_the_grouch -[ATTENDED]-> CAL POLY <-[ATTENDED]- @PATRICKMCFADIN]
MY COUSIN WENT TO SCHOOL WITH PATRICK
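The "what is the connection from A to B?" question can be sketched as a breadth-first search over the edges in the diagram. This is a plain-Python illustration that treats edges as undirected, which is an assumption of the sketch. GraphFrames offers its own `bfs` and `shortestPaths` methods for distributed graphs, which are not used here. The function name `shortest_path` is hypothetical.

```python
from collections import deque

# Edges from the slide as (source, label, destination) triples.
EDGES = [
    ("@rustyrazorblade", "cousin", "@oscar_the_grouch"),
    ("@oscar_the_grouch", "attended", "Cal Poly"),
    ("@patrickmcfadin", "attended", "Cal Poly"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of vertices on a shortest path."""
    neighbors = {}
    for src, _label, dst in edges:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)  # treat edges as undirected
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection found


path = shortest_path(EDGES, "@rustyrazorblade", "@patrickmcfadin")
```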
GRAPH PROBLEMS ARE USUALLY JUST FEATURES
USE CASSANDRA PLUS SPARK
@RUSTYRAZORBLADE