insight data engineering project: rettiwt

14
Daoyu Tu Insight Data Engineering, New York Summer 2016 Rebuilding the Twitter Platform

Upload: daoyu-tu

Post on 14-Apr-2017

112 views

Category:

Career


0 download

TRANSCRIPT

Page 1: Insight Data Engineering Project: Rettiwt

Daoyu Tu

Insight Data Engineering, New York

Summer 2016

Rebuilding the Twitter Platform

Page 2: Insight Data Engineering Project: Rettiwt

Why reinvent the wheel?

• Building a scalable & robust

social media system

• Delivering real-time data to

all the users

Page 4: Insight Data Engineering Project: Rettiwt

Feeds TableFeeds Table

Mr_Geek 8:30 Hello

8:30 Hello

8:30 Hello

.

.

.

7:00 JAVA

7:30 Dota2

8:00 Cute Dogs

6:00 C++

8:20 Rock music

.

.

.

.

.

.

. . .

. . .

. . .

Game_player

Pet_owner

Muscic_lover

Mr_Geek

Login

Hello

JAVA

8:30

7:00

RettiwtAll my users

Page 5: Insight Data Engineering Project: Rettiwt

A tweet comes in

Hello

userId

8:30

Stephen

Curry tweets:

Page 6: Insight Data Engineering Project: Rettiwt

A tweet comes in

Hello

userId

8:30

Followers Table

Followers: userId[]

Curry Curry's followers

Stephen

Curry tweets:

Page 7: Insight Data Engineering Project: Rettiwt

A tweet comes in

Hello

userId

8:30

Followers Table

Followers: userId[]

Feeds Table

Follower #1

.

.

.

Curry Curry's followers

7:00 JAVA

7:30 Dota2

8:00 Cute Dogs

6:00 C++

8:20 Rock music

.

.

.

. . .

. . .

. . .

All my users

Stephen

Curry tweets:

Curry's

followers

Follower #2

Follower #n

Other users

. . .

Page 8: Insight Data Engineering Project: Rettiwt

A tweet comes in

Hello

userId

8:30

Followers Table

Followers: userId[]

Feeds Table

Follower #1 8:30 Hello

8:30 Hello

8:30 Hello

.

.

.

Curry Curry's followers

7:00 JAVA

7:30 Dota2

8:00 Cute Dogs

6:00 C++

8:20 Rock music

.

.

.

.

.

.

. . .

. . .

. . .

All my users

Stephen

Curry tweets:

Curry's

followers

Follower #2

Follower #n

Other users

Page 9: Insight Data Engineering Project: Rettiwt

Pipeline

RettiwtTwitter

Page 10: Insight Data Engineering Project: Rettiwt

Benchmarking

● Save the tweet or tweet Id?

● Redis or Cassandra for followers

table?

Tweets

tid: Long

tweet: String

vs

Feeds

uid: Long

tweet: String

time: timestamp

Denormalized Schema

Results Norm Denorm

Writing 1974 recs/s 691 recs/s

Reading 14ms for 20

tweets

10ms for 20

tweets

Feeds

uid: Long

tid: Long

time: timestamp

Normalized

Schema

Page 11: Insight Data Engineering Project: Rettiwt

Benchmarking

● Save the tweet or tweet Id?

● Redis or Cassandra for followers

table?

Results Redis Cassandra

Tweets 24.4 tweets/s 21.9 tweets/s

Redis Cassandra

• Fast in-memory

reads

• Pure key-value

storage

• Fault tolerance

• More storage

Page 12: Insight Data Engineering Project: Rettiwt

Benchmarking

● Save the tweet or tweet Id?

● Redis or Cassandra for followers

table?

Results Redis Cassandra

Tweets 24.4 tweets/s 21.9 tweets/s

Redis Cassandra

• Fast in-memory

reads

• Pure key-value

storage

• Fault tolerance

• More storage

Results Norm

Writing 1974 recs/s

Reading 14ms for 20

tweets

Page 13: Insight Data Engineering Project: Rettiwt

About Me

• Daoyu Tu

• Computer Science B.S & M.S.

• Web Developer

Page 14: Insight Data Engineering Project: Rettiwt

Some Stats

2 clusters of 4 m4.xlarge nodes on AWS, one for Cassandra, one for Kafka,

Spark & Redis

100,000 users with average degree of 94

Writing 1974 records, processing 24.4 tweets per second