Insight Data Engineering Project

RedditR

Your personalized gateway to Reddit.com

Aravind Kumar Ramesh

Insight Data Engineering Fellow, New York

Motivation

In 2015

82.54 billion pageviews

73.15 million submissions, 725.85 million comments

What’s trending? Maximize content engagement

Personalized recommendation

1,034,259.8 MB of Reddit data

Data Pipeline

Challenges

◉ Restricted Reddit API

Solutions

◉ Restricted Reddit API - a multi-threaded API to scrape Reddit
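A minimal sketch of the multi-threaded scraping idea, assuming a thread pool fans requests out across subreddits. The names `scrape_subreddits`, `fetch_page` and `fake_fetch` are hypothetical and not from the project; `fetch_page` stands in for whatever HTTP call is actually made to Reddit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def scrape_subreddits(subreddits, fetch_page, max_workers=8, delay_s=0.0):
    """Fetch several subreddits concurrently.

    fetch_page is a hypothetical callable(subreddit) -> list of post dicts.
    In practice it would wrap an HTTP request and respect Reddit's rate limits.
    """
    results = {}
    errors = {}

    def worker(name):
        if delay_s:
            time.sleep(delay_s)  # crude politeness delay before each request
        return fetch_page(name)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, name): name for name in subreddits}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one subreddit fails
                errors[name] = exc
    return results, errors


def fake_fetch(name):
    # Stand-in for a real HTTP request (assumption, for illustration only).
    if name == "broken":
        raise RuntimeError("rate limited")
    return [{"subreddit": name, "title": f"post {i}"} for i in range(3)]


if __name__ == "__main__":
    posts, failed = scrape_subreddits(["python", "datascience", "broken"], fake_fetch)
    print(sorted(posts), sorted(failed))
```

Threads suit this workload because each request spends most of its time waiting on the network, so a failed or slow subreddit does not block the others.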

Challenges

◉ Generating recommendations using ALS

Solutions

◉ Generating recommendations using ALS

- ALS is compute intensive.

- Generate recommendations using the user graph instead.

Challenges

◉ Dealing with large data

Use Parquet

Original dataset: 1084.5 GB

Compressed Parquet: 187.8 GB

Queries ran 3x faster on Parquet.
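To put the two sizes side by side, the arithmetic below works out the compression ratio and the fraction of storage saved. Only the two sizes come from the slide; the variable names are illustrative.

```python
original_gb = 1084.5  # original dataset size, from the slide
parquet_gb = 187.8    # compressed Parquet size, from the slide

compression_ratio = original_gb / parquet_gb
space_saved = 1 - parquet_gb / original_gb

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"space saved: {space_saved:.1%}")
```

The Parquet copy is a little under six times smaller, which is a saving of roughly 83% of the original footprint.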

Solution

Table Design

PRIMARY KEY (author, created_utc)) WITH CLUSTERING ORDER BY (created_utc ASC)

Secondary Index

CREATE INDEX subreddit ON subredditinfo (subreddit);
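A toy in-memory model of what this key design implies, assuming `author` acts as the partition key and `created_utc` as the clustering column. The class `AuthorTimeline` and its methods are hypothetical. This is not Cassandra code and it simplifies real behaviour (for example, it does not overwrite rows that share a primary key).

```python
from bisect import insort
from collections import defaultdict


class AuthorTimeline:
    """Rows grouped by author and kept sorted by created_utc ascending."""

    def __init__(self):
        self._partitions = defaultdict(list)  # author -> sorted [(created_utc, body)]

    def insert(self, author, created_utc, body):
        # insort keeps each author's rows ordered by created_utc on write.
        insort(self._partitions[author], (created_utc, body))

    def by_author(self, author, since=None):
        """Return one author's rows in time order, optionally from `since` on."""
        rows = self._partitions.get(author, [])
        if since is None:
            return list(rows)
        return [row for row in rows if row[0] >= since]


if __name__ == "__main__":
    timeline = AuthorTimeline()
    timeline.insert("alice", 30, "third comment")
    timeline.insert("alice", 10, "first comment")
    timeline.insert("bob", 20, "hello")
    print(timeline.by_author("alice"))
```

Reads for a single author come back already ordered by time, which is the access pattern this key favours. Looking rows up by subreddit does not fit that layout, which is presumably why the secondary index is there.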

I am Aravind. I am here because I love data engineering and working with large-scale data. You can find me @aravindk1992

About Me

Bachelor’s in Telecommunication Engineering

Master’s in Computer Science from the State University of New York at Buffalo, New York

Any questions?

Thanks!

Backup slides!

User Graph

USER A (POSTS CONTENT ON REDDIT)

USER B (READS THE POST AND REPLIES TO THE POST)

USER B → USER A (INTERACTION)

Indegree: Influence

Outdegree: Activity

What are you most likely to like?

◉ Look at the indegree of all the nodes in a cluster/subreddit and rank them.

◉ For the top 10 nodes with the highest indegree, compute the outdegree to other clusters.

◉ You are more likely to like what the most influential users of your favourite subreddit engage with.
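The three steps above can be sketched as follows, assuming each interaction is recorded as a `(replier, poster, subreddit)` tuple, i.e. a directed edge from the replier to the poster. The function name `recommend_subreddits` and the sample data are hypothetical, and this is an illustration rather than the project's implementation.

```python
from collections import Counter, defaultdict


def recommend_subreddits(interactions, favourite, top_n=10):
    """Rank other subreddits by how much the favourite's top users engage there.

    interactions: iterable of (replier, poster, subreddit) tuples (assumed format).
    """
    indegree = Counter()              # replies received inside the favourite subreddit
    activity = defaultdict(Counter)   # user -> subreddit -> replies written (outdegree)

    for replier, poster, subreddit in interactions:
        activity[replier][subreddit] += 1
        if subreddit == favourite:
            indegree[poster] += 1

    # Step 1: rank users of the favourite subreddit by indegree (influence).
    influencers = [user for user, _ in indegree.most_common(top_n)]

    # Step 2: sum their outdegree into every other subreddit.
    scores = Counter()
    for user in influencers:
        for subreddit, count in activity[user].items():
            if subreddit != favourite:
                scores[subreddit] += count

    # Step 3: recommend the subreddits those users engage with most.
    return [subreddit for subreddit, _ in scores.most_common()]


if __name__ == "__main__":
    sample = [
        ("bob", "alice", "python"),
        ("carol", "alice", "python"),
        ("dave", "bob", "python"),
        ("alice", "erin", "datascience"),
        ("alice", "frank", "datascience"),
        ("alice", "gina", "scala"),
    ]
    print(recommend_subreddits(sample, "python", top_n=1))
```

This only counts edges, so it avoids the iterative matrix factorization that makes ALS expensive. It does so at the cost of ignoring each individual user's own history.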
