meshwork - insight data engineering project

RANKING THE INTERNETZ WITH MESHWORK Justin Cano Insight Data Engineering Fellow

Upload: justin-cano

Post on 14-Aug-2015

Motivation
• The internet is huge
• How does your page rank amongst others in your mesh?
• What is the reach of your website?
• Which pages are affecting your page rank?

Data Source
• Common Crawl Organization
• More than 7 years of web page data, over 500TB
• CC April 2015 web corpus ~168TB
• Processed ~445GB for project
• Readily available in S3

Meshwork – your mesh in a network http://www.jcano.me/meshwork

Pipeline
[Diagram: Data from S3 (source of truth) · Raw Data (WARC format) · Extraction · Edge List · … · REST]
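The extraction step, which turns raw page data into an edge list, could look roughly like the sketch below. This is not the project's actual code: it assumes the HTML body of a page has already been pulled out of a WARC record, it builds host-level edges (an assumption; the slides do not say whether edges are per page or per host), and the names `LinkCollector` and `extract_edges` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_edges(page_url, html):
    """Return sorted, de-duplicated (source_host, target_host) edges for one page."""
    collector = LinkCollector()
    collector.feed(html)
    source = urlparse(page_url).netloc
    edges = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before taking the host.
        target = urlparse(urljoin(page_url, href)).netloc
        # Skip links without a host (mailto:, javascript:) and self-links.
        if target and target != source:
            edges.add((source, target))
    return sorted(edges)


# Made-up sample page, for illustration only.
sample_html = """
<a href="http://example.org/a">A</a>
<a href="/local">local</a>
<a href="mailto:me@example.com">mail</a>
<a href="https://example.net/b?x=1">B</a>
<a href="http://example.org/c">C</a>
"""
edges = extract_edges("http://example.com/index.html", sample_html)
print(edges)
```

De-duplicating per page keeps the edge list small; whether repeated links should count more than once is a design choice the slides do not address.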

Data Flow

Link edge data (vertexId, pageRank)
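To illustrate how link edge data becomes (vertexId, pageRank) pairs, here is a minimal plain-Python sketch of PageRank by power iteration. It is not the Spark job from the slides; the damping factor of 0.85, the iteration count, the function name `page_rank`, and the tiny sample graph are all assumptions made for illustration.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly over all vertices.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for src in vertices:
            if out_links[src]:
                # Each vertex splits its damped rank equally among its out-links.
                share = damping * ranks[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_ranks[dst] += share
        ranks = new_ranks
    # Highest-ranked vertex first.
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)


# Made-up edge list: "c" is linked to by three of the four vertices.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
results = page_rank(edges)
for vertex_id, rank in results:
    print(vertex_id, round(rank, 4))
```

In this sample the vertex with the most inbound links ("c") ranks first and the vertex with none ("d") ranks last, which is the kind of ordering the (vertexId, pageRank) output is meant to expose.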

Scaling up Page Rank job… (>3 mil records)
spark-submit --class pageRank...

• 4 nodes => 20 hours
• 6 nodes => 16 hours
• 8 nodes => 10 hours
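The reported run times can be turned into speedup and parallel efficiency figures. The sketch below uses only the three measurements from the slide and takes the 4-node run as the baseline, which is an assumption since no single-node time is given.

```python
# Wall-clock times reported on the slide, keyed by cluster size (nodes -> hours).
runtimes_hours = {4: 20, 6: 16, 8: 10}

baseline_nodes = 4  # assumed baseline: the smallest cluster that was measured
baseline_hours = runtimes_hours[baseline_nodes]

scaling = {}
for nodes, hours in sorted(runtimes_hours.items()):
    speedup = baseline_hours / hours   # how much faster than the baseline run
    ideal = nodes / baseline_nodes     # speedup expected under perfect linear scaling
    efficiency = speedup / ideal       # fraction of the ideal speedup achieved
    scaling[nodes] = (speedup, efficiency)
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Relative to 4 nodes, going to 6 nodes gives a 1.25x speedup (about 83% of linear), while 8 nodes gives 2x, which matches linear scaling for a doubled cluster. With only one measurement per cluster size, these figures are indicative rather than conclusive.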

About Me
Justin Cano
UC Riverside, BS Computer Engineering

Previous work experience
• Software Engineer @
• Embedded Systems Developer @
• Software Engineer Intern @

Hobbies
• I like building things! Hardware, software
• Learning and using new technologies
• Moviegoer
• Outdoor activities: biking, snowboarding
• Interests: design, app dev
• Favorite TV Shows: Futurama & The Daily Show