meshwork - insight data engineering project
Motivation
• The internet is huge
• How does your page rank amongst others in your mesh?
• What is the reach of your website? Which pages are affecting your page rank?
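To make "page rank" concrete, here is a minimal single-machine sketch of the standard PageRank power iteration. It is not the project's Spark job, which the slides do not show. The function name `pagerank`, the damping factor of 0.85 (a common default), the iteration count, and the three-page toy mesh are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links maps each page to the list of pages it links to.
    damping=0.85 and iterations=50 are assumed defaults, not project settings.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly across its out-links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Pages without out-links spread their rank over all pages.
                dangling += damping * ranks[page]
        for page in pages:
            new_ranks[page] += dangling / n
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Hypothetical mesh: pages "a" and "b" both link to "c"; "c" links back to "a".
    toy_mesh = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(toy_mesh).items(), key=lambda kv: -kv[1]):
        print(page, round(rank, 4))
```

In this toy mesh, "c" ranks highest because two pages link to it, and "b" ranks lowest because nothing links to it. That is the sense in which the pages linking to you affect your page rank.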
Data Source
• Common Crawl Organization
• More than 7 years of web page data, over 500TB
• CC April 2015 web corpus ~168TB
• Processed ~445GB for project
• Readily available in S3
Scaling up Page Rank job…
spark-submit --class pageRank...
>3 mil records
• 4 nodes => 20 hours
• 6 nodes => 16 hours
• 8 nodes => 10 hours
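The runtimes above can be turned into speedup and efficiency figures with a little arithmetic. The sketch below takes the 4-node run as the baseline; the helper name `scaling_table` is hypothetical, and the hour figures are the rounded values from the slide, so treat the results as approximate.

```python
# Runtimes reported on the slide: cluster size (nodes) -> wall-clock hours.
RUNTIMES_HOURS = {4: 20, 6: 16, 8: 10}


def scaling_table(runtimes, baseline_nodes=4):
    """Return per-cluster speedup, efficiency, and node-hours vs. the baseline."""
    base_hours = runtimes[baseline_nodes]
    rows = []
    for nodes in sorted(runtimes):
        hours = runtimes[nodes]
        speedup = base_hours / hours      # how much faster than the baseline
        ideal = nodes / baseline_nodes    # speedup under perfect linear scaling
        rows.append({
            "nodes": nodes,
            "hours": hours,
            "speedup": speedup,
            "efficiency": speedup / ideal,  # 1.0 means perfectly linear scaling
            "node_hours": nodes * hours,    # total compute consumed
        })
    return rows


if __name__ == "__main__":
    for row in scaling_table(RUNTIMES_HOURS):
        print(row)
```

On these numbers, going from 4 to 6 nodes gives a 1.25x speedup (about 83% of ideal), while going from 4 to 8 nodes gives a 2x speedup, which matches linear scaling. Total compute is 80, 96, and 80 node-hours respectively.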
About Me
Justin Cano
UC Riverside, BS Computer Engineering
Previous work experience
• Software Engineer @
• Embedded Systems Developer @
• Software Engineer Intern @
Hobbies
I like building things!
• Hardware, software
• Learning and using new technologies
• Moviegoer
• Outdoor activities: biking, snowboarding
• Interests: design, app dev
• Favorite TV Shows: Futurama & The Daily Show