rapid data exploration with hadoop

Rapid Data Exploration With Hadoop Peter Skomoroch Senior Data Scientist @peteskomoroch

Upload: peter-skomoroch

Post on 11-May-2015





DESCRIPTION

LinkedIn is the premier professional social network with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations consider data a service function, LinkedIn considers data a cornerstone of its product portfolio. To rapidly develop these products LinkedIn leverages a number of technologies including open source, 3rd party solutions, and some we've had to invent along the way. This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.

TRANSCRIPT

Page 1: Rapid Data Exploration With Hadoop

Rapid Data Exploration With Hadoop

Peter Skomoroch

Senior Data Scientist

@peteskomoroch

Page 2: Rapid Data Exploration With Hadoop

Outline

• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
  - Spatial Analytics Pig Code
  - Trend detection with Pig & Python
  - R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights

Page 3: Rapid Data Exploration With Hadoop

Connect the world’s professionals to make them more productive and successful

Page 4: Rapid Data Exploration With Hadoop

Professional Identity

Page 5: Rapid Data Exploration With Hadoop

LinkedIn at a glance

• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009: 1 billion people searches
• Average age: 41
• Household income: $107,000
• 42% are “decision makers”

Page 6: Rapid Data Exploration With Hadoop

How International?

• More than 50% international (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)

Page 7: Rapid Data Exploration With Hadoop

How do we keep the lights on?

• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client

Page 8: Rapid Data Exploration With Hadoop

Hadoop on LinkedIn

1,400+ members list “Hadoop” on their profile

What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...

Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles

Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook

Page 9: Rapid Data Exploration With Hadoop

Hadoop at LinkedIn

Page 10: Rapid Data Exploration With Hadoop

Voldemort Data Storage

Compact, compressed, binary data (something like Avro)

Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files

Example member definition:

{ 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32', … }
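To make the member definition concrete, here is a minimal Python sketch of checking a record against such a field-to-type mapping. It is not Voldemort's API: the `PYTHON_TYPES` table and the `conforms` helper are hypothetical names introduced for illustration, and only the four fields shown on the slide are included (the slide elides the rest).

```python
# Hypothetical mapping from the slide's type names to Python types.
# Only 'int32' and 'string' appear in the example definition.
PYTHON_TYPES = {"int32": int, "string": str}

# The four fields visible on the slide; the original definition continues.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    for field, type_name in schema.items():
        value = record[field]
        expected = PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # int32 values must fit in a signed 32-bit range.
        if type_name == "int32" and not -2**31 <= value < 2**31:
            return False
    return True
```

A check like this is the kind of test that can run before records are serialized, so that type errors surface at load time rather than inside a later job.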

Page 11: Rapid Data Exploration With Hadoop

Getting Data In

• From databases (user data, news, jobs, etc.)
  - Need a way to get data reliably and periodically
  - Need tests to verify data
  - Support for incremental replication
  - Solution: Transmogrify Driver Program
    - InputReader: JDBCReader, CSV Reader
    - Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks, etc.)
  - Weblog files are rsynced and loaded into HDFS
  - Hadoop jobs for data cleaning and transformation
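The reader/writer split described above can be sketched in a few lines of Python. This is not the Transmogrify code, which the transcript does not show; every function name here is hypothetical, a CSV stream stands in for the input reader, and a tab-separated text stream stands in for an HDFS writer. The sketch assumes each row carries an integer `id` column, included in `fields`, that serves as the incremental high-water mark.

```python
import csv
import io


def read_csv_rows(text_stream):
    """Input reader: yield each CSV row as a dict keyed by the header line."""
    yield from csv.DictReader(text_stream)


def verify_row(row, required_fields):
    """Data test: every required field must be present and non-empty."""
    return all(row.get(field) not in (None, "") for field in required_fields)


def write_tsv_rows(rows, fields, out_stream):
    """Output writer: emit tab-separated lines and return how many were written."""
    count = 0
    for row in rows:
        out_stream.write("\t".join(row[f] for f in fields) + "\n")
        count += 1
    return count


def run_driver(in_stream, out_stream, fields, last_seen_id=0):
    """Copy only rows with id > last_seen_id, skipping rows that fail the test.

    Returns (rows_written, new_high_water_mark) so the next run can resume
    from where this one stopped.
    """
    high = last_seen_id
    good = []
    for row in read_csv_rows(in_stream):
        if not verify_row(row, fields):
            continue
        row_id = int(row["id"])
        if row_id <= last_seen_id:
            continue
        high = max(high, row_id)
        good.append(row)
    return write_tsv_rows(good, fields, out_stream), high
```

Passing `io.StringIO` objects for both streams is enough to exercise the driver locally; persisting the returned high-water mark between runs is what makes the replication incremental.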

Page 12: Rapid Data Exploration With Hadoop

Getting Data Out

Page 13: Rapid Data Exploration With Hadoop

Giving Back: Open Source

http://sna-projects.com/sna/

Page 14: Rapid Data Exploration With Hadoop

Analytics Technologies

Page 15: Rapid Data Exploration With Hadoop

We Build Things With Data

Give smart people great tools, enable them to solve problems

Page 16: Rapid Data Exploration With Hadoop

Prototyping Culture

Page 17: Rapid Data Exploration With Hadoop

How does Hadoop enable rapid data exploration?

Page 18: Rapid Data Exploration With Hadoop

Pig for Spatial Analytics

Page 19: Rapid Data Exploration With Hadoop

US County HeatMap

Page 20: Rapid Data Exploration With Hadoop

Pig for Trend Detection

Page 21: Rapid Data Exploration With Hadoop

Python Streaming Script
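The script shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of what a streaming script for trend detection could look like, assuming it receives tab-separated lines of `term`, `previous_count`, `current_count` on standard input (for example via Pig's STREAM operator) and emits `term` and a score. The input layout, the `trend_score` formula (a smoothed ratio), and all function names are assumptions, not the speaker's method.

```python
import sys


def trend_score(previous, current, smoothing=1.0):
    """Smoothed growth ratio; smoothing keeps rare terms from dominating."""
    return (current + smoothing) / (previous + smoothing)


def process(lines):
    """Turn 'term<TAB>previous<TAB>current' lines into 'term<TAB>score' lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records instead of failing the job
        term, previous, current = parts
        try:
            score = trend_score(float(previous), float(current))
        except ValueError:
            continue  # skip records whose counts are not numeric
        yield "%s\t%.3f" % (term, score)


def main(in_stream=sys.stdin, out_stream=sys.stdout):
    """Entry point for use as a streaming script: stdin in, stdout out."""
    for out_line in process(in_stream):
        out_stream.write(out_line + "\n")
```

Saved as a script with a final call to `main()`, this reads records line by line and writes one scored line per valid record; sorting by the score column downstream would then surface the fastest-growing terms.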

Page 22: Rapid Data Exploration With Hadoop

Sort Output & Display

Page 23: Rapid Data Exploration With Hadoop

R Streaming Also Easy

*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/

Page 24: Rapid Data Exploration With Hadoop

Let’s Talk Data

Page 25: Rapid Data Exploration With Hadoop

Business is recognizing the importance of analytics

Page 26: Rapid Data Exploration With Hadoop

What data do we start with?

Page 27: Rapid Data Exploration With Hadoop

We can also leverage...

• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...

Page 28: Rapid Data Exploration With Hadoop

How do we think of Analytics?

Data Jujitsu

Page 29: Rapid Data Exploration With Hadoop

Lots of Medium can be more powerful than Big

Page 30: Rapid Data Exploration With Hadoop

Reconstruct Reality from Data Exhaust

Page 31: Rapid Data Exploration With Hadoop

Data Scientist Lessons

• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
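One way to read "build smaller consistent samples" is to pick the sample by hashing a stable key rather than by random draw, so the same members land in the sample on every run and across every dataset that shares the key. The sketch below illustrates that idea; the `in_sample` name, the salt value, and the choice of MD5 are assumptions for illustration, not something the talk specifies.

```python
import hashlib


def in_sample(key, percent, salt="sample-v1"):
    """Return True if key falls in a deterministic sample of roughly `percent`%.

    The decision depends only on (salt, key), so repeated runs and different
    datasets keyed by the same id select the same records.
    """
    digest = hashlib.md5((salt + ":" + str(key)).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent
```

For example, `[m for m in member_ids if in_sample(m, 1)]` keeps about 1% of members. Because each key maps to a fixed bucket, a 1% sample is always a subset of a 10% sample, which makes it easy to scale a test up without changing which records were already covered.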

Page 32: Rapid Data Exploration With Hadoop

Where did the bankers go?