Rapid Data Exploration with Hadoop

Posted on 11-May-2015

Category:

Technology

DESCRIPTION

LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio. To develop these products rapidly, LinkedIn leverages a number of technologies, including open source, third-party solutions, and some it has had to invent along the way. This LinkedIn talk, given at the NYC Hadoop Meetup held 3/18 at ContextWeb, focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.

TRANSCRIPT

Rapid Data Exploration With Hadoop

Peter Skomoroch
Senior Data Scientist

@peteskomoroch

Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
  - Spatial Analytics Pig Code
  - Trend detection with Pig & Python
  - R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights

Connect the world’s professionals to make them more productive and successful

Professional Identity

LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009: 1 billion people searches
• Average age: 41
• Household income: $107,000
• 42% are “decision makers”

How International?
• More than 50% international (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)

How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client

Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile

What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...

Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles

Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook

Hadoop at LinkedIn

Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files
Example member definition: { 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
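The member definition above pairs field names with type names. As a rough illustration only, the sketch below checks a record against that kind of definition in Python. It is not Voldemort's serialization code: `TYPE_MAP`, `MEMBER_SCHEMA`, and `conforms` are hypothetical names, and the schema lists only the four fields visible on the slide, whose definition continues past what the transcript shows.

```python
# Hypothetical mapping from the slide's type names to Python types.
TYPE_MAP = {"int32": int, "string": str}

# Only the four fields shown on the slide; the original definition is longer.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    return all(
        isinstance(record[name], TYPE_MAP[type_name])
        for name, type_name in schema.items()
    )


good = {"member_id": 1, "first_name": "Ada", "last_name": "Lovelace", "age": 36}
bad = dict(good, age="36")  # age stored as a string instead of an integer
print(conforms(good, MEMBER_SCHEMA), conforms(bad, MEMBER_SCHEMA))
```

A check like this is the kind of cheap test that can run on a small sample before a full job is launched.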

Getting Data In
• From databases (user data, news, jobs, etc.)
  - Need a way to get data reliably and periodically
  - Need tests to verify data
  - Support for incremental replication
  - Solution: Transmogrify Driver Program
    - InputReader: JDBCReader, CSV Reader
    - Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks, etc.)
  - Web log files are rsynced and loaded into HDFS
  - Hadoop jobs for data cleaning and transformation
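The reader/writer split and the incremental replication requirement listed above can be sketched in a few lines of Python. This is not the Transmogrify code, which the transcript does not show. `CsvReader`, `ListWriter`, and `transfer` are hypothetical names, the in-memory writer stands in for an HDFS or JDBC writer, and using `member_id` as the incremental key is an assumption.

```python
import csv
import io


class CsvReader:
    """Hypothetical input reader: yields one dict per CSV row."""

    def __init__(self, text):
        self._text = text

    def rows(self):
        yield from csv.DictReader(io.StringIO(self._text))


class ListWriter:
    """Stand-in for an HDFS or JDBC writer; collects rows in memory."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def transfer(reader, writer, since_id=0):
    """Copy rows whose member_id exceeds since_id.

    Returns (rows copied, highest member_id seen) so that the next run
    can resume from the high-water mark.
    """
    copied = 0
    high_water = since_id
    for row in reader.rows():
        member_id = int(row["member_id"])
        if member_id > since_id:
            writer.write(row)
            copied += 1
            high_water = max(high_water, member_id)
    return copied, high_water


SOURCE = "member_id,first_name\n1,Ann\n2,Bob\n3,Cy\n"
full = ListWriter()
print(transfer(CsvReader(SOURCE), full))          # full load
delta = ListWriter()
print(transfer(CsvReader(SOURCE), delta, 2))      # incremental load
```

Returning the row count alongside the high-water mark gives a simple verification hook: the count can be compared against the source before the load is accepted.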

Getting Data Out

Giving Back: Open Source
http://sna-projects.com/sna/

Analytics Technologies

We Build Things With Data

Give smart people great tools, enable them to solve problems

Prototyping Culture

How does Hadoop enable rapid data exploration?

Pig for Spatial Analytics

US County HeatMap

Pig for Trend Detection

Python Streaming Script
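The script shown on this slide is not preserved in the transcript. As a minimal sketch of what a Python reducer for Hadoop Streaming could look like in a trend detection job, the code below assumes tab-separated input lines of the form `term<TAB>period<TAB>count`, grouped by term as Hadoop Streaming delivers them to a reducer. The input layout, the scoring rule (latest count divided by the mean of earlier counts), and the function names are all assumptions for illustration, not the talk's method.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split 'term<TAB>period<TAB>count' lines into typed tuples."""
    for line in lines:
        term, period, count = line.rstrip("\n").split("\t")
        yield term, period, int(count)


def trend_scores(lines):
    """Yield (term, score): latest count divided by the mean of earlier counts."""
    for term, group in groupby(parse(lines), key=lambda rec: rec[0]):
        # Periods such as '2010-W09' sort correctly as plain strings.
        counts = [count for _, _, count in sorted(group, key=lambda rec: rec[1])]
        if len(counts) < 2:
            continue  # no history to compare against
        baseline = sum(counts[:-1]) / (len(counts) - 1)
        if baseline == 0:
            continue
        yield term, counts[-1] / baseline


def run(stream=sys.stdin, out=sys.stdout):
    """Entry point when used as a streaming reducer: call run()."""
    for term, score in trend_scores(stream):
        out.write(f"{term}\t{score:.3f}\n")


sample = [
    "hadoop\t2010-W09\t10\n",
    "hadoop\t2010-W10\t10\n",
    "hadoop\t2010-W11\t30\n",
    "pig\t2010-W11\t5\n",
]
print(list(trend_scores(sample)))
```

Keeping the logic in functions that accept any iterable of lines means the same code can be tested on a small local sample and then run unchanged against `sys.stdin` on the cluster.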

Sort Output & Display

R Streaming Also Easy

*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/

Let’s Talk Data

Business is recognizing the importance of analytics

What data do we start with?

We can also leverage...
• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...

How do we think of Analytics?

Data Jujitsu

Lots of Medium can be more powerful than Big

>

Reconstruct Reality from Data Exhaust

Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
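One way to read "build smaller consistent samples" is to pick sample members by hashing their ID, so that every dataset keyed by member yields the same subset on every run. The sketch below shows that idea in Python. It is an interpretation, not something the slides spell out, and `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(member_id, percent=1):
    """Deterministically keep roughly `percent`% of ids, the same ones every run."""
    digest = hashlib.md5(str(member_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


ids = range(10000)
one_percent = {i for i in ids if in_sample(i, 1)}
ten_percent = {i for i in ids if in_sample(i, 10)}

# The smaller sample is always contained in the larger one, so code tested
# on 1% of members sees a strict subset of what the 10% run will see.
print(one_percent <= ten_percent)
```

Because membership depends only on the ID, joins between sampled tables (profiles, connections, page views) still line up, which random per-row sampling would not guarantee.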

Where did the bankers go?

We’re Hiring!

pskomoro@linkedin.com
@peteskomoroch

http://sna-projects.com/sna/
