
Big Data Analytics in LinkedIn

by

Danielle Aring & William Merritt


Brief History of LinkedIn

- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)

- 2005: Introduced first business lines: Jobs and Subscriptions

- 2006: Launched public profiles (achieved portability/new features)

- 2008: LinkedIn goes GLOBAL! (https://business.linkedin.com/)

- 2012: Site transformation/rapid growth

- 2013: ~225 million members (27% of LinkedIn subscribers are recruiters)

- 2014: Next decade focused on a map of the digital economy


Three Major Data Dimensions @LinkedIn


LinkedIn Challenges for Web-scale OLAP

● Horizontally scalable
  ○ currently over 200 million users
  ○ adding 2 new members per second

● Quick response time to user queries

● High availability

● High read & write throughput (billions of monthly page views)

● Response time depends heavily on the slowest node, as data is spread across various nodes


Current OLAP Solutions - not suited for high-traffic websites

● What is OLAP? Online Analytical Processing
  ○ long transactions
  ○ complex queries
  ○ mining and analyzing large amounts of data
  ○ infrequent updates of data

● Traditional Business Intelligence tools (e.g. SAP, Oracle)
  ○ retrieve & consolidate partial results across nodes (causing slow responses)

● Distributed solutions (problems with latency, availability, and cost)

● Materialized cubes (loading billions of page views makes the load too high)


Avatara: Solution for Web-scale Analytics Products

● Provides a fast, scalable OLAP system
  ○ handles the many-small-cubes scenario
  ○ simple grammar for cube construction and querying at scale
  ○ shards cube dimensions into a key-value model
  ○ leverages a distributed key-value store for low-latency, high-availability access to cubes
  ○ leverages Hadoop for joins

● Two examples of analytics features:
  ○ WVMP - cube sharded by member ID
    ■ Who’s Viewed My Profile? (WVMP)
  ○ WVTJ - cube sharded across jobs
    ■ Who’s Viewed This Job? (WVTJ)


Avatara: Solution (cont’d)

● Sharding (i.e. horizontal scaling)
  ○ divides the data set and distributes the data over multiple servers; each shard is an independent database, and together the shards make up a single logical database
    ■ sharding on a primary key turns a big cube into many smaller ones

● Storing a cube’s data in one location means it can be retrieved with a single disk fetch

● Offline Batch Engine
  ○ high throughput
  ○ batch processing (Hadoop jobs)

● Online Query Engine
  ○ low latency, high availability
  ○ key-value paradigm for storing data (Voldemort)


Avatara: Architecture

(architecture diagram)


Avatara: Offline Batch Engine - Three Phases (driven by a simple configuration file)

● Preprocessing
  ○ prepares the data
  ○ uses built-in functions to roll up data
  ○ customized scripting for further processing

● Projections and Joins
  ○ builds the dimension & fact tables
  ○ a join key ties the dimension & fact tables together

● Cubification (see the sketch after this list)
  ○ partitions the data by the cube shard key & produces small cubes
  ○ data can be retrieved in a single disk fetch for faster responses
  ○ cubes are bulk-loaded into a distributed key-value store (i.e. Voldemort)
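Since the joins and cubification run as Hadoop jobs, the two later phases can be pictured as a SQL-style join plus aggregation. A minimal sketch in Hive-style SQL, where every table and column name (profile_view_fact, member_dim, viewee_id, and so on) is hypothetical rather than taken from the paper:

-- Join a fact table of profile views to a member dimension table,
-- then group by the cube's shard key (the viewed member's ID) so each
-- member's rows form one small cube. Names are illustrative only.
SELECT f.viewee_id,          -- shard key: one small cube per member
       d.industry,           -- dimension attribute pulled in by the join
       count(*) AS views
FROM profile_view_fact f
JOIN member_dim d ON (f.viewer_id = d.member_id)
GROUP BY f.viewee_id, d.industry;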


Avatara: Online Query Engine

● Serves queries in real time

● Retrieves & processes data from key-value store (i.e. Voldemort)

● Fast retrieval because of compact cubes per sharded key (i.e. member_id)

● SQL-like syntax for clients

● Supports select, where, group-by, having, order-by, etc. (see the example below)

● Simplifies development for client developers
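For instance, a client asking “who’s viewed my profile, grouped by industry?” might issue something like the following. This is a hypothetical query in the engine’s SQL-like style; the cube and column names (wvmp_cube, viewer_industry, member_id) are illustrative, not from the paper:

-- A single small cube is fetched via the shard key (member_id),
-- then grouped and ordered in the query engine.
SELECT viewer_industry, count(*) AS views
FROM wvmp_cube
WHERE member_id = 12345
GROUP BY viewer_industry
ORDER BY views DESC;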


Cube Thinning

● Avatara’s mechanism for thinning cubes too large to process on page load (such as those for President Obama or LeBron James)

● Allows developers to do the following (illustrated below):
  ○ set priorities and constraints
    ■ on dimensions aggregated to a specific value (such as an “other” category)
  ○ drop data across pre-defined dimensions
    ■ ex: WVMP can opt to drop data across the time dimension
      ● resulting in a shorter history!
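As a rough illustration of aggregating a dimension to an “other” category, in the same Hive-style SQL as before (the table and column names are hypothetical):

-- Keep the top 20 viewer companies and fold the long tail into "other",
-- so the per-member cube stays small enough to load on page view.
SELECT CASE WHEN rank <= 20 THEN company ELSE 'other' END AS company,
       sum(views) AS views
FROM (SELECT company, views,
             rank() OVER (ORDER BY views DESC) AS rank
      FROM wvmp_company_views) t
GROUP BY CASE WHEN rank <= 20 THEN company ELSE 'other' END;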


In Summary

● Avatara has been running at LinkedIn for several years (an in-house OLAP system)
● Allows developers to build OLAP cubes with a single configuration file
● Hybrid offline/online strategy combined with sharding into a key-value store
● Powers large web-scale applications such as WVMP, WVTJ, and Jobs You May Be Interested In
● Uses Hadoop as its batch computing infrastructure
● SQL-like query interaction
● The Hadoop batch engine can handle TBs of data and process them in a matter of hours
● Voldemort can respond to online queries in milliseconds

● Future Work:
  ○ near real-time cubing
  ○ streaming joins
  ○ dimension and schema changes


BIG DATA PROJECT


Data Mining with LinkedIn Using AJAX Calls to the REST API

Overview:

1. Extract a large quantity of data from LinkedIn using AJAX calls to the REST API
2. Transform the data into structured CSV file format via scripts
3. Create tables in Hive, a SQL-on-Hadoop warehouse installed on top of HDFS
4. Query the database to make analytic insights on people, jobs, and companies
5. Visualize the Hive queries via Tableau


Issue: Extracting Data From LinkedIn API


● Followed the instructions on LinkedIn Developer: Authenticating with OAuth 2.0
  ○ Success: configuring the app, requesting the authorization code
  ○ Fail: exchanging the authorization code for an access token (INVALID)

How to Extract Data From LinkedIn When Refused by LinkedIn API?


● Unable to download streamed data using CONVENTIONAL tools

● Solution: data extraction via AJAX calls to the REST API
  ○ Jase Clamp tutorial on YouTube: “How to Extract Data from LinkedIn”

Tools For Data Extraction/Transformation


● Firefox Web Console
● Firebug
● JavaScript AJAX jQuery scripts
  ○ company, person, and jobs

★ Since we cannot stream data directly into HDFS, transformation to a structured file format was done externally!

REST API, AJAX, jQuery

● REST API: REpresentational State Transfer
  ○ stateless
  ○ separates server from client
  ○ leverages a layered system

● jQuery: JavaScript and DOM manipulation library
  ○ makes client-side scripting easier
  ○ simplifies syntax for finding, selecting, and manipulating DOM elements

● AJAX request: (Asynchronous JavaScript and XML) ~ XMLHttpRequest
  ○ uses client-side scripting to exchange data with a web server
  ○ request types: GET, POST, PUT, DELETE
  ○ loads data without a full page reload


// Search-results URL captured from the Firebug console (step 6 of the
// walkthrough); page_num is spliced with the loop variable i so the
// script can request every page of results in turn.
theurl='https://www.linkedin.com/vsearch/jj?orig=TRNV&rsid=3824972411448555669155&trk=vsrp_jobs_sel&trkInfo=VSRPsearchId%3A3824972411448555665180,VSRPcmpt%3Atrans_nav&locationType=I&countryCode=us&openFacets=L,C&page_num='+i+'&pt=jobs&rnd=1448555696180';

Retrieving Data Structure


Structure of Companies Data


Structure of Jobs Data


Structure of Person Data


Data Extraction/Transformation Steps

1. Log in to a LinkedIn account
2. Create a search (on Jobs, Companies, People, etc.)
3. Right-click and select Inspect Element with Firebug (the console should display)
4. Code/paste the script into the right side of the console window and move to the “All” tab
5. Navigate to page 2 of the search results
6. Locate the GET REST API call (in the console window), right-click, and copy its location
7. Paste the call into the “theurl” variable inside the script
8. Change page_num=2 in the URL to page_num='+i+'
9. Run the script (parses results into comma-separated JSON)
10. Navigate to the output in the console; copy and paste the data into a text editor
11. Remove empty lines and save as CSV


Hadoop and Hive ~ versions 1.2.1

★ Hive chosen because the data was transformed into structured CSV


Hive ~ Table Creation and Data Upload

★ Created 3 tables: companies, jobs, person

~ Hive HQL: similar to SQL; DDL statements for table creation and DML for inserts
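The slides don’t show the exact DDL; a minimal sketch for the companies table, assuming the columns used in the queries that follow and a hypothetical CSV path:

-- Columns inferred from the later queries; the CSV was prepared
-- externally, so fields are comma-delimited text.
CREATE TABLE companies (
  name STRING,
  followercount INT,
  location STRING,
  jobcount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Bulk-load the externally prepared CSV (the path is hypothetical).
LOAD DATA LOCAL INPATH '/usr/local/data/companies.csv' INTO TABLE companies;

The jobs table (companyName, jobTitle, jobRecency, location) and the person table (firstName, headline, connectionCount, imageUrl) follow the same pattern.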


HQL (Hive Query Language)

● Used HQL queries to derive insights from our data, including:
  ○ top companies with the highest number of followers
  ○ top locations with the highest job count
  ○ job title and count per location
  ○ top job titles recently listed
  ○ locations of jobs listed “1 day ago”
  ○ comparison of the number of connections of people with and without a profile image
  ○ comparison of profile headlines with the highest connection count vs. those with a lower connection count

● Query visualization done in Tableau

insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ','
Select followercount, name
FROM (select followercount, name, rank() over (ORDER BY followercount DESC) as rank from companies) ranked_followers
WHERE ranked_followers.rank < 10 ORDER BY followercount DESC;


Top companies with the highest number of followers

(chart: F1 ~ # of followers)

insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ','
Select location, jobcount
FROM (select location, rank() over (ORDER BY jobcount DESC) as rank, jobcount from companies) ranked_jobs
WHERE ranked_jobs.rank < 51 ORDER BY location, jobcount DESC;


Top locations with the highest number of jobs

(chart: F2 ~ # of jobs)

insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ','
SELECT c.location, j.jobTitle
FROM companies c left outer join jobs j on (c.location = j.location);


Join of the companies and jobs tables, selecting location and job title (looking at the number of jobs listed in each area)

insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ','
SELECT companyName, jobTitle, jobRecency
FROM (select companyName, jobTitle, rank() over (ORDER BY jobRecency DESC) as rank, jobRecency from jobs) ranked_jobTitles
WHERE ranked_jobTitles.rank < 11 ORDER BY jobTitle, jobRecency DESC;


Top Job titles recently listed

insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ','
select location, companyName, jobTitle from jobs where jobRecency="1 day ago";


Locations of jobs listed “1 day ago”

insert overwrite local directory '/usr/local/hql'
select count(*), sum(connectionCount) from person where imageUrl != "undefined";

insert overwrite local directory '/usr/local/hql'
select count(*), sum(connectionCount) from person where imageUrl = "undefined";


● Comparison: number of connections of people with and without a profile photo

● ratio 5 : 454

● On average, those:
  ○ without a profile pic: ~470 connections
  ○ with a profile pic: ~394 connections

● A 76-connection difference!
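The averages above were derived from count(*) and sum(connectionCount); as an alternative sketch (not the query we ran), Hive can compute them directly in one statement:

-- Average connection count per group, computed in one query instead of
-- dividing sum(connectionCount) by count(*) by hand.
select case when imageUrl != "undefined" then 'with photo' else 'without photo' end as photo_status,
       count(*) as people,
       avg(connectionCount) as avg_connections
from person
group by case when imageUrl != "undefined" then 'with photo' else 'without photo' end;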


insert overwrite local directory '/usr/local/hql'
select connectionCount, firstName, headline from person where connectionCount > 500;

Profile Headlines with Highest Connections

insert overwrite local directory '/usr/local/hql'
select connectionCount, firstName, headline from person where connectionCount < 200;


Profile Headlines with Lowest Connections

Interested in trying on your own?

Links:

● Firebug add-on for Firefox:
  ○ https://addons.mozilla.org/en-us/firefox/addon/firebug/

● Jase Clamp tutorial “Extracting Data From LinkedIn”:
  ○ https://www.youtube.com/watch?v=S-9BWrtxoDw

● Data extraction script on GitHub:
  ○ https://gist.github.com/jaseclamp/2c74062bac1cc4dd929f

● Tableau download:
  ○ http://www.tableau.com/products/desktop/download?os=windows


Sources

1. http://vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf
2. http://www.slideshare.net/liliwu/avatara-olap-for-webscale-analytics-products
3. https://ourstory.linkedin.com/#year-2004
4. http://www.slideshare.net/MichaelLi17/how-business-analytics-drives-business-value-teradata-partners-conference-nashvile-2014?next_slideshow=1
5. https://engineering.linkedin.com/olap/avatara-olap-web-scale-analytics-products
6. https://www.youtube.com/watch?v=9s-vSeWej1U


THANK YOU
