data infrastructure at linkedin

36
Data Infrastructure at LinkedIn Kapil Surlaker http://www.linkedin.com/in/kapilsurlaker @kapilsurlaker 1

Upload: amy-w-tang

Post on 11-May-2015

1.159 views

Category:

Technology


2 download

DESCRIPTION

This talk was given by Kapil Surlaker (Staff Software Engineer @ LinkedIn) at the 28th IEEE International Conference on Data Engineering (ICDE 2012).

TRANSCRIPT

Page 1: Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn Kapil Surlaker http://www.linkedin.com/in/kapilsurlaker @kapilsurlaker

1

Page 2: Data Infrastructure at LinkedIn

Outline

LinkedIn Products Data Ecosystem LinkedIn Data Infrastructure Solutions Next Play

2

Page 3: Data Infrastructure at LinkedIn

LinkedIn By The Numbers

150M + users* ~ 4.2B People Searches in 2011** >2M companies with LinkedIn Company Pages** 16 languages 75% of Fortune 100 Companies use LinkedIn to hire***

* As of February 9th 2012 ** As of December 31st 2011

*** As of September 30th 2011

3

Page 4: Data Infrastructure at LinkedIn

Broad Range of Products & Services

4

Page 5: Data Infrastructure at LinkedIn

5

User Profiles Large dataset Medium writes Very high reads Freshness <1s

Page 6: Data Infrastructure at LinkedIn

6

Communications Large dataset

High writes High reads

Freshness <1s

Page 7: Data Infrastructure at LinkedIn

People You May Know

7

Large dataset Compute intensive

High reads Freshness ~hrs

Page 8: Data Infrastructure at LinkedIn

LinkedIn Today

8

Moving dataset High writes High reads

Freshness ~mins

Page 9: Data Infrastructure at LinkedIn

Outline

LinkedIn Products Data Ecosystem LinkedIn Data Infrastructure Solutions Next Play

9

Page 10: Data Infrastructure at LinkedIn

Three Paradigms : Simplifying the Data Continuum

• Member Profiles

• Company Profiles • Connections • Communications

Online

• Linkedin Today

• Profile Standardization • News • Recommendations • Search • Communications

Nearline

• People You May Know

• Connection Strength • News • Recommendations • Next best idea

Offline

10

Activity that should be reflected immediately

Activity that should be reflected soon

Activity that can be reflected later

Page 11: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

11

Page 12: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

12

Page 13: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

13

Page 14: Data Infrastructure at LinkedIn

Databus : Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

14

Page 15: Data Infrastructure at LinkedIn

Databus at LinkedIn

15

DB

Bootstrap

Capture Changes

On-line Changes

On-line Changes

DB

Consistent Snapshot at U

Consumer 1 Consumer n

Client

Dat

abus

C

lient

Lib

Consumer 1 Consumer n

Dat

abus

C

lient

Lib

Client

Relay Event Win

Page 16: Data Infrastructure at LinkedIn

Databus at LinkedIn

16

DB

Bootstrap

Capture Changes

On-line Changes

On-line Changes

DB

Consistent Snapshot at U

Transport independent of data source: Oracle, MySQL, …

Transactional semantics In order, at least once delivery

Tens of relays Hundreds of sources Low latency - milliseconds

Consumer 1 Consumer n

Client

Dat

abus

C

lient

Lib

Consumer 1 Consumer n

Dat

abus

C

lient

Lib

Client

Relay Event Win

Page 17: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

17

Page 18: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

18

Page 19: Data Infrastructure at LinkedIn

Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

19

Page 20: Data Infrastructure at LinkedIn

• Pluggable components • Tunable consistency /

availability • Key/value model,

server side “views”

• 10 clusters, 100+ nodes • Largest cluster – 10K+ qps • Avg latency: 3ms • Hundreds of Stores • Largest store – 2.8TB+

Voldemort: Architecture

Page 21: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

21

Page 22: Data Infrastructure at LinkedIn

Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions

22

Page 23: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

23

Page 24: Data Infrastructure at LinkedIn

Kafka: Architecture

24

WebTier

Topic 1

Broker Tier

Push Events Topic 2

Topic N

Zookeeper Offset Management

Topic, Partition Ownership

Sequential write sendfile

Kaf

ka

Clie

nt L

ib

Consumers

Pull Events Iterator 1

Iterator n

Topic Offset

100 MB/sec 200 MB/sec

Page 25: Data Infrastructure at LinkedIn

Kafka: Architecture

25

WebTier

Topic 1

Broker Tier

Push Events Topic 2

Topic N

Zookeeper Offset Management

Topic, Partition Ownership

Sequential write sendfile

Kaf

ka

Clie

nt L

ib

Consumers

Pull Events Iterator 1

Iterator n

Topic Offset

100 MB/sec 200 MB/sec

Billions of Events, TBs per day 50K+ per sec at peak Inter and Intra-cluster replication End-to-end latency: few seconds

At least once delivery Very high throughput Low latency Durability

Page 26: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

26

Page 27: Data Infrastructure at LinkedIn

Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions

27

Page 28: Data Infrastructure at LinkedIn

Application View

28

Hierarchical data model

Rich functionality on resources Conditional updates Partial updates Atomic counters

Rich functionality within resource groups

Transactions Secondary index Text search

Page 29: Data Infrastructure at LinkedIn

Partitioning

29

Page 30: Data Infrastructure at LinkedIn

Node 3

Node 2

Espresso Partition Layout: Master, Slave

Cluster Manager

Partition: P.1 Node: 1 … Partition: P.12 Node: 3

Database

Node: 1 M: P.1 – Active … S: P.5 – Active …

Cluster Node 1

P.1 P.2

P.4

P.3

P.5 P.6

P.9 P.10

P.5 P.6

P.8

P.7

P.1 P.2

P.11 P.12

P.9 P.10

P.12

P.11

P.3 P.4

P.7 P.8 Master

Slave

3 Storage Engine nodes, 2 way replication

Page 31: Data Infrastructure at LinkedIn

Espresso: System Components

31

Page 32: Data Infrastructure at LinkedIn

Generic Cluster Manager: Helix

• Generic Distributed State Model • Centralized Config Management • Automatic Load Balancing • Fault tolerance • Health monitoring • Cluster expansion and

rebalancing • Espresso, Databus and Search • Open Source Apr 2012 • https://github.com/linkedin/helix

32

Page 33: Data Infrastructure at LinkedIn

Espresso@Linkedin

Launched first application Oct 2011 Open source 2012 Future

– Multi-Datacenter support – Global secondary indexes – Time-partitioned data

33

Page 34: Data Infrastructure at LinkedIn

LinkedIn Product Architecture

34

Page 35: Data Infrastructure at LinkedIn

Acknowledgments

Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar, Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez, Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White, Victor Ye, David Zhang, and Jason Zhang

35

Page 36: Data Infrastructure at LinkedIn

Questions?

36