How LinkedIn Scaled
TRANSCRIPT
Josh Clemm
www.linkedin.com/in/joshclemm
SCALING LINKEDIN: A BRIEF HISTORY
Scaling = replacing all the components of a car while driving it at 100mph
Via Mike Krieger, Scaling Instagram
LinkedIn started back in 2003 to connect to your network for better job opportunities.
It had 2,700 members in its first week.
First week growth guesses from founding team
[Chart: member count by year, 2003 to 2015, on an axis from 0M to 400M, with milestones like 32M marked along the way]
Fast forward to today...
LinkedIn is a global site with over 400 million members
Web pages and mobile traffic are served at tens of thousands of queries per second
Backend systems serve millions of queries per second
LINKEDIN SCALE TODAY
How did we get there?
Let's start from the beginning
[Diagram: the LEO application on top of a single DB]
Huge monolithic app called Leo
Java, JSP, Servlets, JDBC
Served every page, same SQL database
Circa 2003
LINKEDIN'S ORIGINAL ARCHITECTURE
So far so good, but two areas to improve:
1. The growing member-to-member connection graph
2. The ability to search those members
Needed to live in-memory for top performance
Used graph traversal queries not suitable for the shared SQL database
Different usage profile than other parts of the site
MEMBER CONNECTION GRAPH
MEMBER CONNECTION GRAPH
So, a dedicated service was created: LinkedIn's first service.
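To make that concrete, here is a toy sketch (in Java, not LinkedIn's actual graph service; the member IDs and depth limit are illustrative) of the kind of in-memory adjacency list and degree-of-separation traversal that is awkward to express against a shared SQL database:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy in-memory connection graph illustrating a traversal query:
// "what degree of separation is member B from member A?"
public class ConnectionGraph {
    private final Map<Long, Set<Long>> adjacency = new HashMap<>();

    public void addConnection(long a, long b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search: returns the degree, or -1 if not reachable within maxDegree hops.
    public int degreeOfSeparation(long from, long to, int maxDegree) {
        if (from == to) return 0;
        Set<Long> visited = new HashSet<>(Set.of(from));
        Queue<Long> frontier = new ArrayDeque<>(Set.of(from));
        for (int degree = 1; degree <= maxDegree; degree++) {
            Queue<Long> next = new ArrayDeque<>();
            for (Long member : frontier) {
                for (Long neighbor : adjacency.getOrDefault(member, Set.of())) {
                    if (neighbor == to) return degree;
                    if (visited.add(neighbor)) next.add(neighbor);
                }
            }
            frontier = next;
        }
        return -1;
    }
}

Holding the adjacency list in memory keeps each hop a cheap set lookup, which is why the connection graph earned its own service.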
Social networks need powerful search
Lucene was used on top of our member graph
MEMBER SEARCH
MEMBER SEARCH
LinkedIn's second service.
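As a rough illustration of the approach, the sketch below indexes a member document and queries it with Lucene (a recent Lucene version is assumed; the field names and in-memory directory are illustrative, not LinkedIn's search schema):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MemberSearchDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one member profile.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("name", "Josh Clemm", Field.Store.YES));
            doc.add(new TextField("headline", "Engineering at LinkedIn", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the index by a headline keyword.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new QueryParser("headline", analyzer).parse("engineering"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}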
LINKEDIN WITH CONNECTION GRAPH AND SEARCH
Circa 2004
[Diagram: LEO and its DB, the Member Graph service reached over RPC, and Lucene-based search fed by connection / profile updates]
Getting better, but the single database was under heavy load.
Vertically scaling helped, but we needed to offload the read traffic...
Master/slave concept
Read-only traffic served from replicas
Writes go to the main DB
An early version of Databus kept the DBs in sync
REPLICA DBs
[Diagram: the main DB feeding replica DBs through a Databus relay]
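A minimal sketch of the read/write split, assuming hypothetical JDBC URLs and credentials: writes always hit the main DB, while reads that can tolerate replication lag go to a replica.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch only: routes writes to the master and reads to a replica.
public class ReplicaAwareDataSource {
    private static final String MAIN_DB = "jdbc:mysql://main-db:3306/linkedin";
    private static final String REPLICA_DB = "jdbc:mysql://replica-db:3306/linkedin";

    // Writes (profile edits, new connections) always hit the main DB.
    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MAIN_DB, "app", "secret");
    }

    // Reads tolerate slight replication lag, so they can use a replica.
    public Connection readConnection() throws SQLException {
        Connection conn = DriverManager.getConnection(REPLICA_DB, "app", "secret");
        conn.setReadOnly(true); // guard against accidental writes
        return conn;
    }
}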
Good medium-term solution
We could vertically scale servers for a while
Master DBs have finite scaling limits
These days, LinkedIn DBs use partitioning
REPLICA DBs TAKEAWAYS
LINKEDIN WITH REPLICA DBs
Circa 2006
[Diagram: LEO writes to the main DB (R/W) and reads from replica DBs (R/O) kept in sync by a Databus relay; the Member Graph and Search services sit alongside, reached over RPC and fed by connection / profile updates]
As LinkedIn continued to grow, the monolithic application Leo was becoming problematic.
Leo was difficult to release and debug, and the site kept going down...
IT WAS TIME TO...
Kill Leo
SERVICE ORIENTED ARCHITECTURE
Circa 2008 on
Extracting services (Java Spring MVC) from the legacy Leo monolithic application
[Diagram: Public Profile Web App, Profile Service, Recruiter Web App, and yet another service being split out of LEO]
Goal: create a vertical stack of stateless services
Frontend servers fetch data from many domains, build HTML or JSON response
Mid-tier services host APIs, business logic
Data-tier or back-tier services encapsulate data domains
[Diagram: Profile Web App calls the Profile Service, which reads the Profile DB]
SERVICE ORIENTED ARCHITECTURE
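As a sketch of what one of those stateless mid-tier services might look like (Spring MVC style, with a hypothetical ProfileStore data-tier client; illustrative only, not LinkedIn's code):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical domain object and data-tier client, for illustration only.
record Profile(long memberId, String name, String headline) {}

interface ProfileStore {
    Profile findById(long memberId);
}

// Stateless mid-tier service: no per-request state is kept, so any
// instance can serve any call, and instances can be added freely.
@RestController
public class ProfileService {
    private final ProfileStore store; // injected data-tier client

    public ProfileService(ProfileStore store) {
        this.store = store;
    }

    // The frontend web app calls this API and assembles the HTML/JSON response.
    @GetMapping("/profiles/{memberId}")
    public Profile getProfile(@PathVariable long memberId) {
        return store.findById(memberId);
    }
}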
EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN
[Diagram: Browser / App calls a Frontend Web App, which fans out to content services (Groups, Connections, Profile) and mid-tier services; those call data services (e.g. Edu Data Service) backed by Hadoop, a DB, Voldemort, and Kafka]
PROS
Stateless services easily scale
Decoupled domains
Build and deploy independently
CONS
Ops overhead
Introduces backwards compatibility issues
Leads to complex call graphs and fanout
SERVICE ORIENTED ARCHITECTURE COMPARISON
bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l
756
In 2003, LinkedIn had one service (Leo)
By 2010, LinkedIn had over 150 services
Today in 2015, LinkedIn has over 750 services
SERVICES AT LINKEDIN
Getting better, but LinkedIn was experiencing hypergrowth...
Simple way to reduce load on servers and speed up responses
Mid-tier caches store derived objects from different domains, reduce fanout
Caches in the data layer
We use memcache, couchbase, even Voldemort
[Diagram: the Frontend Web App and Mid-tier Service each consult a cache, with another cache in front of the DB]
CACHING
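The mid-tier pattern here is essentially cache-aside: try the cache first, fall back to the authoritative store, then populate the cache. A minimal sketch, assuming a hypothetical Cache interface standing in for memcache or couchbase:

import java.util.Optional;
import java.util.concurrent.TimeUnit;

// Hypothetical cache client interface, standing in for memcache/couchbase.
interface Cache {
    Optional<String> get(String key);
    void put(String key, String value, long ttl, TimeUnit unit);
}

// Authoritative read from the database.
interface ProfileDao {
    String loadProfileJson(long memberId);
}

public class CachedProfileReader {
    private final Cache cache;
    private final ProfileDao dao;

    public CachedProfileReader(Cache cache, ProfileDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    public String profileJson(long memberId) {
        String key = "profile:" + memberId;
        return cache.get(key).orElseGet(() -> {
            String fresh = dao.loadProfileJson(memberId); // cache miss: hit the DB
            cache.put(key, fresh, 5, TimeUnit.MINUTES);   // short TTL bounds staleness
            return fresh;
        });
    }
}

Note the service behind the DAO must still handle full load when the cache is cold or unavailable, which is exactly the takeaway on the next slides.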
There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.
Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg
CACHING TAKEAWAYS
Caches are easy to add in the beginning, but complexity adds up over time.
Over time, LinkedIn removed many mid-tier caches because of the complexity around invalidation
We kept caches closer to the data layer
CACHING TAKEAWAYS (cont.)
Services must handle full load: caches improve speed, they are not permanent load-bearing solutions
We'll use a low-latency solution like Voldemort when appropriate and precompute results
LinkedIn's hypergrowth was extending to the vast amounts of data it collected.
Individual pipelines to route that data weren't scaling. A better solution was needed...
KAFKA MOTIVATIONS
LinkedIn generates a ton of data:
Pageviews
Edits on profiles, companies, schools
Logging, timing
Invites, messaging
Tracking
Billions of events every day
Separate and independently created pipelines routed this data
A WHOLE LOT OF CUSTOM PIPELINES...
A WHOLE LOT OF CUSTOM PIPELINES...
As LinkedIn needed to scale, each pipeline needed to scale.
Distributed pub-sub messaging platform as LinkedIn's universal data pipeline
KAFKA
[Diagram: frontend and backend services publish to Kafka, which feeds the DWH, monitoring, analytics, Hadoop, and Oracle]
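A minimal producer sketch using the Apache Kafka Java client shows the model: any service can publish events onto a shared topic and every downstream consumer reads from the same pipeline. The broker address, topic name, and page-view payload here are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by member id so events for one member land in one partition.
            producer.send(new ProducerRecord<>("page-views", "member-42",
                    "{\"page\":\"/in/joshclemm\",\"ts\":1438387200}"));
        }
    }
}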
BENEFITS
Enabled near-realtime access to any data source
Empowered Hadoop jobs
Allowed LinkedIn to build realtime analytics
Vastly improved site monitoring capability
Enabled devs to visualize and track call graphs
Over 1 trillion messages published per day, 10 million messages per second
KAFKA AT LINKEDIN
OVER 1 TRILLION PUBLISHED DAILY
Let's end with the modern years
Services extracted from Leo or created new were inconsistent and often tightly coupled
Rest.li was our move to a data-model-centric architecture
It ensured a consistent, stateless, RESTful API model across the company.
REST.LI
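Roughly, a Rest.li resource looks like the sketch below, modeled on Rest.li's documented "greetings" example; the namespace and the Greeting record are assumptions rather than LinkedIn production code. The framework derives the REST endpoints and JSON wire format from the resource and its data schema.

import com.linkedin.restli.server.annotations.RestLiCollection;
import com.linkedin.restli.server.resources.CollectionResourceTemplate;

// Greeting is assumed to be generated from a Pegasus (.pdsc) data schema.
@RestLiCollection(name = "greetings", namespace = "com.example.greetings")
public class GreetingsResource extends CollectionResourceTemplate<Long, Greeting> {

    // GET /greetings/{id} maps onto this method.
    @Override
    public Greeting get(Long id) {
        Greeting greeting = new Greeting();
        greeting.setId(id);
        greeting.setMessage("Hello, Rest.li");
        return greeting;
    }
}

Because every resource exposes the same GET/CREATE/UPDATE/DELETE contract over JSON/HTTP, teams could build against each other's APIs without bespoke client code.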
By using JSON over HTTP, our new APIs supported non-Java-based clients.
By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API.
Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day
REST.LI (cont.)
Rest.li Automatic API-documentation
REST.LI (cont.)
Rest.li R2/D2 tech stack
REST.LI (cont.)
LinkedIn's success with data infrastructure like Kafka and Databus led to the development of more and more scalable data infrastructure solutions...
It was clear LinkedIn could build data infrastructure that enables long-term growth
LinkedIn doubled down on infra solutions like:
Storage solutions: Espresso, Voldemort, Ambry (media)
Analytics solutions like Pinot
Streaming solutions: Kafka, Databus, and Samza
Cloud solutions like Helix and Nuage
DATA INFRASTRUCTURE
DATABUS
LinkedIn is a global company and was continuing to see large growth. How else to scale?
Natural progression of horizontal scaling
Replicate data across many data centers using storage technology like Espresso
Pin users to a geographically close data center
Difficult but necessary
MULTIPLE DATA CENTERS
Multiple data centers are imperative to maintain high availability.
You need to avoid any single point of failure, not just for each service, but for the entire site.
LinkedIn runs out of three main data centers, additional PoPs around the globe, and more coming online every day...
MULTIPLE DATA CENTERS
MULTIPLE DATA CENTERS
LinkedIn's operational setup as of 2015 (circles represent data centers, diamonds represent PoPs)
Of course LinkedIn's scaling story is never this simple, so what else have we done?
Each of LinkedIn's critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)
LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps.
WHAT ELSE HAVE WE DONE?
Re-architected frontend approach using:
Client templates
BigPipe
Play Framework
LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy
We improved the performance of servers with new hardware, advanced system tuning, and newer Java runtimes.
WHAT ELSE HAVE WE DONE? (cont.)
Scaling sounds easy and quick to do, right?
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid
Josh Clemm
www.linkedin.com/in/joshclemm
THANKS!
Blog version of this slide deck: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
Visual story of LinkedIn's history: https://ourstory.linkedin.com/
LinkedIn Engineering blog: https://engineering.linkedin.com
LinkedIn Open-Source: https://engineering.linkedin.com/open-source
LinkedIn's communication system slides, which include the earliest LinkedIn architecture: http://www.slideshare.net/linkedin/linkedins-communication-architecture
Slides which include the earliest LinkedIn data infra work: http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012
LEARN MORE
Project Inversion, an internal project to enable developer productivity (trunk-based model), faster deploys, unified services: http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin
LinkedIn's use of Apache Traffic Server: http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-traffic-server
Multi data center, testing failovers: https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promoted-angel-au-yeung
LEARN MORE (cont.)
History and motivation around Kafka: http://www.confluent.io/blog/stream-data-platform-1/
Thinking about streaming solutions as a commit log: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka enabling monitoring and alerting: http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection
Kafka enabling real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
Kafka's current use and future at LinkedIn: http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future
Kafka processing 1 trillion events per day: https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin
LEARN MORE - KAFKA
Open sourcing Databus: https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
Samza streams to help LinkedIn view call graphs: https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
Real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
Introducing the Espresso data store: http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
LEARN MORE - DATA INFRASTRUCTURE
LinkedIn's use of client templates:
Dust.js: http://www.slideshare.net/brikis98/dustjs
Profile: http://engineering.linkedin.com/profile/engineering-new-linkedin-profile
Big Pipe on LinkedIn's homepage: http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page
Play Framework:
Introduction at LinkedIn: https://engineering.linkedin.com/play/composable-and-streamable-play-apps
Switching to a non-blocking asynchronous model: https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell
LEARN MORE - FRONTEND TECH
Introduction to Rest.li and how it helps LinkedIn scale
http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale
How Rest.li expanded across the company
http://engineering.linkedin.com/restli/linkedins-restli-moment
LEARN MORE - REST.LI
JVM memory tuning: http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
System tuning: http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
Optimizing JVM tuning automatically: https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-its-difficulties-and-using-jtune-solution
LEARN MORE - SYSTEM TUNING
LinkedIn continues to grow quickly and there's still a ton of work we can do to improve.
We're working on problems that very few ever get to solve - come join us!
WE'RE HIRING
https://www.linkedin.com/company/linkedin/careers?trk=eng-blog