How LinkedIn Scaled
TRANSCRIPT
Josh Clemm
www.linkedin.com/in/joshclemm
SCALING LINKEDIN: A BRIEF HISTORY
Scaling = replacing all the components of a car while driving it at 100mph
Via Mike Krieger, Scaling Instagram
LinkedIn started back in 2003 to connect to your network for better job opportunities.
It had 2,700 members in its first week.
First week growth guesses from founding team
[Chart: member count by year, 2003 to 2015, on an axis from 0M to 400M, with milestones like 32M marked along the way]
Fast forward to today...
LinkedIn is a global site with over 400 million members
Web pages and mobile traffic are served at tens of thousands of queries per second
Backend systems serve millions of queries per second
LINKEDIN SCALE TODAY
How did we get there?
Let's start from the beginning
[Diagram: the LEO application on top of a single DB]
Huge monolithic app called Leo
Java, JSP, Servlets, JDBC
Served every page, same SQL database
Circa 2003
LINKEDIN'S ORIGINAL ARCHITECTURE
So far so good, but two areas to improve:
1. The growing member-to-member connection graph
2. The ability to search those members
Needed to live in-memory for top performance
Used graph traversal queries not suitable for the shared SQL database
Different usage profile than other parts of the site
MEMBER CONNECTION GRAPH
MEMBER CONNECTION GRAPH
So, a dedicated service was created: LinkedIn's first service.
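To make that concrete, here is a toy sketch (in Java, not LinkedIn's actual graph service; the member IDs and depth limit are illustrative) of the kind of in-memory adjacency list and degree-of-separation traversal that is awkward to express against a shared SQL database:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy in-memory connection graph illustrating a traversal query:
// "what degree of separation is member B from member A?"
public class ConnectionGraph {
    private final Map<Long, Set<Long>> adjacency = new HashMap<>();

    public void addConnection(long a, long b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search: returns the degree, or -1 if not reachable within maxDegree hops.
    public int degreeOfSeparation(long from, long to, int maxDegree) {
        if (from == to) return 0;
        Set<Long> visited = new HashSet<>(Set.of(from));
        Queue<Long> frontier = new ArrayDeque<>(Set.of(from));
        for (int degree = 1; degree <= maxDegree; degree++) {
            Queue<Long> next = new ArrayDeque<>();
            for (Long member : frontier) {
                for (Long neighbor : adjacency.getOrDefault(member, Set.of())) {
                    if (neighbor == to) return degree;
                    if (visited.add(neighbor)) next.add(neighbor);
                }
            }
            frontier = next;
        }
        return -1;
    }
}

Holding the adjacency list in memory keeps each hop a cheap set lookup, which is why the connection graph earned its own service.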
Social networks need powerful search
Lucene was used on top of our member graph
MEMBER SEARCH
MEMBER SEARCH
LinkedIn's second service.
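As a rough illustration of the approach, the sketch below indexes a member document and queries it with Lucene (a recent Lucene version is assumed; the field names and in-memory directory are illustrative, not LinkedIn's search schema):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MemberSearchDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one member profile.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("name", "Josh Clemm", Field.Store.YES));
            doc.add(new TextField("headline", "Engineering at LinkedIn", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the index by a headline keyword.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new QueryParser("headline", analyzer).parse("engineering"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}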
LINKEDIN WITH CONNECTION GRAPH AND SEARCH
Circa 2004
[Diagram: LEO and its DB, the Member Graph service reached over RPC, and Lucene-based search fed by connection / profile updates]
Getting better, but the single database was under heavy load.
Vertically scaling helped, but we needed to offload the read traffic...
Master/slave concept
Read-only traffic served from replicas
Writes go to the main DB
An early version of Databus kept the DBs in sync
REPLICA DBs
[Diagram: the main DB feeding replica DBs through a Databus relay]
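A minimal sketch of the read/write split, assuming hypothetical JDBC URLs and credentials: writes always hit the main DB, while reads that can tolerate replication lag go to a replica.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch only: routes writes to the master and reads to a replica.
public class ReplicaAwareDataSource {
    private static final String MAIN_DB = "jdbc:mysql://main-db:3306/linkedin";
    private static final String REPLICA_DB = "jdbc:mysql://replica-db:3306/linkedin";

    // Writes (profile edits, new connections) always hit the main DB.
    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MAIN_DB, "app", "secret");
    }

    // Reads tolerate slight replication lag, so they can use a replica.
    public Connection readConnection() throws SQLException {
        Connection conn = DriverManager.getConnection(REPLICA_DB, "app", "secret");
        conn.setReadOnly(true); // guard against accidental writes
        return conn;
    }
}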
Good medium-term solution
We could vertically scale servers for a while
Master DBs have finite scaling limits
These days, LinkedIn DBs use partitioning
REPLICA DBs TAKEAWAYS
LINKEDIN WITH REPLICA DBs
Circa 2006
[Diagram: LEO writes to the main DB (R/W) and reads from replica DBs (R/O) kept in sync by a Databus relay; the Member Graph and Search services sit alongside, reached over RPC and fed by connection / profile updates]
As LinkedIn continued to grow, the monolithic application Leo was becoming problematic.
Leo was difficult to release and debug, and the site kept going down...
IT WAS TIME TO...
Kill Leo
SERVICE ORIENTED ARCHITECTURE
Circa 2008 on
Extracting services (Java Spring MVC) from the legacy Leo monolithic application
[Diagram: Public Profile Web App, Profile Service, Recruiter Web App, and yet another service being split out of LEO]
Goal: create a vertical stack of stateless services
Frontend servers fetch data from many domains, build HTML or JSON response
Mid-tier services host APIs, business logic
Data-tier or back-tier services encapsulate data domains
[Diagram: Profile Web App calls the Profile Service, which reads the Profile DB]
SERVICE ORIENTED ARCHITECTURE
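As a sketch of what one of those stateless mid-tier services might look like (Spring MVC style, with a hypothetical ProfileStore data-tier client; illustrative only, not LinkedIn's code):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical domain object and data-tier client, for illustration only.
record Profile(long memberId, String name, String headline) {}

interface ProfileStore {
    Profile findById(long memberId);
}

// Stateless mid-tier service: no per-request state is kept, so any
// instance can serve any call, and instances can be added freely.
@RestController
public class ProfileService {
    private final ProfileStore store; // injected data-tier client

    public ProfileService(ProfileStore store) {
        this.store = store;
    }

    // The frontend web app calls this API and assembles the HTML/JSON response.
    @GetMapping("/profiles/{memberId}")
    public Profile getProfile(@PathVariable long memberId) {
        return store.findById(memberId);
    }
}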
EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN
[Diagram: Browser / App calls a Frontend Web App, which fans out to content services (Groups, Connections, Profile) and mid-tier services; those call data services (e.g. Edu Data Service) backed by Hadoop, a DB, Voldemort, and Kafka]
PROS
Stateless services easily scale
Decoupled domains
Build and deploy independently
CONS
Ops overhead
Introduces backwards compatibility issues
Leads to complex call graphs and fanout
SERVICE ORIENTED ARCHITECTURE COMPARISON
bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l
756
In 2003, LinkedIn had one service (Leo)
By 2010, LinkedIn had over 150 services
Today in 2015, LinkedIn has over 750 services
SERVICES AT LINKEDIN
Getting better, but LinkedIn was experiencing hypergrowth...
Simple way to reduce load on servers and speed up responses
Mid-tier caches store derived objects from different domains, reduce fanout
Caches in the data layer
We use memcache, couchbase, even Voldemort
[Diagram: the Frontend Web App and Mid-tier Service each consult a cache, with another cache in front of the DB]
CACHING
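The mid-tier pattern here is essentially cache-aside: try the cache first, fall back to the authoritative store, then populate the cache. A minimal sketch, assuming a hypothetical Cache interface standing in for memcache or couchbase:

import java.util.Optional;
import java.util.concurrent.TimeUnit;

// Hypothetical cache client interface, standing in for memcache/couchbase.
interface Cache {
    Optional<String> get(String key);
    void put(String key, String value, long ttl, TimeUnit unit);
}

// Authoritative read from the database.
interface ProfileDao {
    String loadProfileJson(long memberId);
}

public class CachedProfileReader {
    private final Cache cache;
    private final ProfileDao dao;

    public CachedProfileReader(Cache cache, ProfileDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    public String profileJson(long memberId) {
        String key = "profile:" + memberId;
        return cache.get(key).orElseGet(() -> {
            String fresh = dao.loadProfileJson(memberId); // cache miss: hit the DB
            cache.put(key, fresh, 5, TimeUnit.MINUTES);   // short TTL bounds staleness
            return fresh;
        });
    }
}

Note the service behind the DAO must still handle full load when the cache is cold or unavailable, which is exactly the takeaway on the next slides.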
There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.
Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg
CACHING TAKEAWAYS
Caches are easy to add in the beginning, but complexity adds up over time.
Over time, LinkedIn removed many mid-tier caches because of the complexity around invalidation
We kept caches closer to the data layer
CACHING TAKEAWAYS (cont.)
Services must handle full load: caches improve speed, they are not permanent load-bearing solutions
We'll use a low-latency solution like Voldemort when appropriate and precompute results
LinkedIn's hypergrowth was extending to the vast amounts of data it collected.
Individual pipelines to route that data weren't scaling. A better solution was needed...
KAFKA MOTIVATIONS
LinkedIn generates a ton of data:
Pageviews
Edits on profiles, companies, schools
Logging, timing
Invites, messaging
Tracking
Billions of events every day
Separate and independently created pipelines routed this data
A WHOLE LOT OF CUSTOM PIPELINES...
A WHOLE LOT OF CUSTOM PIPELINES...
As LinkedIn needed to scale, each pipeline needed to scale.
Distributed pub-sub messaging platform as LinkedIn's universal data pipeline
KAFKA
[Diagram: frontend and backend services publish to Kafka, which feeds the DWH, monitoring, analytics, Hadoop, and Oracle]
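A minimal producer sketch using the Apache Kafka Java client shows the model: any service can publish events onto a shared topic and every downstream consumer reads from the same pipeline. The broker address, topic name, and page-view payload here are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by member id so events for one member land in one partition.
            producer.send(new ProducerRecord<>("page-views", "member-42",
                    "{\"page\":\"/in/joshclemm\",\"ts\":1438387200}"));
        }
    }
}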
BENEFITS
Enabled near-realtime access to any data source
Empowered Hadoop jobs
Allowed LinkedIn to build realtime analytics
Vastly improved site monitoring capability
Enabled devs to visualize and track call graphs
Over 1 trillion messages published per day, 10 million messages per second
KAFKA AT LINKEDIN
OVER 1 TRILLION PUBLISHED DAILY
Let's end with the modern years
Services extracted from Leo or created new were inconsistent and often tightly coupled
Rest.li was our move to a data-model-centric architecture
It ensured a consistent, stateless, RESTful API model across the company.
REST.LI
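Roughly, a Rest.li resource looks like the sketch below, modeled on Rest.li's documented "greetings" example; the namespace and the Greeting record are assumptions rather than LinkedIn production code. The framework derives the REST endpoints and JSON wire format from the resource and its data schema.

import com.linkedin.restli.server.annotations.RestLiCollection;
import com.linkedin.restli.server.resources.CollectionResourceTemplate;

// Greeting is assumed to be generated from a Pegasus (.pdsc) data schema.
@RestLiCollection(name = "greetings", namespace = "com.example.greetings")
public class GreetingsResource extends CollectionResourceTemplate<Long, Greeting> {

    // GET /greetings/{id} maps onto this method.
    @Override
    public Greeting get(Long id) {
        Greeting greeting = new Greeting();
        greeting.setId(id);
        greeting.setMessage("Hello, Rest.li");
        return greeting;
    }
}

Because every resource exposes the same GET/CREATE/UPDATE/DELETE contract over JSON/HTTP, teams could build against each other's APIs without bespoke client code.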
By using JSON over HTTP, our new APIs supported non-Java-based clients.
By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API.
Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day
REST.LI (cont.)
Rest.li Automatic API-documentation
REST.LI (cont.)
Rest.li R2/D2 tech stack
REST.LI (cont.)
LinkedIn's success with data infrastructure like Kafka and Databus led to the development of more and more scalable data infrastructure solutions...
It was clear LinkedIn could build data infrastructure that enables long-term growth
LinkedIn doubled down on infra solutions like:
Storage solutions: Espresso, Voldemort, Ambry (media)
Analytics solutions like Pinot
Streaming solutions: Kafka, Databus, and Samza
Cloud solutions like Helix and Nuage
DATA INFRASTRUCTURE
DATABUS
LinkedIn is a global company and was continuing to see large growth. How else to scale?
Natural progression of horizontal scaling
Replicate data across many data centers using storage technology like Espresso
Pin users to a geographically close data center
Difficult but necessary
MULTIPLE DATA CENTERS
Multiple data centers are imperative to maintain high availability.
You need to avoid any single point of failure, not just for each service, but for the entire site.
LinkedIn runs out of three main data centers, additional PoPs around the globe, and more coming online every day...
MULTIPLE DATA CENTERS
MULTIPLE DATA CENTERS
LinkedIn's operational setup as of 2015 (circles represent data centers, diamonds represent PoPs)
Of course LinkedIn's scaling story is never this simple, so what else have we done?
Each of LinkedIn's critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)
LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps.
WHAT ELSE HAVE WE DONE?
Re-architected frontend approach using:
Client templates
BigPipe
Play Framework
LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy
We improved the performance of servers with new hardware, advanced system tuning, and newer Java runtimes.
WHAT ELSE HAVE WE DONE? (cont.)
Scaling sounds easy and quick to do, right?
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid
Josh Clemm
www.linkedin.com/in/joshclemm
THANKS!
Blog version of this slide deck: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
Visual story of LinkedIn's history: https://ourstory.linkedin.com/
LinkedIn Engineering blog: https://engineering.linkedin.com
LinkedIn Open-Source: https://engineering.linkedin.com/open-source
LinkedIn's communication system slides, which include the earliest LinkedIn architecture: http://www.slideshare.net/linkedin/linkedins-communication-architecture
Slides which include the earliest LinkedIn data infra work: http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012
LEARN MORE
Project Inversion, an internal project to enable developer productivity (trunk-based model), faster deploys, unified services: http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin
LinkedIn's use of Apache Traffic Server: http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-traffic-server
Multi data center, testing failovers: https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promoted-angel-au-yeung
LEARN MORE (cont.)
History and motivation around Kafka: http://www.confluent.io/blog/stream-data-platform-1/
Thinking about streaming solutions as a commit log: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka enabling monitoring and alerting: http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection
Kafka enabling real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
Kafka's current use and future at LinkedIn: http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future
Kafka processing 1 trillion events per day: https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin
LEARN MORE - KAFKA
Open sourcing Databus: https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
Samza streams to help LinkedIn view call graphs: https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
Real-time analytics (Pinot): http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
Introducing the Espresso data store: http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
LEARN MORE - DATA INFRASTRUCTURE
LinkedIn's use of client templates:
Dust.js: http://www.slideshare.net/brikis98/dustjs
Profile: http://engineering.linkedin.com/profile/engineering-new-linkedin-profile
Big Pipe on LinkedIn's homepage: http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page
Play Framework:
Introduction at LinkedIn: https://engineering.linkedin.com/play/composable-and-streamable-play-apps
Switching to a non-blocking asynchronous model: https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell
LEARN MORE - FRONTEND TECH
Introduction to Rest.li and how it helps LinkedIn scale
http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale
How Rest.li expanded across the company
http://engineering.linkedin.com/restli/linkedins-restli-moment
LEARN MORE - REST.LI
JVM memory tuning: http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
System tuning: http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
Optimizing JVM tuning automatically: https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-its-difficulties-and-using-jtune-solution
LEARN MORE - SYSTEM TUNING
LinkedIn continues to grow quickly and there's still a ton of work we can do to improve.
We're working on problems that very few ever get to solve - come join us!
WE'RE HIRING
https://www.linkedin.com/company/linkedin/careers?trk=eng-blog