making facebook faster
DESCRIPTION
Slides from a talk on frontend performance engineering delivered at Velocity 2009 by David Wei and Changhao Jiang
TRANSCRIPT
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
1Sunday, September 27, 2009
Making Facebook faster Frontend performance engineering
Velocity 2009, Jun 24, 2009, San Jose, CA
David Wei and Changhao Jiang
Agenda
1 Site speed matters
2 Performance monitoring
3 Static resource management
4 Ajaxification
5 Client side cache
Site speed matters!
First things first: site speed matters.
Site speed matters: large scale
200 million users, more than 4 billion page views / day
▪ 10 ms per page = more than 1 man-year per day
= more than 5 human lifetimes per year
Facebook cares about site speed.
At our scale, our 200 million users generate more than 4 billion page loads per day.
If we can speed up each page load by 10 ms, in aggregate we save our users more than 1 man-year of time per day; accumulated over a year, that is more than 5 human lifetimes.
Site speed also affects our bottom line. Experiments show that if we reduce latency by 600 ms, the user click rate improves by more than 5%. We are currently running an in-depth experiment on the impact of latency.
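The aggregate figures can be checked with a few lines of arithmetic. This is only a sanity check: the page-load count and the 10 ms saving come from the talk, while the 365-day year and the 80-year lifetime are assumptions made here.

```javascript
// Figures quoted in the talk.
const pageLoadsPerDay = 4e9;     // "more than 4 billion page loads per day"
const savingPerLoadSec = 0.010;  // 10 ms saved per page load

// Assumptions (not from the talk): a 365-day year and an 80-year lifetime.
const SECONDS_PER_YEAR = 365 * 24 * 60 * 60;
const YEARS_PER_LIFETIME = 80;

// Total time saved each day, expressed in years.
const yearsSavedPerDay = (pageLoadsPerDay * savingPerLoadSec) / SECONDS_PER_YEAR;
// Accumulated over a year, expressed in lifetimes.
const lifetimesSavedPerYear = (yearsSavedPerDay * 365) / YEARS_PER_LIFETIME;

console.log(yearsSavedPerDay.toFixed(2));      // about 1.27 years per day
console.log(lifetimesSavedPerYear.toFixed(1)); // about 5.8 lifetimes per year
```

Under these assumptions both slide claims hold: more than 1 man-year per day and more than 5 lifetimes per year.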
Site speed matters: emerging
• Agile development
On the other hand, a site like Facebook faces huge challenges in site performance optimization. Here are a few major ones.
Move fast, no stable code base
Fast development: every week we release a new version of the site with hundreds of code changes, and tens of small code changes are pushed every day. So the code base is never stable, and there is no time to stop for pure optimization.
Site speed matters: emerging
• Agile development
• Deep integration
Deep integration: each Facebook home page is customized for a particular user, with features developed by many teams. Some of them are applications by third-party developers and some are internal Facebook features, depending on which features and applications the user has adopted. It also takes a lot of JavaScript to run them.
Site speed matters: emerging
• Agile development
• Deep integration
• Viral adoption
Viral adoption: it is very hard to predict whether a feature released today will be used by 1 million or 10 million users next week, so it is difficult to optimize beforehand. The infrastructure has to adapt to the growth of user adoption.
• Agile development
• Deep integration
• Viral adoption
• Heavily interactive
… this talk, we will share our experience on how to make a site faster with these challenges.
Heavy interaction: our pages have many dynamic features that rely on JavaScript. For example, the in-browser chat and the application dock provide a very convenient user experience, but they also take a lot of JavaScript to run.
Site speed matters: emerging
• Agile development
• Deep integration
• Viral adoption
• Heavily interactive
In summary, we have a lot of challenges.
These challenges are also essential to what makes Facebook a paradise for people who want to build new things: you can write something cool tonight and push it out tomorrow to 200 million users. At the same time, they make site performance hard to predict and maintain.
In this talk, we will share our experience on how to optimize frontend performance under these challenges.
Site speed: end-to-end latency experienced by
▪ From a user request to the presentation of the page at the browser, interactive:
▪ Network Transfer Time
▪ Server Generation Time
▪ Client Render Time
[Diagram: FB Server, Content Distribution Network (CDN), Browsers; labeled GenTime, NetTime, RenderTime]
Before going into details, we define our problem domain.
We define end-to-end user latency as the time from when the user starts a page request to when the page is presented in the browser and interactive.
There are three components of latency in this process:
Network transfer time is the time from the user's browser to the Facebook server and back. Server generation time is the time spent on the Facebook servers. Client render time is the time the browser spends parsing the HTML, loading JavaScript/CSS/images, and rendering the content.
Site speed: end-to-end latency experienced by
User latency = RenderTime + NetTime + GenTime
▪ RenderTime: ~50% of end-user latency
▪ NetTime: ~25% of end-user latency
▪ GenTime: ~25% of end-user latency
Looking at Facebook's user latency, client-side render time is about 50% of the end-to-end latency; network time and server-side generation time are about 25% each.
In this talk, we focus on the biggest chunk: render time.
Cavalry: Site speed monitoring
User-based measurement
What's our speed?
▪ sampling 1/10000 page loads
[Diagram: Server; First bytes of HTML; JS; All content loaded, Page Interactive; Report]
To make the site faster, the first question we want to ask is: what is our site speed?
There are usually two approaches: run in-house tests, or sample real users. We did both and found the second approach much more helpful.
We learned lessons from the first approach: our pages are vastly different for different users, and Facebook employees are likely to be outliers, because they tend to have many more features enabled than normal users and have installed many plugins such as Firebug and IE developer tools. Even finding a "typical" user is hard, as our users' behavior changes all the time.
Our approach is to take samples from our users. We run JavaScript measurement on a sample of 1 in 10,000 page loads to measure the real speed. The red arrows are the events that we record.
This gives us a real picture of what site speed looks like for Facebook.
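The sampled measurement can be sketched roughly as follows. This is not Cavalry's actual code: the function names, event names, and report shape are assumptions for illustration; only the 1/10000 sampling ratio and the two measured events (first bytes of HTML, page interactive) come from the slide.

```javascript
// Hypothetical sketch; names and report shape are assumptions.
const SAMPLE_RATE = 1 / 10000; // sampling ratio quoted on the slide

// Decide whether this page load is measured. `random` is injectable for testing.
function shouldSample(random = Math.random) {
  return random() < SAMPLE_RATE;
}

// Collect timestamps for named events and turn them into a report.
function createPageTimer(now = Date.now) {
  const marks = {};
  return {
    mark(eventName) {
      marks[eventName] = now();
    },
    report() {
      return {
        // Time from the first bytes of HTML until the page is interactive.
        renderTimeMs: marks.pageInteractive - marks.firstBytes,
        marks: { ...marks },
      };
    },
  };
}

// Usage with a fake clock standing in for the browser's clock.
let fakeClock = 1000;
const timer = createPageTimer(() => fakeClock);
timer.mark('firstBytes');
fakeClock += 850;
timer.mark('pageInteractive');
const sampleReport = timer.report();
if (shouldSample()) {
  console.log('would send report', sampleReport);
}
```

In a browser, the report would be posted back to the server only for the sampled page loads, so the measurement cost is paid by a tiny fraction of users.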
By the way, we load JavaScript before our CSS, because the JavaScript files are loaded in parallel, along with CSS and images.
The last thing I want to point out on this slide is that we load JavaScript before our CSS, which violates the common best practice of putting CSS ahead of JS. However, we download most of our JavaScript in parallel. If we put JS at the top, JS, CSS and images all load in parallel. Half a year ago, we tested this and found it faster. We are running another set of experiments to see whether things have changed.
Cavalry: Day-to-day monitoring
What's our speed?
▪ Collect gen time / network transfer time and render time
[Diagram: Network Time; GenTime; Browser onload time; Cavalry Logs; Daily site speed monitoring]
We combine the JS measurement with our server-side measurements of page generation time and network round-trip time, and put them into a database.
Now we can yell to the company, "Hey, the site is slower today!"
However, we still don't know who caused it. We are continuously launching different features every week, and it is hard to stop and test for performance.
Cavalry: Project-based analysis
Who made it faster / slower?
▪ Integrated with Launch System
[Diagram: Launch System; Network Time; GenTime; Browser onload time; Cavalry Logs; Daily site speed monitoring; Project-based regression detection]
1. The second step of our measurement is to hook the logs up to our launch system. For each measurement sample, we record which new features are launched in the page load.
2. When there is a regression, we can go over the samples and identify the feature launch that caused it.
3. This makes the corresponding team much more responsive to a regression.
4. Then there is still a question: "Why is it slow? How can I fix it?"
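One simple way to go over the samples, sketched below, is to compare the mean render time of page loads that included a launched feature against those that did not. The sample shape and feature names are hypothetical, and this naive comparison is only an illustration, not the analysis Cavalry actually performs.

```javascript
// Hypothetical sample shape: { renderTimeMs: number, launchedFeatures: string[] }
function renderTimeDeltaByFeature(samples) {
  const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;
  const features = new Set(samples.flatMap((s) => s.launchedFeatures));
  const deltas = [];
  for (const feature of features) {
    const withFeature = samples.filter((s) => s.launchedFeatures.includes(feature));
    const withoutFeature = samples.filter((s) => !s.launchedFeatures.includes(feature));
    if (withoutFeature.length === 0) continue; // nothing to compare against
    deltas.push({
      feature,
      deltaMs:
        mean(withFeature.map((s) => s.renderTimeMs)) -
        mean(withoutFeature.map((s) => s.renderTimeMs)),
    });
  }
  // Largest slowdown first.
  return deltas.sort((a, b) => b.deltaMs - a.deltaMs);
}

// Made-up samples for illustration only.
const launchSamples = [
  { renderTimeMs: 900, launchedFeatures: ['chat_v2'] },
  { renderTimeMs: 1100, launchedFeatures: ['chat_v2'] },
  { renderTimeMs: 500, launchedFeatures: [] },
  { renderTimeMs: 700, launchedFeatures: [] },
  { renderTimeMs: 600, launchedFeatures: ['new_dock'] },
];
const featureDeltas = renderTimeDeltaByFeature(launchSamples);
console.log(featureDeltas);
```

A feature at the top of the list with a large positive delta is the first candidate to hand to the owning team.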
Cavalry: Numeric metrics
Why are we fast / slow? How can I fix it?
▪ YSlow-like technical metrics
[Diagram: Gate Keeper; Network Time; GenTime; Browser onload time; Cavalry Logs; Daily site speed monitoring; Project-based regression detection; Regression analysis; YSlow-like metrics]
To answer the "why" question, YSlow is a good tool.
1. We instrument a subset of the YSlow metrics in our sampled page loads. We measure the number of images, DOM nodes, script tags, HTML bytes, CSS rules, and so on. These metrics can indicate what caused a performance regression.
2. What is missing is a mapping from the YSlow metrics to actual time (msec).
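Collecting such counts from a page can be sketched as below. The function and field names are assumptions; the DOM calls used (`getElementsByTagName`, `styleSheets`, `cssRules`, `documentElement.outerHTML`) are standard browser APIs, and a small stand-in object replaces `document` so the sketch also runs outside a browser.

```javascript
// Hypothetical collector; the metric names are assumptions.
function collectPageMetrics(doc) {
  const count = (tag) => doc.getElementsByTagName(tag).length;
  let cssRules = 0;
  for (const sheet of Array.from(doc.styleSheets)) {
    try {
      cssRules += sheet.cssRules.length;
    } catch (err) {
      // Browsers refuse access to the rules of cross-origin stylesheets; skip them.
    }
  }
  return {
    images: count('img'),
    scriptTags: count('script'),
    domNodes: count('*'),
    cssRules,
    // Character count, used here as an approximation of HTML bytes.
    htmlChars: doc.documentElement.outerHTML.length,
  };
}

// Minimal stand-in for `document`, for illustration only.
const fakeDocument = {
  getElementsByTagName: (tag) =>
    ({ img: [1, 2], script: [1], '*': [1, 2, 3, 4, 5] }[tag] || []),
  styleSheets: [{ cssRules: [1, 2, 3] }, { cssRules: [1] }],
  documentElement: { outerHTML: '<html><body></body></html>' },
};
const pageMetrics = collectPageMetrics(fakeDocument);
console.log(pageMetrics);
```

In a sampled page load, these counts would be attached to the same report as the timing data, so a regression in time can be lined up against a jump in one of the counts.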
“WWW” in performance monitoring: What? Who? Why?
▪ User-based measurement: unbiased, representative results
▪ Feature-launch integration: identify the regression
▪ Technical metrics: define actionable items for improvement
1. The missing part is prioritization: how much time, in ms, do we save if we reduce the number of CSS rules by 10%, versus moving the JS down to the bottom?
Haste: Static resource management
Why we need SR Management?
• Day 1: Some smart engineers start a project!
<Print css tag for feature A>
<Print css tag for feature B>
<Print css tag for feature C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
“Let’s write a new page with features A, B and C!”
Why we need SR Management?
• Day 2: Some smart engineers run PageSpeed and think…
<Print css tag for feature A>
<Print css tag for feature B>
<Print css tag for feature C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
“A & B & C are always used; let’s package them together!”
Why we need SR Management?
• Day 2: Awesome!
<Print css tag for feature A&B&C>
<print HTML of feature A>
<print HTML of feature B>
<print HTML of feature C>
…
Why we need SR Management?
• Day 3: feature C evolves…
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
If (users_signup_for_C()) { <print HTML of feature C>}
…
Why we need SR Management?
• Day 3:
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
If (users_signup_for_C()) { <print HTML of feature C>}
…
A&B are always used, while C is not…
Why we need SR Management?
• Day 4: feature C is deprecated
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
// no one uses C { <print HTML of feature C>}
…
Why we need SR Management?
• Day 4: we start to send unused bits
<Print css tag for feature A & B & C>
<print HTML of feature A>
<print HTML of feature B>
// no one uses C { <print HTML of feature C>}
…
It is hard to remember we should remove C here.
Why we need SR Management?
• One month later…
<Print css tag for feature A & B & C & D & E & F & G…>
if (F is used) <print HTML of feature F>
<print HTML of feature G>
if (F is not used) { <print HTML of feature E>}
…
Thousands of dead CSS rules in the package.
Static Resource Management @
Challenges:
• Deep Integration
• Viral Adoption
• Agile Development
Responses:
• Separate requirement declaration and delivery of static resources
• Requirement declaration: lives with HTML generation
• Delivery: Globally optimized
Deep integration: each page has many features.
Viral adoption: usage patterns change quickly.
Agile development: features change fast.
Haste: Static Resource Management
• Back to Day 1:
require_static(A_css); <render HTML of feature A>
require_static(B_css); <render HTML of feature B>
require_static(C_css); <render HTML of feature C>
<deliver all required CSS>
<print all rendered HTML>
Separate Declaration from actual Delivery
Global Optimization on Delivery
Requirement Declaration lives with HTML
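The separation of declaration from delivery can be sketched as follows. Only the idea of a `require_static` call next to each feature's HTML comes from the slide; the per-request page object, the `flush` step, the packager function, and the `/static/` URL prefix are assumptions made for illustration.

```javascript
// Hypothetical per-request collector; not Haste's actual API.
function createPage() {
  const required = new Set(); // declared resources, de-duplicated
  const html = [];
  return {
    // Declaration: called next to the code that renders a feature.
    requireStatic(name) {
      required.add(name);
    },
    render(fragment) {
      html.push(fragment);
    },
    // Delivery: happens once, after every feature has declared its needs.
    // `packager` decides how the required files map onto URLs.
    flush(packager) {
      const tags = packager([...required]).map(
        (url) => `<link rel="stylesheet" href="${url}">`
      );
      return tags.join('\n') + '\n' + html.join('\n');
    },
  };
}

function renderFeature(page, name) {
  page.requireStatic(`${name}.css`);
  page.render(`<div class="${name}">...</div>`);
}

// Simplest possible delivery policy: one URL per file.
const onePerFile = (names) => names.map((n) => `/static/${n}`);

const page = createPage();
['A', 'B', 'C'].forEach((name) => renderFeature(page, name));
const pageOutput = page.flush(onePerFile);
console.log(pageOutput);
```

Because features only declare what they need, the delivery policy can be swapped (for example, to serve pre-built packages) without touching any feature code, and a feature that stops rendering stops requiring its CSS automatically.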
Haste: Global Optimization
Online process
require_static(A_css); <render HTML of feature A>
require_static(B_css); <render HTML of feature B>
require_static(C_css); <render HTML of feature C>
<deliver all required CSS>
<print all rendered HTML>
Usage Pattern logs
Clustering algorithms
“Optimal” packages
Offline analysis
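The offline step can be illustrated with a deliberately simple grouping rule: resources that were required by exactly the same set of logged page loads go into the same package. The slides do not say which clustering algorithm Haste uses, so this rule and the log format are assumptions, not the real implementation.

```javascript
// Each trace is the list of resources one page load required (hypothetical log format).
function packageByUsage(traces) {
  const usage = new Map(); // resource -> indexes of the traces that used it
  traces.forEach((resources, traceIndex) => {
    for (const resource of new Set(resources)) {
      if (!usage.has(resource)) usage.set(resource, []);
      usage.get(resource).push(traceIndex);
    }
  });

  const packages = new Map(); // usage signature -> resources sharing that signature
  for (const [resource, traceIndexes] of usage) {
    const signature = traceIndexes.join(',');
    if (!packages.has(signature)) packages.set(signature, []);
    packages.get(signature).push(resource);
  }
  return [...packages.values()].map((members) => members.sort());
}

// A and B are always used together; C only sometimes (as in the "Day 3" slides).
const usageTraces = [
  ['A.css', 'B.css', 'C.css'],
  ['A.css', 'B.css'],
  ['A.css', 'B.css', 'C.css'],
];
const suggestedPackages = packageByUsage(usageTraces);
console.log(suggestedPackages); // [ [ 'A.css', 'B.css' ], [ 'C.css' ] ]
```

A production system would presumably tolerate near-matches and weigh package size against request count; exact-match grouping is just the smallest rule that shows how usage logs can drive packaging.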
Haste: Trace-based Packaging
Nov 2008 => May 2009
Date      # of JS files  # of JS bytes  # of pkg at home.php  # of bytes at home.php
Nov 2008  461            4.4 MB         29                    629 KB
May 2009  729            5.9 MB         14                    560 KB
The number of JS files increased by about 60% and their total byte size by about 30%, yet the number of packages sent on home.php was halved and the byte size sent is about 10% less.
find | grep -v '\.svn' | grep -v intern | grep -c '\.css$'
find | grep -v '\.svn' | grep -v intern | grep '\.css$' | xargs cat > /tmp/dwei_2008
Haste: Trace-based Packaging
Nov 2008 => May 2009
'js/careers/jobs.js', 'js/lib/ui/timeeditor.js', 'resume/js/resumepro.js', 'resume/js/resumesection.js'
Developers think that timeeditor.js is a library file; in fact, it is only used on one production page (careers). On the other hand, it turns out that the "resume" function is almost always used on the careers page.
Haste: Trace-based Packaging
Nov 2008 => May 2009
Date      # of CSS files  # of CSS bytes  # of pkg at home.php  # of bytes at home.php
Nov 2008  487             1.7 MB          24                    69 KB
May 2009  706             1.9 MB          15                    64 KB
Date      # of JS files   # of JS bytes   # of pkg at home.php  # of bytes at home.php
Nov 2008  461             4.4 MB          29                    629 KB
May 2009  729             5.9 MB          14                    560 KB
CSS is a similar story.
Haste: Trace-based Analysis
Potentials for image sprites too!
• Thousands of virtual gifts with static images, which to sprite?
The same trace-based analysis techniques can be used for image spriting too:
Haste: Trace-based Analysis
Potentials for image sprites too!
• The answer is…
The answer is…
In retrospect, this is pretty straightforward.
Haste: Trace-based Analysis
Adaptive Performance Optimization
• JS / CSS package optimization
• Guidance for image spriting
• Guidance of progressive rendering
Once we separate the declaration and delivery of static resources, we have lots of room for automatic optimization with trace analysis.
You can do automatic packaging, automatic spriting, and automatic progressive rendering: you can look at the most frequently used resources and flush them out before generating the page.
Quickling: Ajaxify the Facebook site
Remove redundant work via Ajax
[Diagram: the "Full page load" row shows load/unload repeated for each of Page 1 to Page 4; the "Ajax call" row shows a single load/unload spanning Page 1 to Page 4. Both rows are labeled "Use session".]
How Quickling works?
1. User clicks a link or back/forward button
2. Quickling sends an ajax to server
3. Response arrives
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
LinkController
Intercept user clicks on links
▪ Dynamically attach a handler to all link clicks:
$('a').click(function() {
  // 'payload' is a JSON encoded response from the server
  $.get(this.href, function(payload) {
    // Dynamically load 'js', 'css' resources for this page.
    bootload(payload.bootload, function() {
      // Swap in the new page's content
      $('#content').html(payload.html);
      // Execute the onloadRegister'ed js code
      execute(payload.onload);
    });
  }, 'json');
  // Cancel the browser's default full-page navigation
  return false;
});
HistoryManager
Enable ‘Back/Forward’ buttons for AJAX requests
▪ Set target page URL as the fragment of the URL
▪ http://www.facebook.com/home.php
▪ http://www.facebook.com/home.php#/cjiang?ref=profile
▪ http://www.facebook.com/home.php#/friends/?ref=tn
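The fragment scheme shown in these URLs can be sketched with two small helpers. The function names are hypothetical; only the URL layout (base page, then `#`, then the target path) comes from the slide.

```javascript
// Hypothetical helpers; the function names are assumptions.

// Build the address-bar URL for an ajax navigation to `targetPath`.
function toFragmentUrl(baseUrl, targetPath) {
  return baseUrl.split('#')[0] + '#' + targetPath;
}

// Recover the target path from a URL, or null when there is no page fragment.
function targetFromUrl(url) {
  const hashIndex = url.indexOf('#');
  if (hashIndex === -1) return null;
  const fragment = url.slice(hashIndex + 1);
  return fragment.startsWith('/') ? fragment : null;
}

const friendsUrl = toFragmentUrl('http://www.facebook.com/home.php', '/friends/?ref=tn');
console.log(friendsUrl);                // http://www.facebook.com/home.php#/friends/?ref=tn
console.log(targetFromUrl(friendsUrl)); // /friends/?ref=tn
```

Changing only the fragment does not reload the page, but it does add a browser history entry, which is what lets the Back and Forward buttons step through ajax navigations.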
Bootloader
Load static resources via ‘script’, ‘link’ tag injection
function requestResource(type, source) {
  var h = document.getElementsByTagName('head')[0];
  switch (type) {
    case 'js':
      var script = document.createElement('script');
      script.src = source;
      script.type = 'text/javascript';
      h.appendChild(script);
      break;
    case 'css':
      var link = document.createElement('link');
      link.rel = "stylesheet";
      link.type = "text/css";
      link.media = "all";
      link.href = source;
      h.appendChild(link);
      break;
  }
}
Other details
▪ All pages now share a single global javascript scope:
  ▪ Explicitly reclaim resources or reset states before leaving a page
  ▪ Stub out setTimeout and setInterval
▪ All CSS rules will be accumulated
  ▪ Name-spacing CSS rules with page-specific information
▪ Busy indicator
▪ iframe transport
▪ Permanent link
  ▪ prelude inlined js code to redirect if necessary
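Stubbing out the timer functions, as listed above, could look roughly like the sketch below: the wrappers remember every pending timer so that all of them can be cancelled when the user leaves a page. The helper name and its return shape are assumptions; `host` stands for the object that owns the timer functions (`window` in a browser).

```javascript
// Hypothetical helper; `host` is the object that owns setTimeout and friends.
function trackTimers(host) {
  const pendingTimeouts = new Set();
  const pendingIntervals = new Set();
  const original = { setTimeout: host.setTimeout, setInterval: host.setInterval };

  host.setTimeout = function (callback, delay, ...args) {
    const id = original.setTimeout.call(host, function () {
      pendingTimeouts.delete(id); // fired, so no longer pending
      callback(...args);
    }, delay);
    pendingTimeouts.add(id);
    return id;
  };

  host.setInterval = function (callback, delay, ...args) {
    const id = original.setInterval.call(host, callback, delay, ...args);
    pendingIntervals.add(id);
    return id;
  };

  return {
    pendingCount: () => pendingTimeouts.size + pendingIntervals.size,
    // Call right before swapping in the next page's content.
    clearAll() {
      pendingTimeouts.forEach((id) => host.clearTimeout(id));
      pendingIntervals.forEach((id) => host.clearInterval(id));
      pendingTimeouts.clear();
      pendingIntervals.clear();
    },
  };
}
```

Timers that page code cancels itself stay in the tracking sets until `clearAll` runs; clearing an already-cleared timer is harmless, so the sketch does not wrap `clearTimeout` or `clearInterval`.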
Current status
▪ Turned on for Firefox and IE users (>90% of users)
▪ ~60% of page hits to Facebook site are Quickling requests
Performance improvement
40% ~ 50% reduction in render time
PageCache: Cache visited pages at client side
PageCache
Cache user visited pages in browsers
▪ Motivation:
▪ A typical user session:
▪ home -> profile -> photo -> home -> notes -> home -> photo -> photo
▪ Some pages are likely to be revisited soon (temporal locality)
▪ Home page visited every 3 ~ 5 page views
▪ Back/Forward button
How PageCache works?
Page not in cache:
1. User clicks a link or back button
2. Quickling sends ajax to server
3. Response arrives
3.5 Save response in cache
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
Page in cache:
1. User clicks a link or back button
2. Find Page in the cache
3. Response arrives
4. Quickling blanks the content area
5. Download javascript/CSS
6. Show new content
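The two paths can be sketched as a small in-memory cache in front of the ajax request. The function names and the callback-style fetch are assumptions for illustration; the session replayed at the end is the typical session from the motivation slide.

```javascript
// Hypothetical sketch. `fetchPage(url, done)` stands in for the Quickling ajax request.
function createPageCache(fetchPage) {
  const cache = new Map(); // URL -> response payload, kept in memory only
  return {
    load(url, onReady) {
      if (cache.has(url)) {
        onReady(cache.get(url), true); // step 2: page found in the cache
        return;
      }
      fetchPage(url, (payload) => {
        cache.set(url, payload);       // step 3.5: save the response
        onReady(payload, false);
      });
    },
    invalidate(url) {
      cache.delete(url);
    },
  };
}

// Replay the typical session with a fake fetch that counts server requests.
let fetchCount = 0;
const fakeFetch = (url, done) => {
  fetchCount += 1;
  done({ html: `<div>${url}</div>` });
};
const pageCache = createPageCache(fakeFetch);
const session = ['home', 'profile', 'photo', 'home', 'notes', 'home', 'photo', 'photo'];
let cacheHits = 0;
for (const url of session) {
  pageCache.load(url, (payload, fromCache) => {
    if (fromCache) cacheHits += 1;
  });
}
console.log(fetchCount, cacheHits); // 4 4
```

In this toy session half of the page views never reach the server, which illustrates the temporal locality argument; real hit rates depend on the consistency rules that decide when a cached page may still be shown.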
Cache consistency 1: Incremental updates
Poll server for incremental updates via ajax calls.
▪ Allow registering javascript functions to be called right before cached page is shown.
▪ Used by home page to refresh ‘ads’, fetch latest stories
[Screenshots: Cached version, Restored version]
We provide functions that let programmers register a JavaScript function to be called right before a cached page is shown. The home page uses this to refresh ads and fetch the latest stories.
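A minimal sketch of such a registration hook follows. The names `createCachedPage`, `onBeforeShow`, and `restore` are assumptions, and the handler hard-codes its update where a real one would poll the server via ajax.

```javascript
// Hypothetical sketch of a cached page with "before show" hooks.
function createCachedPage(html) {
  const beforeShowHandlers = [];
  return {
    html,
    // Register a function to run right before the cached page is shown.
    onBeforeShow(handler) {
      beforeShowHandlers.push(handler);
    },
    // Run every registered handler, then return the HTML to display.
    restore() {
      for (const handler of beforeShowHandlers) handler(this);
      return this.html;
    },
  };
}

const homePage = createCachedPage('<ul><li>old story</li></ul>');
homePage.onBeforeShow((cachedPage) => {
  // A real handler would fetch the latest stories from the server here.
  cachedPage.html = cachedPage.html.replace('<ul>', '<ul><li>new story</li>');
});
const shownHtml = homePage.restore();
console.log(shownHtml);
```

The cached markup is reused as the starting point and only the parts that go stale quickly are refreshed, so the page can appear immediately and catch up afterwards.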
Cache consistency 2: In-page writes
Record and replay
▪ Automatically record all state-changing operations in a cached page
▪ Automatically replay those operations when cached page is restored.
[Screenshots: Cached version, Restored version]
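Record and replay can be sketched as an operation log kept next to the cached page. The recorder API and the shape of the page state are assumptions for illustration; the slides do not show how operations are captured automatically.

```javascript
// Hypothetical recorder; operations are functions that mutate a page-state object.
function createRecorder() {
  const log = [];
  return {
    // Apply a state-changing operation now and remember it for later.
    perform(state, operation) {
      log.push(operation);
      operation(state);
    },
    // Re-apply every recorded operation, in order, to a restored page state.
    replay(state) {
      log.forEach((operation) => operation(state));
      return state;
    },
  };
}

// The snapshot is what was cached when the page first loaded.
const cachedSnapshot = { likes: 0, comments: [] };
const liveState = JSON.parse(JSON.stringify(cachedSnapshot));

const recorder = createRecorder();
recorder.perform(liveState, (s) => { s.likes += 1; });
recorder.perform(liveState, (s) => { s.comments.push('Nice photo!'); });

// Later: the page is restored from the stale snapshot, then brought up to date.
const restoredState = recorder.replay(JSON.parse(JSON.stringify(cachedSnapshot)));
console.log(restoredState);
```

Replaying on a copy of the snapshot reproduces what the user last saw on that page without asking the server again.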
Cache consistency 3: Cross-page writes
Server side invalidation
▪ Instrument server-side database access API; whenever a write operation is detected, send a signal to the client to invalidate the cache.
[Screenshots: Cached version, State-changing op, Restored version]
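The invalidation signal can be sketched end to end with a wrapped database handle on the server and a flag in the response. Everything here is an assumption made for illustration: the `handleRequest` and `applyResponse` names, the `invalidateCache` field, the use of a `Map` as the database, and the choice to clear the whole client cache rather than individual pages.

```javascript
// Server side (hypothetical): run a request handler against an instrumented database.
function handleRequest(db, action) {
  let wroteData = false;
  const instrumentedDb = {
    read: (key) => db.get(key),
    write: (key, value) => {
      wroteData = true; // a write was detected during this request
      db.set(key, value);
    },
  };
  const html = action(instrumentedDb);
  // The flag travels back to the browser together with the page content.
  return { html, invalidateCache: wroteData };
}

// Client side (hypothetical): drop cached pages when the server says so.
function applyResponse(clientCache, response) {
  if (response.invalidateCache) clientCache.clear();
  return response.html;
}

const database = new Map([['status', 'at Velocity']]);
const clientCache = new Map([['home', '<div>cached home</div>']]);

// A read-only request leaves the cache alone.
applyResponse(clientCache, handleRequest(database, (db) => `<p>${db.read('status')}</p>`));
const sizeAfterRead = clientCache.size;

// A request that writes invalidates the cache.
applyResponse(clientCache, handleRequest(database, (db) => {
  db.write('status', 'back home');
  return '<p>updated</p>';
}));
const sizeAfterWrite = clientCache.size;
console.log(sizeAfterRead, sizeAfterWrite); // 1 0
```

Instrumenting the database API means feature code does not have to remember to invalidate anything; a finer-grained signal that names the affected pages would keep more of the cache alive.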
Current status
▪ Deployed on production
▪ Only cache in memory
▪ Only turned on for home page
~20% savings on page hits to home page
Performance improvement
3X ~ 4X speedup in render time vs Quickling
Summary
▪ Performance monitoring: What, Who, and Why (“WWW”)
▪ Static resource management: Adaptive to fast evolution
▪ Ajaxify the website.
▪ Client side caching of user visited pages
Measurement: we need to answer three questions: what is the speed, who made it faster or slower, and why it is faster or slower.
Static resource management needs to adapt to the fast evolution of code changes and user adoption.
Ajaxifying websites where pages in a user session share a lot of common work can eliminate the redundant work and improve user-perceived performance.
Caching users' visited pages on the client side can reduce the server's overall load and improve user-perceived performance.
Thank you!