Towards Understanding Modern Web Traffic
Sunghwan Ihm and Vivek S. Pai
Google Inc. / Princeton University
Sunghwan Ihm, Princeton University
Web Changes and Growth
Simple static documents → complex rich media applications
  Heavy client-side interactions (e.g., Ajax)
Traffic increase
  Social networking, file-sharing, and video streaming sites
Trends expected to continue
  Applications migrated to the Web
  A de facto standard interface of cloud services
Understanding Changes
Goal: shape system design by better understanding the traffic optimization opportunities
Improve response times
Understand caching effectiveness
Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems
Challenges
Tracking changes
  Requires a large-scale data set spanning many years, collected under the same conditions
Web page analysis
  Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)
Redundancy and caching
  Requires full content instead of simple access logs for assessing the implications of content-based caching
We address these challenges by
1. Analyzing large-scale data with full content
2. Developing a new Web page analysis technique
CoDeeN Traffic
CoDeeN content distribution network (CDN)
http://codeen.cs.princeton.edu/
A semi-open globally distributed open proxy on 500+ PlanetLab nodes
Running since 2003
30+ million requests per day
Data Collection
Assume local proxy caches
1. Access logs (all requests, but limited info.)
   URL, Timestamp, Content-Length, Content-Type, Referer, etc.
2. Full content (cache misses)
   Header + body
[Figure: data collection points. Labels: User Browser Cache, Local Proxy Cache, CoDeeN Cache, WAN, Origin Web Server, Access Logs, Full Content]
Data Set
5 years: from 2006 to 2010
  Focus on one month (April) per year
  Full content data only for 2010
Total volume per month
  3.3~6.6 TB
  280~460 million requests
  240~360K unique client IPs (40~60% /8 nets)
  168~187 countries and regions
  820K~1.2 million servers
Focus on US, CN, FR, BR: 100M+ requests / 1TB+ / 100K+ users
Analysis Outline
1. High-level analysis
2. Page-level analysis
3. Caching analysis
[Diagram labels: Access Logs, Full Content]
1. High-Level Analysis
Q: What has changed over five years?
Connection speed
NAT usage
Max # concurrent browser connections
Content type
Object Size
Traffic share of Web sites
Content Type
US, 2006 → 2010, both X and Y log-scale
A sharp increase of Ajax: JavaScript / CSS / XML
A sharp increase of Flash video (FLV) (<5% → 25%)
Traffic Share of Web Sites
Increase in video sites’ traffic
Increase in ad networks and analytics sites' requests (~12%)
  Ad networks market growth
Most accessed sites by users: search / analytics
  google.com, baidu.com, google-analytics.com
  % user share increasing, tracking up to 65%
2. Page-Level Analysis
Q: How have Web pages changed?
New page detection heuristic
Initial page characteristics
  Page size / # of embedded objects / latency
Page load latency simulation
Entire page characterization
Page Detection Problem
Given a set of access logs, detect the page boundaries
# of embedded objects, page size, time, etc.
Challenge: previous approaches from the 1990s are a poor fit for modern Web traffic and classify it inaccurately
[Figure: request timeline. Labels: main, embedded, Time]
Previous Approach #1: Time-based
Check idle time between requests
If within a threshold (e.g., 1 second), they belong to the same page
Misclassifies client-side interactions (Ajax) with longer idle time as pages
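The time-based rule can be illustrated with a minimal sketch. The function name, the 1-second default, and the sample timestamps are assumptions made for illustration; they are not taken from the talk's data.

```python
from typing import List


def split_pages_by_idle_time(timestamps: List[float],
                             threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps (in seconds) into pages by idle time.

    A request joins the current page when the gap since the previous
    request is within `threshold`; otherwise it starts a new page.
    """
    pages: List[List[float]] = []
    for ts in sorted(timestamps):
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)
        else:
            pages.append([ts])
    return pages


# Hypothetical trace: one page load (0.0 to 0.6 s), then an Ajax call
# issued by the same page 5 seconds later.
requests = [0.0, 0.2, 0.6, 5.6]
pages = split_pages_by_idle_time(requests)
print(len(pages))
```

In this trace the late Ajax request is counted as a second page, which is the misclassification described above.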
Previous Approach #2: Type-based
Check file extension / content type
Regard every html object as a main object
Misclassifies frames/iframes within a page as separate pages
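A minimal sketch of the type-based rule, assuming each log entry is a dictionary with hypothetical `url` and `content_type` keys (the real access log layout is not shown in the slides):

```python
from typing import Dict, List


def main_objects_by_type(log: List[Dict[str, str]]) -> List[str]:
    """Return the URL of every entry whose content type is HTML."""
    return [
        entry["url"]
        for entry in log
        if entry.get("content_type", "").lower().startswith("text/html")
    ]


# Hypothetical log: one page that embeds a stylesheet and an iframe.
log = [
    {"url": "http://example.com/", "content_type": "text/html; charset=utf-8"},
    {"url": "http://example.com/style.css", "content_type": "text/css"},
    {"url": "http://example.com/frame.html", "content_type": "text/html"},
]
mains = main_objects_by_type(log)
print(mains)
```

Both the page and its iframe are reported as main objects, so a single page view is counted as two pages.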
StreamStructure Algorithm
1. Group logs into streams by Referer field
2. Consider all html objects as main object candidates (Type-based)
3. Ignore those with no children (embedded objects)
4. Apply idle time among the candidates for finalizing selection (Time-based)
[Figure labels: Ajax, frames/iframes]
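The four steps can be sketched as follows. This is one simplified reading of the slide, not the authors' implementation: the `Entry` fields, the way streams are built by following Referer links, the choice to measure idle time against the previous request in the same stream, and the 1-second default are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    ts: float                      # request time in seconds
    url: str
    content_type: str
    referer: Optional[str] = None


def stream_structure(log: List[Entry], idle: float = 1.0) -> List[str]:
    """Return URLs selected as main objects (simplified sketch)."""
    log = sorted(log, key=lambda e: e.ts)

    # Step 1: group requests into streams by following the Referer field.
    stream_of: Dict[str, int] = {}         # url -> stream id
    streams: Dict[int, List[Entry]] = {}
    children: Dict[str, int] = {}          # url -> # requests referring to it
    for e in log:
        if e.referer is not None and e.referer in stream_of:
            sid = stream_of[e.referer]
            children[e.referer] = children.get(e.referer, 0) + 1
        else:
            sid = len(streams)             # unknown referer: new stream
        stream_of[e.url] = sid
        streams.setdefault(sid, []).append(e)

    mains: List[str] = []
    for entries in streams.values():
        prev_ts: Optional[float] = None
        for e in entries:
            # Step 2: only html objects are candidates (type-based).
            is_html = e.content_type.lower().startswith("text/html")
            # Step 3: drop candidates with no embedded objects.
            has_children = children.get(e.url, 0) > 0
            # Step 4: require enough idle time before the candidate.
            idle_enough = prev_ts is None or e.ts - prev_ts > idle
            if is_html and has_children and idle_enough:
                mains.append(e.url)
            prev_ts = e.ts
    return mains


# Hypothetical trace: a page with an iframe, an Ajax HTML fragment,
# and a later navigation to a second page.
root = "http://a.example/"
trace = [
    Entry(0.0, root, "text/html"),
    Entry(0.1, root + "a.css", "text/css", root),
    Entry(0.2, root + "frame.html", "text/html", root),
    Entry(0.3, root + "frame.png", "image/png", root + "frame.html"),
    Entry(6.0, root + "fragment", "text/html", root),
    Entry(9.0, root + "page2", "text/html", root),
    Entry(9.1, root + "page2.js", "application/javascript", root + "page2"),
]
mains = stream_structure(trace)
print(mains)
```

In this trace the iframe is rejected by the idle-time check and the Ajax fragment is rejected because it has no children, so only the two real pages remain.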
Validation
Ground truth: browse Alexa’s top 100 sites
  Visit about 10 pages per site
  Record Web page URLs (main objects)
  Total 1197 pages
Precision
  # correct pages found / # total pages found
Recall
  # correct pages found / # total correct pages
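The two metrics can be computed directly from sets of main-object URLs. The sketch below uses made-up URLs, not the validation data from the talk.

```python
from typing import Iterable, Tuple


def precision_recall(found: Iterable[str],
                     truth: Iterable[str]) -> Tuple[float, float]:
    """Precision = correct / found, recall = correct / ground truth."""
    found_set, truth_set = set(found), set(truth)
    correct = len(found_set & truth_set)
    precision = correct / len(found_set) if found_set else 0.0
    recall = correct / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical example: 4 true pages, 3 detected, 2 of them correct.
truth = {"/a", "/b", "/c", "/d"}
found = {"/a", "/b", "/x"}
p, r = precision_recall(found, truth)
print(p, r)
```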
Validation Result
StreamStructure outperforms other approaches
  Robust to the idle time parameter selection
[Figure: validation results. Surviving annotations: 1 sec, Better, 26~33, 19~30, 4~24, 4]
Identifying Initial Page Loads
Initial page: user-perceived page, user-perceived latency, traffic/revenue of Web sites
Apply Time-based approach, but DNS lookup or browser processing time can vary significantly
Use Google Analytics beacon
  JavaScript collecting various client-side info.
  Fires when the document is loaded
40-60% of traffic occurs after initial page loads
[Figure labels: Initial Page Load, Client-side Interactions (e.g., Ajax)]
Initial Page Size and # Objects
Initial pages become increasingly complex
US: about 2x increase
  2006: 69 KB / 6 objects
  2010: 133 KB / 12 objects
[Figure annotation: Caching Effectiveness]
Initial Page Load Latency
Median latency dropped in 2009 and 2010
  Increased # of concurrent browser connections
  Reduced per-object latency from improved caching behavior / client bandwidth
3. Caching Analysis
Q: Implications for caching?
URL popularity
Caching effectiveness
Required cache storage size
Impact of aborted transfers
Two Caching Approaches
HTTP Object-based Approach
  Whole object
  HTTP-cacheable only
  Previously reported cache hit rate: 35~50%
  Byte hit rate usually much less
Content-based Approach
  Cache smaller chunks instead of objects
  Protocol independent
  Effective for uncacheable content as well
  WAN accelerators, storage/file systems
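The difference between the two approaches can be seen in a toy comparison of byte hit rates. This sketch makes several simplifying assumptions: the object cache counts a hit only when the same URL returns an identical body and ignores HTTP cacheability and expiry rules, the chunk cache uses fixed-size 128-byte chunks (deployed systems typically use content-defined chunk boundaries instead), both caches have unlimited storage, and the responses are invented.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Response = Tuple[str, bytes]  # (url, body)


def object_byte_hit_rate(responses: List[Response]) -> float:
    """Whole-object cache keyed by URL; hit only on an identical body."""
    cache: Dict[str, bytes] = {}
    hit = total = 0
    for url, body in responses:
        total += len(body)
        if cache.get(url) == body:
            hit += len(body)
        cache[url] = body
    return hit / total if total else 0.0


def chunk_byte_hit_rate(responses: List[Response],
                        chunk_size: int = 128) -> float:
    """Chunk cache keyed by content hash, independent of the URL."""
    seen: Set[bytes] = set()
    hit = total = 0
    for _url, body in responses:
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            total += len(chunk)
            if digest in seen:
                hit += len(chunk)
            else:
                seen.add(digest)
    return hit / total if total else 0.0


# Hypothetical bodies built from four distinct 128-byte blocks.
A, B, C, D = (bytes([x]) * 128 for x in b"abcd")
responses = [
    ("/news", A + B + C),   # first fetch
    ("/news", A + B + C),   # unchanged refetch
    ("/news", A + B + D),   # updated version (intra-URL redundancy)
    ("/copy", A + B + D),   # same content, new URL (inter-URL redundancy)
]
obj_rate = object_byte_hit_rate(responses)
chunk_rate = chunk_byte_hit_rate(responses)
print(obj_rate, chunk_rate)
```

Only the unchanged refetch hits in the object cache, while the chunk cache also reuses bytes shared across versions and across URLs.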
Ideal Cache Hit Rate
HTTP object-based: 17~28%
  Mainly effective for JavaScript and images
Content-based: 42~51% with 128-byte chunks
  Effective for any content type
  1.8~2.5x (content-based relative to object-based)
Growth of tail that hurts caching
Origins of Redundancy
Most of the additional savings come from redundancy
  across different versions (intra-URL)
  across different objects (inter-URL)
[Figure: US, 128-byte chunks. Annotations: Content updates, Aborted]
Required Cache Storage Size
1-KB outperforms 128-B w/ metadata overhead
MRC: Multi-Resolution Chunking (USENIX’10)
Increases working set size
  Large cache storage highly desirable
[Figure annotation: CN: 218GB]
Conclusions
Analyzed five years of real Web traffic with over 70,000 users
Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking up to 65% of users
Developed StreamStructure
  Half of the traffic occurs due to client-side interactions after initial page loads
  Pages have become increasingly complex
Content-based caching with large cache storage highly desirable
  2x larger byte hit rate, aborted transfers
http://www.cs.princeton.edu/~sihm/
Thank You