Towards Understanding Modern Web Traffic

Sunghwan Ihm and Vivek S. Pai

Google Inc. / Princeton University

Sunghwan Ihm, Princeton University

2

Web Changes and Growth

Simple static documents → complex rich media applications
  Heavy client-side interactions (e.g., Ajax)

Traffic increase
  Social networking, file-sharing, and video streaming sites

Trends expected to continue
  Applications migrated to the Web
  A de facto standard interface of cloud services


3

Understanding Changes

Goal: shape system design by better understanding the traffic optimization opportunities

Improve response times

Understand caching effectiveness

Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems


4

Challenges

Tracking changes
  Requires a large-scale data set spanning many years, collected under the same conditions

Web page analysis
  Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)

Redundancy and caching
  Requires full content instead of simple access logs for assessing the implications of content-based caching

We address these challenges by
  1. Analyzing large-scale data with full content
  2. Developing a new Web page analysis technique


5

CoDeeN Traffic

CoDeeN content distribution network (CDN)

http://codeen.cs.princeton.edu/

A semi-open globally distributed open proxy on 500+ PlanetLab nodes

Running since 2003

30+ million requests per day



6

Data Collection

Assume local proxy caches

1. Access logs (all requests, but limited info.)
  URL, Timestamp, Content-Length, Content-Type, Referer, etc.

2. Full content (cache misses)
  Header + body

[Figure: data collection setup. Components shown: User, Browser Cache, Local Proxy Cache, CoDeeN Cache, WAN, Origin Web Server. Data collected: Access Logs, Full Content.]


7

Data Set

5 years: from 2006 to 2010
  Focus on one month (April) per year
  Full content data only for 2010

Total volume per month
  3.3~6.6 TB
  280~460 million requests
  240~360K unique client IPs (40~60% of /8 networks)
  168~187 countries and regions
  820K~1.2 million servers

Focus on US, CN, FR, BR: 100M+ requests / 1 TB+ / 100K+ users


8

Analysis Outline

1. High-level analysis

2. Page-level analysis

3. Caching analysis

Access Logs

Full Content


9

1. High-Level Analysis

Q: What has changed over five years?

Connection speed

NAT usage

Max # concurrent browser connections

Content type

Object Size

Traffic share of Web sites


10

Content Type

US, 2006 → 2010, both X and Y axes log-scale

A sharp increase of Ajax: JavaScript / CSS / XML

A sharp increase of Flash video (FLV) (<5% → 25%)


11

Traffic Share of Web Sites

Increase in video sites’ traffic

Increase in ad networks and analytics sites’ requests (~12%)
  Ad networks market growth

Most accessed sites by users: search / analytics
  google.com, baidu.com, google-analytics.com
  % user share increasing, tracking up to 65%


12

2. Page-Level Analysis

Q: How have Web pages changed?

New page detection heuristic

Initial page characteristics: page size / # of embedded objects / latency

Page load latency simulation

Entire page characterization


13

Page Detection Problem

Given a set of access logs, detect the page boundaries

# of embedded objects, page size, time, etc.

Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic

[Figure: requests along a time axis, labeled as main and embedded objects.]


14

Previous Approach #1: Time-based

Check idle time between requests

If within a threshold (e.g., 1 second), they belong to the same page

Misclassifies client-side interactions (Ajax) with longer idle time as pages
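The time-based rule above can be sketched as follows. This is a minimal illustration, not the code used in the study; the function name `time_based_pages` and the sample timestamps are assumptions made for the example.

```python
def time_based_pages(timestamps, threshold=1.0):
    """Split one client's request timestamps (seconds, sorted) into pages.

    A request joins the current page when it arrives within `threshold`
    seconds of the previous request; otherwise it starts a new page.
    """
    pages = []
    for ts in timestamps:
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)
        else:
            pages.append([ts])
    return pages


# Hypothetical trace: a page load at t=0, then an Ajax poll at t=5.
trace = [0.0, 0.2, 0.5, 5.0, 5.1]
print(time_based_pages(trace))
```

With this trace the Ajax requests at t=5.0 and t=5.1 are reported as a second page, which is the misclassification the slide describes.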


15

Previous Approach #2: Type-based

Check file extension / content type

Regard every HTML object as a main object

Misclassifies frames/iframes within a page as separate pages
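A minimal sketch of the type-based rule, assuming each log entry is a dict with hypothetical `url` and `type` keys (the actual log schema is not given in the slides):

```python
def type_based_main_objects(requests):
    """Return the URLs treated as main objects by a type-based check.

    An object counts as a main object if its content type is HTML or
    its URL path ends with an HTML-like extension.
    """
    mains = []
    for r in requests:
        is_html_type = r["type"].startswith("text/html")
        is_html_ext = r["url"].split("?")[0].endswith((".html", ".htm"))
        if is_html_type or is_html_ext:
            mains.append(r["url"])
    return mains


# Hypothetical page with one embedded iframe.
page = [
    {"url": "http://example.com/index.html", "type": "text/html"},
    {"url": "http://example.com/style.css", "type": "text/css"},
    {"url": "http://example.com/ad_frame.html", "type": "text/html"},
]
print(type_based_main_objects(page))
```

Both `index.html` and the iframe `ad_frame.html` are returned, so a single page view is counted as two pages.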


16

StreamStructure Algorithm

1. Group logs into streams by Referer field

2. Consider all HTML objects as main object candidates (Type-based)

3. Ignore those with no children (embedded objects)

4. Apply idle time among the candidates for finalizing selection (Time-based)

[Figure callouts: Ajax, frames/iframes]
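The four steps can be sketched roughly as below. This is a simplified reading of the slide, not the authors' implementation: the record fields (`url`, `referer`, `type`, `ts`), the choice to start a new stream when the Referer is unknown, and measuring idle time against the previous request in the same stream are all assumptions.

```python
from collections import defaultdict


def stream_structure(requests, idle_threshold=1.0):
    """Return main-object URLs for one client's time-sorted requests."""
    # Step 1: group requests into streams by following the Referer field.
    stream_of = {}                 # url -> stream id
    streams = defaultdict(list)    # stream id -> requests in time order
    next_id = 0
    for r in requests:
        ref = r.get("referer")
        if ref in stream_of:
            sid = stream_of[ref]
        else:
            sid = next_id          # unknown or empty Referer: new stream
            next_id += 1
        stream_of[r["url"]] = sid
        streams[sid].append(r)

    # URLs named as a Referer by some request, i.e. objects with children.
    has_children = {r["referer"] for r in requests if r.get("referer")}

    mains = []
    for reqs in streams.values():
        prev_ts = None
        for r in reqs:
            # Steps 2 and 3: HTML objects that have embedded objects.
            candidate = r["type"] == "text/html" and r["url"] in has_children
            idle = None if prev_ts is None else r["ts"] - prev_ts
            # Step 4: keep candidates preceded by enough idle time.
            if candidate and (idle is None or idle >= idle_threshold):
                mains.append(r["url"])
            prev_ts = r["ts"]
    return mains


# Hypothetical trace: page a.html with an iframe and an Ajax call,
# followed by a click to b.html.
log = [
    {"url": "a.html", "referer": None, "type": "text/html", "ts": 0.0},
    {"url": "a.css", "referer": "a.html", "type": "text/css", "ts": 0.1},
    {"url": "frame.html", "referer": "a.html", "type": "text/html", "ts": 0.2},
    {"url": "frame.png", "referer": "frame.html", "type": "image/png", "ts": 0.3},
    {"url": "ajax.xml", "referer": "a.html", "type": "text/xml", "ts": 5.0},
    {"url": "b.html", "referer": "a.html", "type": "text/html", "ts": 10.0},
    {"url": "b.js", "referer": "b.html", "type": "application/javascript", "ts": 10.1},
]
print(stream_structure(log))
```

In this trace the iframe is rejected by the idle-time check and the Ajax response is rejected by the type check, so only `a.html` and `b.html` remain as main objects.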


17

Validation

Ground truth: browse Alexa’s top 100 sites
  Visit about 10 pages per site
  Record Web page URLs (main objects)
  Total 1197 pages

Precision
  # correct pages found / # total pages found

Recall
  # correct pages found / # total correct pages
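The two metrics can be computed from sets of main-object URLs, as in this small sketch. The function name and the sample URLs are illustrative only.

```python
def precision_recall(found, truth):
    """Precision and recall of detected main-object URLs.

    found: URLs reported by a page detection algorithm
    truth: URLs recorded while browsing (ground truth)
    """
    found, truth = set(found), set(truth)
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical result: 4 pages reported, 2 of them correct, 3 true pages.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```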


18

Validation Result

StreamStructure outperforms other approaches

Robust to the idle time parameter selection

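[Figure: validation results for the page detection approaches across idle time values; annotations: "1 sec", "Better", 26~33, 19~30, 4~24, 4.]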


19

Identifying Initial Page Loads

Initial page: user-perceived page → user-perceived latency → traffic/revenue of Web sites

Apply Time-based approach, but DNS lookup or browser processing time can vary significantly

Use Google Analytics beacon
  JavaScript collecting various client-side info.
  Fires when the document is loaded

[Figure: page timeline split into Initial Page Load and Client-side Interactions (e.g., Ajax).]

40-60% of traffic occurs after initial page loads
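One way to use the beacon as a cut point is sketched below. It assumes the beacon request can be recognized by its host alone and that a page's requests are already grouped and time-sorted; the function name, the fallback for pages without a beacon, and the sample URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Assumption: any request to this host (or a subdomain) is the beacon.
BEACON_HOST = "google-analytics.com"


def split_at_beacon(page_requests):
    """Split one page's (timestamp, url) list at the first beacon request.

    Returns (initial_load, later_requests). Requests up to and including
    the beacon are treated as the initial page load.
    """
    for i, (_ts, url) in enumerate(page_requests):
        host = urlparse(url).hostname or ""
        if host == BEACON_HOST or host.endswith("." + BEACON_HOST):
            return page_requests[: i + 1], page_requests[i + 1:]
    # No beacon seen: this sketch treats the whole page as the initial load.
    return page_requests, []


# Hypothetical page trace.
page = [
    (0.0, "http://example.com/"),
    (0.3, "http://example.com/app.js"),
    (0.9, "http://www.google-analytics.com/__utm.gif?x=1"),
    (7.5, "http://example.com/ajax?q=1"),
]
initial, later = split_at_beacon(page)
print(len(initial), len(later))
```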


20

Initial Page Size and # Objects

Initial pages become increasingly complex

US: about 2x increase
  2006: 69 KB / 6 objects
  2010: 133 KB / 12 objects

[Figure annotation: Caching Effectiveness]


21

Initial Page Load Latency

Median latency dropped in 2009 and 2010
  Increased # of browser concurrent connections
  Reduced per-object latency from improved caching behavior / client bandwidth


22

3. Caching Analysis

Q: Implications for caching?

URL popularity

Caching effectiveness

Required cache storage size

Impact of aborted transfers


23

Two Caching Approaches

HTTP Object-based Approach
  Whole object
  HTTP-cacheable only
  Previously reported cache hit rate: 35~50%
  Byte hit rate usually much less

Content-based Approach
  Cache smaller chunks instead of objects
  Protocol independent
  Effective for uncacheable content as well
  WAN accelerators, storage/file systems
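The difference between the two approaches can be illustrated with a toy byte hit rate comparison. This sketch makes several simplifying assumptions that are not in the slides: an infinite cache, fixed-size chunks (deployed systems often cut chunks at content-defined boundaries instead), SHA-1 as the chunk name, and a `(url, body, cacheable)` tuple as the response format.

```python
import hashlib


def object_byte_hit_rate(responses):
    """Byte hit rate of a whole-object cache keyed by URL.

    Only responses marked HTTP-cacheable can be stored or hit.
    """
    cache = {}
    hit = total = 0
    for url, body, cacheable in responses:
        total += len(body)
        if cacheable and cache.get(url) == body:
            hit += len(body)
        elif cacheable:
            cache[url] = body
    return hit / total if total else 0.0


def chunk_byte_hit_rate(responses, chunk_size=128):
    """Byte hit rate of a content-based cache over fixed-size chunks.

    Cacheability and URL are ignored; only chunk content matters.
    """
    seen = set()
    hit = total = 0
    for _url, body, _cacheable in responses:
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            total += len(chunk)
            if digest in seen:
                hit += len(chunk)
            else:
                seen.add(digest)
    return hit / total if total else 0.0


# Hypothetical trace: an uncacheable object, an updated version of it,
# and two cacheable fetches of the original.
body = bytes(range(256))
trace = [
    ("/v1", body, False),
    ("/v2", body[:128] + b"x" * 128, False),
    ("/v1", body, True),
    ("/v1", body, True),
]
print(object_byte_hit_rate(trace), chunk_byte_hit_rate(trace))
```

The chunk cache also gets hits on the uncacheable and partially updated responses, which the object cache cannot serve.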


24

Ideal Cache Hit Rate

HTTP object-based: 17~28%
  Mainly effective for JavaScript and images

Content-based: 42~51% with 128-byte chunks
  Effective for any content type
  1.8~2.5x relative to HTTP object-based

Growth of tail that hurts caching


25

Origins of Redundancy

Most of the additional savings come from redundancy
  across different versions (intra-URL)
  across different objects (inter-URL)

[Figure: origins of redundancy, US, 128-byte chunks; annotations: Content updates, Aborted.]
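A rough way to attribute chunk hits to intra-URL or inter-URL redundancy is sketched below. It is a simplified classification under stated assumptions (fixed-size 128-byte chunks, SHA-1 chunk names, an infinite cache, and a hit counted as intra-URL whenever the chunk was seen earlier under the same URL); the study's actual breakdown may use more categories.

```python
import hashlib
from collections import Counter, defaultdict


def classify_chunk_hits(responses, chunk_size=128):
    """Count bytes by origin of redundancy.

    responses: list of (url, body) in arrival order.
    Returns a Counter with byte totals for 'miss', 'intra-URL'
    and 'inter-URL'.
    """
    urls_by_chunk = defaultdict(set)   # chunk digest -> URLs it appeared in
    bytes_by_kind = Counter()
    for url, body in responses:
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            seen_under = urls_by_chunk[hashlib.sha1(chunk).digest()]
            if not seen_under:
                kind = "miss"
            elif url in seen_under:
                kind = "intra-URL"     # e.g., a content update of the same URL
            else:
                kind = "inter-URL"     # same bytes served under another URL
            bytes_by_kind[kind] += len(chunk)
            seen_under.add(url)
    return bytes_by_kind


# Hypothetical trace: /news is updated once, and /other shares half of it.
base = bytes(range(256))
trace = [
    ("/news", base),
    ("/news", base[:128] + b"y" * 128),
    ("/other", base[128:]),
]
print(dict(classify_chunk_hits(trace)))
```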


26

Required Cache Storage Size

1-KB chunks outperform 128-byte chunks once metadata overhead is counted

MRC: Multi-Resolution Chunking (USENIX’10)

Increases working set size
  Large cache storage highly desirable

CN: 218GB


27

Conclusions

Analyzed five years of real Web traffic with over 70,000 users

Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users

Developed StreamStructure
  Half of the traffic occurs due to client-side interactions after initial page loads
  Pages have become increasingly complex

Content-based caching with large cache storage highly desirable
  2x larger byte hit rate, aborted transfers


28

http://www.cs.princeton.edu/~sihm/

Thank You
