Efficient and Adaptive Replication using Content Clustering
Yan Chen, EECS Department, UC Berkeley


TRANSCRIPT

  • Efficient and Adaptive Replication using Content Clustering
    Yan Chen, EECS Department, UC Berkeley

  • Motivation
    The Internet has evolved into a commercial infrastructure for service delivery: Web delivery, VoIP, streaming media, ...
    Challenges for Internet-scale services:
    - Scalability: 600M users, 35M Web sites, 28 Tb/s
    - Efficiency: bandwidth, storage, management
    - Agility: dynamic clients/network/servers
    - Security, etc.
    Focus on content delivery via Content Distribution Networks (CDNs): 4 billion Web pages in total, daily growth of 7M pages, projected annual growth of 200% for the next 4 years

  • CDN and its Challenges

  • CDN and its Challenges
    - Inefficient replication
    - No coherence for dynamic content
    - Unscalable network monitoring: O(M*N)

  • SCAN: Scalable Content Access Network
    CDN applications (e.g., streaming media)
    - Provision: cooperative clustering-based replication; user behavior/workload monitoring
    - Coherence: update multicast tree construction
    - Network performance monitoring: network distance/congestion/failure estimation
    (red: my work; black: out of scope)

  • SCAN
    - Coherence for dynamic content
    - Cooperative clustering-based replication (diagram: replicas s1, s4, s5)

  • SCAN
    - Scalable network monitoring: O(M+N)
    - Cooperative clustering-based replication (diagram: replicas s1, s4, s5)
    - Coherence for dynamic content

  • Internet-scale Simulation
    Network topology:
    - Pure-random, Waxman & transit-stub synthetic topologies
    - An AS-level topology from 7 widely-dispersed BGP peers
    Web workload:
    - Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
    - Aggregate NASA Web clients by domain names
    - Map the client groups onto the topology

  • Internet-scale Simulation: E2E Measurement
    NLANR Active Measurement Project data set:
    - 111 sites in America, Asia, Australia and Europe
    - Round-trip time (RTT) between every pair of hosts every minute; 17M daily measurements
    - Raw data: Jun.–Dec. 2001, Nov. 2002
    Keynote measurement data:
    - Measures TCP performance from about 100 worldwide agents
    - Heterogeneous core network: various ISPs
    - Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
    - Targets: 40 most popular Web servers + 27 Internet Data Centers
    - Raw data: Nov.–Dec. 2001, Mar.–May 2002

  • Clustering Web Content for Efficient Replication

  • Overview
    - CDNs use non-cooperative replication, which is inefficient
    - Paradigm shift: cooperative push
    - Where to push: greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01]
    - But what content should be pushed, and at what granularity?
    - Clustering of objects for replication: close-to-optimal performance with small overhead
    - Incremental clustering: push before content is accessed, improving availability during flash crowds

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Conventional CDN: Non-cooperative Pull
    (diagram: Client 1, Web content server, ISP 1, ISP 2)
    Inefficient replication

  • SCAN: Cooperative Push
    (diagram: CDN name server, Client 1, ISP 1, ISP 2)
    Significantly reduces the # of replicas and the update cost

  • Comparison between Conventional CDNs and SCAN

  • Problem Formulation
    Find a scalable, adaptive replication strategy that reduces:
    - Clients' average retrieval cost
    - Replica location computation cost
    - Amount of replica directory state to maintain
    subject to a total replication cost constraint (e.g., # of URL replicas)
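    The "where to push" side of this formulation is the greedy placement the overview slide cites [JJKRS01, QPV01]: repeatedly place the replica that most reduces the clients' total retrieval cost until the budget is spent. A minimal sketch, assuming a toy cost model where each client group fetches from its nearest replica (all names and data shapes here are illustrative, not the talk's implementation):

```python
def greedy_placement(dist, clients, candidates, budget):
    """dist[c][s]: retrieval cost from client group c to candidate server s."""
    placed = set()

    def total_cost(replicas):
        # Each client group fetches from its nearest replica.
        return sum(min(dist[c][s] for s in replicas) for c in clients)

    for _ in range(budget):
        # Place the candidate that yields the lowest total cost so far.
        best = min(
            (s for s in candidates if s not in placed),
            key=lambda s: total_cost(placed | {s}),
        )
        placed.add(best)
    return placed

# Toy example: 3 client groups, 3 candidate servers, budget of 2 replicas.
dist = {
    "c1": {"s1": 1, "s2": 5, "s3": 9},
    "c2": {"s1": 6, "s2": 1, "s3": 9},
    "c3": {"s1": 9, "s2": 8, "s3": 1},
}
print(greedy_placement(dist, ["c1", "c2", "c3"], ["s1", "s2", "s3"], 2))
```

    Each iteration is a full cost evaluation per remaining candidate, which is why per-URL granularity becomes so expensive at scale.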

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Per Web site (diagram)

  • Replica Placement: Per Website vs. Per URL
    - 60–70% average retrieval cost reduction for the per-URL scheme
    - But per-URL replication is too expensive to manage!

    [Charts: average retrieval cost vs. average number of replicas per URL, replicate per Website vs. per URL (MSNBC 8/1 trace, 1000 objects, 1000 subnets; NASA 7/1 trace); stability analysis of MSNBC traces: average retrieval cost by date (8/3–10/1) for static clustering with old replication, static clustering with re-replication, re-clustering with re-replication (optimal), and offline incremental clustering]

  • Overhead Comparison
    where R = # of replicas per URL, M = # of URLs

    Replication Scheme | State to Maintain | Computation Cost
    Per Website        | O(R)              | O(R)
    Per URL            | O(R·M)            | O(R·M)

    Computing on average 10 replicas/URL for the top 1000 URLs takes several days on a normal server!

  • Overhead Comparison
    where R = # of replicas per URL, K = # of clusters, M = # of URLs (M >> K)

    Replication Scheme | State to Maintain | Computation Cost
    Per Website        | O(R)              | O(R)
    Per Cluster        | O(R·K + M)        | O(R·K)
    Per URL            | O(R·M)            | O(R·M)

  • Clustering Web Content
    General clustering framework:
    - Define the correlation distance between URLs
    - Cluster diameter: the max distance between any two members, i.e., the worst correlation within a cluster
    - Generic clustering: minimize the max diameter over all clusters
    Correlation distance definitions based on:
    - Spatial locality
    - Temporal locality
    - Popularity
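    The framework above can be sketched with a simple greedy rule: a URL joins the first cluster whose diameter (max pairwise correlation distance) stays within a limit, otherwise it starts a new cluster. The threshold-based policy and all names here are illustrative assumptions, not the talk's exact algorithm:

```python
def cluster_by_diameter(urls, distance, max_diameter):
    clusters = []
    for u in urls:
        for c in clusters:
            # Joining c must keep every pairwise distance within the diameter bound.
            if all(distance(u, v) <= max_diameter for v in c):
                c.append(u)
                break
        else:
            clusters.append([u])  # no cluster can absorb u: start a new one
    return clusters

# Toy example: 1-D "access vectors", absolute difference as the distance.
points = [0.0, 0.1, 0.9, 1.0, 0.05]
clusters = cluster_by_diameter(points, lambda a, b: abs(a - b), 0.2)
print(clusters)  # two clusters: one near 0, one near 1
```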

  • Spatial Clustering
    Correlation distance between two URLs defined as:
    - Euclidean distance
    - Vector (cosine) similarity
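    The two distances named on the slide, sketched over URL "access vectors" (accesses per client group). These are the standard definitions; whether the talk normalizes the vectors first is an assumption:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

u1 = [10, 0, 5]   # accesses to URL 1 from three client groups
u2 = [20, 0, 10]  # same access pattern, twice the volume

print(euclidean(u1, u2))          # sensitive to magnitude
print(cosine_similarity(u1, u2))  # 1.0: identical direction
```

    The contrast explains why the two metrics cluster differently: Euclidean distance separates URLs by traffic volume, while cosine similarity groups URLs whose accesses come from the same client populations regardless of volume.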

  • Clustering Web Content (cont'd)
    Popularity-based clustering:
    - Or, even simpler, sort the URLs by popularity and put the first N/K elements into the first cluster, etc. (binary correlation)
    Temporal clustering:
    - Divide traces into multiple individual access sessions [ABQ01]
    - Measure URL correlations within each session
    - Average over the multiple sessions in one day
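    The "even simpler" popularity scheme above can be sketched in a few lines: rank URLs by access frequency and cut the ranked list into K equal-size clusters. Variable names are illustrative:

```python
def popularity_clusters(freq, k):
    """freq: URL -> access count. Returns k clusters of N/k URLs each."""
    ranked = sorted(freq, key=freq.get, reverse=True)  # most popular first
    size = len(ranked) // k
    return [ranked[i * size:(i + 1) * size] for i in range(k)]

freq = {"/a": 900, "/b": 850, "/c": 40, "/d": 35, "/e": 2, "/f": 1}
print(popularity_clusters(freq, 3))
```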

  • Performance of Cluster-based Replication
    - Spatial clustering with Euclidean distance and popularity-based clustering perform the best
    - A small # of clusters (only 1–2% of the # of URLs) achieves close to per-URL performance with much less overhead
    (MSNBC, 8/2/1999, 5 replicas/URL)

    [Charts: average retrieval cost vs. number of clusters for spatial clustering (Euclidean distance, cosine similarity), temporal clustering, and popularity-based clustering, MSNBC above and NASA below; computational cost (hours) vs. number of clusters]

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Static Clustering and Replication
    - Two daily traces: a training trace and a new trace
    - Static clustering performs poorly beyond a week

    [Charts: average retrieval cost on new traces (8/3–10/1) for static clustering 1, static clustering 2, and re-clustering with re-replication (optimal); offline incremental clustering (step 1 and complete) against the optimal; replication cost normalized by that of the optimal case]

  • Incremental Clustering
    Generic framework:
    - If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
    - Else create new clusters and replicate them
    Two types of incremental clustering:
    - Online (without any access logs): high availability
    - Offline (with access logs): close-to-optimal performance
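    The generic framework above can be sketched as follows. The match predicate and the cluster data shape are assumptions; the key point is that a matching URL inherits the cluster's existing replica set, while a non-matching one seeds a fresh cluster:

```python
def incremental_add(url, clusters, matches):
    """clusters: list of dicts {'urls': [...], 'replicas': set(...)}."""
    for c in clusters:
        if matches(url, c):
            c["urls"].append(url)   # reuse the cluster...
            return c["replicas"]    # ...and push the URL to its replicas
    fresh = {"urls": [url], "replicas": set()}
    clusters.append(fresh)          # otherwise start a new cluster
    return fresh["replicas"]

# Toy match rule: same top-level path segment (an illustrative stand-in
# for the real correlation-distance test).
clusters = [{"urls": ["/news/1"], "replicas": {"s1", "s4"}}]
matches = lambda url, c: url.split("/")[1] == c["urls"][0].split("/")[1]

print(incremental_add("/news/2", clusters, matches))    # inherits {s1, s4}
incremental_add("/sports/1", clusters, matches)
print(len(clusters))                                    # a new cluster appeared
```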

  • Online Incremental Clustering
    - Predict access patterns based on semantics; simplify to popularity prediction
    - Groups of URLs with similar popularity? Use hyperlink structures!
    - Candidate groups: siblings, and URLs at the same hyperlink depth (smallest # of links from the root)

  • Online Popularity Prediction
    Experiments:
    - Crawl http://www.msnbc.com on 5/3/2002 to hyperlink depth 4, then group the URLs
    - Use the corresponding access logs to analyze the correlation
    - Measure the divergence of URL popularity within a group
    - Groups of siblings have the best correlation
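    One plausible way to "measure the divergence of URL popularity within a group" is the coefficient of variation of the members' access counts; the slide does not give the exact metric, so this formula and the numbers are assumptions for illustration:

```python
import statistics

def popularity_divergence(access_counts):
    """Coefficient of variation: population stddev over mean (0 if no accesses)."""
    mean = statistics.mean(access_counts)
    return statistics.pstdev(access_counts) / mean if mean else 0.0

siblings = [120, 100, 110]    # a sibling group: similar popularity
same_depth = [1000, 12, 300]  # a same-depth group: widely divergent

# The sibling group should show much lower divergence.
print(popularity_divergence(siblings) < popularity_divergence(same_depth))
```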

  • Semantics-based Incremental Clustering
    - Put a new URL into the existing cluster with the largest # of siblings; in case of a tie, choose the cluster with more replicas
    Simulation on the 5/3/2002 MSNBC trace:
    - 8–10am trace: static popularity clustering + replication
    - At 10am: 16 new URLs, handled by online incremental clustering + replication
    - Evaluation with the 10am–12pm trace: the 16 URLs received 33K requests
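    The sibling rule above reduces to a two-level ranking: sibling overlap first, replica count as the tie-breaker. A minimal sketch, with data shapes assumed for illustration:

```python
def place_new_url(siblings, clusters):
    """Pick the cluster holding the most siblings; break ties by replica count.

    siblings: set of sibling URLs of the new URL.
    clusters: list of dicts {'urls': set(...), 'replicas': int}.
    """
    return max(
        clusters,
        key=lambda c: (len(siblings & c["urls"]), c["replicas"]),
    )

clusters = [
    {"urls": {"/news/1", "/news/2"}, "replicas": 3},
    {"urls": {"/sports/1"}, "replicas": 7},
]
chosen = place_new_url({"/news/1", "/news/2", "/news/9"}, clusters)
print(chosen["replicas"])  # the /news cluster wins: it holds two siblings
```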

  • Online Incremental Clustering and Replication Results
    - Retrieval cost is 1/8 of that with no replication of new URLs, and 1/5 of that with random replication

    [Chart: average retrieval cost for no replication of new URLs (457), random replication of new URLs (259), and online incremental clustering & replication (56)]

  • Online Incremental Clustering and Replication Results
    - Double the optimal retrieval cost, but only 4% of its replication cost

    [Chart: average retrieval cost for no replication of new URLs (457), random replication (259), online incremental clustering & replication (56), and complete re-clustering & re-replication, the oracle (26.2)]

  • Conclusions
    Cooperative, clustering-based replication:
    - Cooperative push: only 4–5% of the replication/update cost of existing CDNs
    - URL clustering reduces the management/computational overhead by two orders of magnitude; spatial clustering and popularity-based clustering recommended
    - Incremental clustering adapts to emerging URLs; hyperlink-based online incremental clustering gives high availability and performance improvement
    - Replicas self-organize into an application-level multicast tree for update dissemination
    - Scalable overlay network monitoring: O(M+N) instead of O(M*N), given M client groups and N servers

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Future Research (I)
    Measurement-based Internet study and protocol/architecture design:
    - Use inference techniques to develop Internet behavior models (network operators are reluctant to reveal internal network configurations)
    - Root cause analysis: mining large, heterogeneous data sets; leverage graphics/visualization for interactive mining
    - Apply the deeper understanding of Internet behavior to reassess and redesign protocols/architecture
    - E.g., are Internet bottlenecks at peering links? How and why? Implications?

  • Future Research (II)
    Network traffic anomaly characterization, identification and detection:
    - Many unknown flow-level anomalies revealed by real router traffic analysis (AT&T)
    - Profile traffic patterns of new applications (e.g., P2P) to separate out benign anomalies
    - Understand the cause, pattern and prevalence of other unknown anomalies
    - Identify malicious patterns for intrusion detection, e.g., fighting the Sapphire/Slammer worm

  • Backup Materials

  • SCAN
    - Coherence for dynamic content
    - Cooperative clustering-based replication
    - Scalable network monitoring: O(M+N) (diagram: replicas s1, s4, s5)

  • Problem Formulation
    Find a scalable, adaptive replication strategy that reduces the average access cost, subject to a total replication cost constraint (e.g., # of URL replicas)

  • Simulation Methodology
    Network topology:
    - Pure-random, Waxman & transit-stub synthetic topologies
    - An AS-level topology from 7 widely-dispersed BGP peers
    Web workload:
    - Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
    - Aggregate NASA Web clients by domain names
    - Map the client groups onto the topology

  • Online Incremental Clustering
    - Predict access patterns based on semantics; simplify to popularity prediction
    - Groups of URLs with similar popularity? Use hyperlink structures!
    - Candidate groups: siblings, and URLs at the same hyperlink depth (smallest # of links from the root)

  • Challenges for CDN
    Over-provisioning for replication:
    - Provide good QoS to clients (e.g., latency bound, coherence)
    - Small # of replicas, with small delay and bandwidth consumption for updates
    Replica management:
    - Scalability: billions of replicas if replicating per URL; O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks
    - Adaptation to the dynamics of content providers and customers
    Monitoring:
    - User workload monitoring
    - End-to-end network distance/congestion/failure monitoring
    - Measurement scalability; inference accuracy and stability

  • SCAN Architecture
    Leverages Decentralized Object Location and Routing (DOLR), i.e., Tapestry, for:
    - Distributed, scalable location with guaranteed success
    - Search with locality
    - Soft-state maintenance of the dissemination tree (for each object)
    (diagram: data plane and network plane; data source, Web server, SCAN server; request location; dynamic replication/update and content management)

  • Wide-area Network Measurement and Monitoring System (WNMMS)
    - Select a subset of SCAN servers to be monitors
    - E2E estimation of distance, congestion, and failures on the network plane
    (diagram: clusters A, B, C; clients, monitors, SCAN edge servers)

  • Dynamic Provisioning
    Dynamic replica placement:
    - Meets clients' latency and servers' capacity constraints
    - Close-to-minimal # of replicas
    Replicas self-organize into an application-level multicast tree:
    - Small delay and bandwidth consumption for update multicast
    - Each node only maintains state for its parent & direct children
    Evaluated via simulation of:
    - Synthetic traces with various sensitivity analyses
    - Real traces from NASA and MSNBC
    Publications: IPTPS 2002, Pervasive Computing 2002
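    The per-node state claim above can be sketched as follows: each replica in the update multicast tree tracks only its parent and direct children, so dissemination proceeds hop by hop with O(children) state per node rather than any global view. Class and method names are assumed for illustration:

```python
class ReplicaNode:
    def __init__(self, name):
        self.name = name
        self.parent = None    # single parent link
        self.children = []    # direct children only: no global tree state

    def attach(self, child):
        child.parent = self
        self.children.append(child)

    def disseminate(self, update):
        # Forward an update down the tree; each hop consults local state only.
        # Returns the names of nodes reached (payload delivery elided).
        reached = [self.name]
        for c in self.children:
            reached += c.disseminate(update)
        return reached

root = ReplicaNode("s1")
for name in ("s4", "s5"):
    root.attach(ReplicaNode(name))
print(root.disseminate("object v2"))  # ['s1', 's4', 's5']
```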

  • Effects of the Non-Uniform Size of URLs
    - Replication cost constraint: bytes
    - Similar trends exist
    - Per-URL replication outperforms per-Website replication dramatically
    - Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective

  • Diagram of Internet Iso-bar
    (diagram: end hosts and landmarks; clusters A, B, C, each with a monitor)

  • Real Internet Measurement Data
    NLANR Active Measurement Project data set:
    - 119 sites in the US (106 after filtering out mostly-offline sites)
    - Round-trip time (RTT) between every pair of hosts every minute
    - Raw data: 6/24/00–12/3/01
    Keynote measurement data:
    - Measures TCP performance from about 100 agents
    - Heterogeneous core network: various ISPs
    - Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
    - Targets: Web site perspective (40 most popular Web servers) and 27 Internet Data Centers (IDCs)

  • Related Work
    Internet content delivery systems:
    - Web caching: client-initiated, server-initiated
    - Pull-based Content Delivery Networks (CDNs)
    - Push-based CDNs
    Update dissemination:
    - IP multicast, application-level multicast
    Network E2E distance monitoring systems

  • Web Proxy Caching
    [Diagram: client with proxy caches at ISP 1 and ISP 2]

  • Pull-based CDN
    [Diagram: client with CDN edge servers at ISP 1 and ISP 2]

  • Push-based CDN
    [Diagram: client with CDN edge servers at ISP 1 and ISP 2]

  • Internet Content Delivery Systems

    Properties                           | Web caching (client-initiated) | Web caching (server-initiated)                  | Pull-based CDNs (Akamai)              | Push-based CDNs             | SCAN
    Scalability for request redirection  | Pre-configured in browser      | Use Bloom filters to exchange replica locations | Centralized CDN name server           | Centralized CDN name server | Decentralized P2P location
    Efficiency (# of caches or replicas) | No cache sharing among proxies | Cache sharing                                   | No replica sharing among edge servers | Replica sharing             | Replica sharing
    Network-awareness                    | No                             | No                                              | Yes, unscalable monitoring system     | No                          | Yes, scalable monitoring system
    Coherence support                    | No                             | No                                              | Yes                                   | No                          | Yes

  • Previous Work: Update Dissemination
    - No inter-domain IP multicast
    - Application-level multicast (ALM) is unscalable:
      - Root maintains states for all children (Narada, Overcast, ALMI, RMX)
      - Root handles all join requests (Bayeux)
      - Root split is a common solution, but suffers consistency overhead

  • Design Principles
    - Scalability
      - No centralized point of control: P2P location services, Tapestry
      - Reduce management states: minimize # of replicas, object clustering
      - Distributed load balancing: capacity constraints
    - Adaptation to client dynamics
      - Dynamic distribution/deletion of replicas with regard to clients' QoS constraints
      - Incremental clustering
    - Network-awareness and fault-tolerance (WNMMS)
      - Distance estimation: Internet Iso-bar
      - Anomaly detection and diagnostics

  • Comparison of Content Delivery Systems (cont'd)

    Properties                            | Web caching (client-initiated) | Web caching (server-initiated) | Pull-based CDNs (Akamai)          | Push-based CDNs | SCAN
    Distributed load balancing            | No                             | Yes                            | Yes                               | No              | Yes
    Dynamic replica placement             | Yes                            | Yes                            | Yes                               | No              | Yes
    Network-awareness                     | No                             | No                             | Yes, unscalable monitoring system | No              | Yes, scalable monitoring system
    No global network topology assumption | Yes                            | Yes                            | Yes                               | No              | Yes

  • Network-awareness (cont'd)
    - Loss/congestion prediction
      - Maximize true positives and minimize false positives
    - Orthogonal loss/congestion path discovery
      - Without knowing the underlying topology
    - How stable is such orthogonality?
      - Degradation of orthogonality over time
      - Reactive and proactive adaptation for SCAN

    # of Internet users: http://www.usabilitynews.com/news/article637.asp - will reach one billion by 2005. Total Internet traffic: http://www.cs.columbia.edu/~hgs/internet/traffic.html

    Efficiency: design systems with growth potential

    Amazing growth in WWW traffic: daily growth of roughly 7M Web pages; annual growth of 200% predicted for the next 4 years.

    1M page growth: Scientific American, June 1999 issue.

    7M page growth rate: http://cyberatlas.internet.com/big_picture/traffic_patterns/article/0,,5931_413691,00.html - 4 billion pages in total.

    The convergence in the digital world of voice, data and video is expected to lead to a compound annual growth rate of 200% of Web traffic over the next four years -- http://www.skybridgesatellite.com/l21_mark/cont_22.htm

    Define the term "replica". There are large and popular Web servers, such as CNN.com and msnbc.com, which need to improve performance and scalability. Let's imagine there is a Web server in Cambridge, London, and one client at Northwestern tries to access some content. Later on, a nearby client from U of Chicago tries to access similar content and is served directly from the CDN server.

    So the CDN reduces latency for the client, reduces bandwidth for Web servers, and improves the scalability and availability of the Web content server. It also helps the Internet as a whole by reducing long-haul traffic. Inefficient replication has two effects: 1. it wastes a lot of replication bandwidth and, consequently, update bandwidth; 2. the working set includes all the Web objects, which are cached in but constantly replaced before serving more clients.

    Questions on consistent caching: it is another type of hash table for the directory scheme, with a high probability of query success. It mainly supports a fixed # of replicas for each URL and is not very flexible for hot URLs; one has to continuously change the hash functions. Secondly, it doesn't really record the locations of replicas, so it can't update them when a change occurs. CDN applications are not addressed in the thesis. SCAN puts objects into clusters; in each cluster, the objects are likely to be accessed by clients that are topologically close - for example, the red and yellow URLs. Given millions of URLs, clustering-based replication can dramatically reduce the amount of replica location state, making it affordable to build replica directories. Further, SCAN pushes the cluster replicas to certain strategic locations, then uses replica directories to forward clients' requests to these replicas. We call this cooperative push; it significantly reduces the # of replicas and, consequently, the update cost.

    With M client groups and N servers: O(M+N) instead of O(M*N). SCAN has full control of the replication cost and can also improve the caching effect.

    How do we evaluate an architecture and algorithms designed for Internet-scale services? We can't deploy them on thousands of nodes to test scalability or performance.

    Two ways: analytical back-of-the-envelope calculation, and realistic simulation - using real-world traces and measurements to understand Internet behavior, then applying them to evaluate the algorithms and architecture.

    In my thesis, I developed wide collaborations with industry and research labs to obtain these real-world traces and measurements. They include the network topology, Web workload, and E2E network distance measurements.

    Throughout our study, we use measurement-based simulation and analysis. The method has two parts: topology and workload. MSNBC is consistently ranked among the top news sites, with highly popular and dynamic content; in comparison, the NASA '95 traces are more static and much less accessed. We use various network topologies and Web workloads for simulation. All our results are based on this methodology. The traces are applied throughout our experiments, except for the online incremental clustering, which requires the full Web contents.

    For MSNBC, 10K client groups are left; we choose the top 10%, covering >70% of requests.

    Emphasize here that this is an abstract cost; it could be latency or # of hops, depending on the simulation. We collect several months of data to study not only the performance, but also its stability.

    First, given a replication bandwidth cost constraint, choosing the optimal replica locations is an NP-complete problem. The greedy algorithm has been shown to be efficient and can achieve close-to-optimal performance.
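The greedy placement mentioned above can be sketched as follows: at each step, place a replica on the candidate server that most reduces the total access cost, where each client group is served by its nearest chosen replica. This is a minimal sketch with a made-up latency matrix, not the dissertation's code.

```python
def greedy_placement(cost, k):
    """Greedy replica placement.

    cost[c][s] is the access cost from client group c to candidate
    server s.  Repeats k times: pick the unchosen server that minimizes
    the total cost over all client groups, each group being served by
    its cheapest already-chosen replica."""
    clients = list(cost)
    servers = {s for row in cost.values() for s in row}
    chosen = []
    best = {c: float("inf") for c in clients}  # cheapest replica so far
    for _ in range(k):
        def total_with(s):
            return sum(min(best[c], cost[c][s]) for c in clients)
        s = min(servers - set(chosen), key=total_with)
        chosen.append(s)
        for c in clients:
            best[c] = min(best[c], cost[c][s])
    return chosen

# Hypothetical client-group -> server latency matrix (ms).
cost = {
    "asia":   {"s1": 30,  "s2": 200, "s3": 180},
    "europe": {"s1": 190, "s2": 25,  "s3": 60},
    "us":     {"s1": 210, "s2": 70,  "s3": 20},
}
print(greedy_placement(cost, k=2))
# -> ['s3', 's1']  (s3 first minimizes total cost; s1 then covers Asia)
```

The greedy heuristic is not optimal in general, but it is the standard baseline this work builds on, and its per-step cost is what makes per-URL placement so expensive at Web scale.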

    Previous work uses per-Website replication. By looking at a finer granularity, we found that replication on a per-URL basis can significantly reduce clients' average latency; however, it suffers from large management overhead. As a solution, we propose clustering the URLs and replicating in units of clusters.

    At what granularity - per Website or per URL? Here "URL" refers to the Web object that the URL address points to. The tradeoff is replication performance, in terms of users' access latency, vs. management overhead.

    Of course, we don't have to replicate all Web content, especially the particularly dynamic pieces. According to Anja Feldmann's Infocom '99 paper, about 40% of requests are dynamically generated, but we can still push the remaining 60%.

    Pushing replicas with greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01], but it is expensive for replica location management.

    Architecture: we will compare uncooperative pull-based vs. cooperative push-based replication. There are many ISPs, as well as CDN name servers and the Web content server, which hosts the Web contents denoted by the green box. When client 1 requests the green URL, the hostname resolution has to go through the CDN name server, which returns the IP address of the local CDN server of ISP 1. Client 1 then sends its request to that local CDN server, which essentially acts as a cache. The problem is that the CDN name servers don't track where the content has been replicated: when client 2 requests the green URL, the CDN name server still replies with the IP address of the CDN server of ISP 2, although the content has already been replicated on CDN server 1, which could be quite close to client 2. The 4% figure: the comparison is under similar average latency for the two schemes. The problem, then, is how to do cooperative push efficiently. IDCs charge CDNs by the amount of bandwidth used, which translates to the bytes replicated; here we simplify this to the # of URL replicas, assuming the URLs are of the same size. As we show later, the non-uniform sizes don't really affect the results. We use greedy algorithms for both the per-Website and per-URL schemes. For the per-URL scheme, it takes 102 hours to compute 10 replicas/URL for the MSNBC traces on a PII-400 machine.

    We show that with K only 1-2% of M, we can achieve performance close to that of M. Clustering costs will be addressed separately, since they differ for static clustering and incremental clustering.


    Although the greedy algorithm has been shown to be the most cost-effective heuristic for the NP-complete replica location problem, it is computationally expensive at per-URL granularity.

    Semantics: contents in the same directory do not necessarily correlate well in terms of user access.

    We use the K-split algorithm by Gonzalez, with cost O(NK). Intuition: each vector uniquely represents a URL in the high-dimensional space. Vector similarity is the cosine of the angle between two vectors, regardless of their lengths; in other words, it characterizes only the relative ratio of the # of accesses from each client cluster, rather than the absolute values.
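The note above combines two pieces: cosine distance over spatial access vectors (one count per client cluster) and Gonzalez's farthest-point K-split. A minimal sketch under those assumptions, with made-up access vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle).  Insensitive to vector length, so it compares the
    *ratio* of accesses from each client cluster, not absolute counts."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def k_split(vectors, k):
    """Gonzalez's farthest-point clustering, cost O(N*K): repeatedly
    pick the point farthest from the chosen centers, then assign every
    point to its nearest center."""
    names = list(vectors)
    centers = [names[0]]
    while len(centers) < k:
        far = max(names, key=lambda n: min(
            cosine_distance(vectors[n], vectors[c]) for c in centers))
        centers.append(far)
    return {n: min(centers, key=lambda c:
                   cosine_distance(vectors[n], vectors[c])) for n in names}

# Hypothetical spatial access vectors: accesses per client cluster.
access = {
    "/news/a":   (100, 10, 5),
    "/news/b":   (50, 5, 2),    # same access *ratio* as /news/a
    "/sports/c": (3, 80, 90),
    "/sports/d": (1, 40, 50),
}
print(k_split(access, k=2))  # the two /news URLs land in one cluster
```

Note that /news/a and /news/b end up together despite very different absolute counts, which is exactly the length-invariance the cosine measure provides.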

    Mention semantics-based clustering.

    Here the spatial access vector essentially captures the aggregated access pattern of that URL. Temporal clustering tries to characterize how frequently two URLs are accessed together in the same session, normalized by the total # of accesses.

    Popularity-based clustering uses pure access counts (no matter where they come from) for clustering. It performs very well, which may imply that popular URLs are globally popular. Tested over various topologies and traces.

    Popularity-based clustering also performs very well, implying that popular URLs are globally popular and that URLs with similar popularity may have similar aggregated access patterns. We use this conjecture in our incremental clustering later.

    1-2% of URLs, say 10 clusters

    The major reason for dynamic clustering is the emergence of new content, which needs to be appropriately clustered and replicated, especially when there is no access history. The goal is two-fold: 1. minimal perturbation of the existing clusters; 2. clustering and replication that reduce retrieval cost. After the framework: two options and different implementations. We replicate the URLs before they are accessed; the challenge is how to predict access patterns from semantic information alone. From the previous study, we know that popularity-based clustering performs very well, i.e., URLs with similar popularity have similar aggregated access patterns. How do we find groups of URLs with similar popularity, so that we can infer a new URL's popularity from the popularity of old URLs in the same group? We explored two simple options using hyperlink structure: one is groups of siblings, the other is groups at the same hyperlink depth. To compare the two methods, we measure the divergence within each group; groups of siblings have the best correlation. Then talk about the graph! We carried out the following experiments. First, we use popularity-based clustering and replication to distribute the URLs in the 8-10am two-hour traces. Then, at 10am, 16 new URLs are created and need to be replicated; we use the online incremental clustering algorithm above to cluster and replicate them.
    We then observe their performance over the next two hours and compare with other schemes. First, compared with static clustering and no replication for the 16 URLs, our scheme reduces the average retrieval cost to only 1/8. Second, compared with random replication of the 16 URLs with the same # of replicas, our scheme reduces the average retrieval cost to about 20%. Finally, compared with the oracle case (which assumes knowledge of the next two hours' accesses in advance and uses expensive static clustering and replication), our scheme doubles the cost. But the oracle case can never be achieved.
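The sibling-group heuristic described in the notes above - infer a brand-new URL's popularity from old URLs linked by the same parent page - can be sketched in a few lines. Data, URL names, and the plain mean are all illustrative assumptions; the real system works on the full Web content and traces.

```python
# Predict a brand-new URL's popularity from its hyperlink siblings:
# URLs linked from the same parent page tend to have similar popularity,
# so a new URL inherits the mean access count of its observed siblings.

siblings_of = {                       # parent page -> URLs it links to
    "/index.html":   ["/top1.html", "/top2.html", "/breaking.html"],
    "/archive.html": ["/old1.html", "/old2.html", "/stale.html"],
}
accesses = {                          # observed counts for old URLs
    "/top1.html": 9000, "/top2.html": 11000,
    "/old1.html": 40, "/old2.html": 60,
}

def predict_popularity(new_url):
    """Mean access count of the new URL's already-observed siblings."""
    for children in siblings_of.values():
        if new_url in children:
            seen = [accesses[u] for u in children if u in accesses]
            if seen:
                return sum(seen) / len(seen)
    return None  # no sibling history: fall back to a default policy

print(predict_popularity("/breaking.html"))  # -> 10000.0
print(predict_popularity("/stale.html"))     # -> 50.0
```

The predicted popularity then decides which existing cluster (and hence which replica set) the new URL joins, before a single request for it has arrived.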

    1/8 and 1/5 compared with no replication and random replication, respectively; double the optimal. Compared w/ per-URL based replication,



    World Cup log peak time: 209K requests/min, 580MB transferred per minute. MSNBC SIGCOMM 2000 paper: numbers for the order of # of URLs and clients.

    The CDN operators set up the latency constraints based on the different classes of clients and on human perception of latency.

    What if there is no Tapestry? With a DHT, from the operator's point of view, we divide the problems; the other problem is solved. Why choose Tapestry? Introduce replicas, caches, etc., then updates. Tapestry: who is involved and the functionalities. Introduce the dissemination tree and its soft-state maintenance. Ongoing: better clustering methods, a dynamic service model. One reason is that size doesn't differ much across the top 1000 URLs (from several hundred bytes to tens of thousands of bytes). The scalability issue with Summary Cache is that it targets O(100) proxies and O(1M) pages, which is not as good as we need. The key problem is that proxy servers are usually installed by ISPs, which may not accept the cooperative model. Distributed load balancing is implemented through request redirection mechanisms, as in server-initiated Web caching, pull-based CDNs, and SCAN. Orthogonality(AC, BC) = 1 - co-occurrences of loss(AC, BC) / (loss occurrences(AC) + loss occurrences(BC)).
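The orthogonality formula at the end can be computed directly from per-path loss time series. A minimal sketch, with invented 8-interval loss indicators for paths AC and BC:

```python
def orthogonality(loss_ac, loss_bc):
    """Orthogonality(AC, BC) =
       1 - co-occurrences of loss / (losses on AC + losses on BC).

    loss_ac / loss_bc are boolean series: True if the path saw loss in
    that measurement interval.  Paths that never lose together score 1;
    paths whose losses co-occur score lower."""
    co = sum(1 for a, b in zip(loss_ac, loss_bc) if a and b)
    total = sum(loss_ac) + sum(loss_bc)
    if total == 0:
        return 1.0  # no losses at all: trivially orthogonal
    return 1.0 - co / total

# Hypothetical loss indicators over 8 measurement intervals.
ac = [True, False, False, True, False, False, False, True]
bc = [False, False, True, False, False, True, False, False]
print(orthogonality(ac, bc))  # no co-occurring losses -> 1.0
```

Two paths with high orthogonality make good independent vantage points for SCAN's loss/congestion prediction, which is why the stability of this score over time matters.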