Efficient and Adaptive Replication using Content Clustering
DESCRIPTION
Efficient and Adaptive Replication using Content Clustering. Yan Chen, EECS Department, UC Berkeley. Motivation: the Internet has evolved into a commercial infrastructure for service delivery (Web delivery, VoIP, streaming media, ...), which poses challenges for Internet-scale services.
TRANSCRIPT
-
Efficient and Adaptive Replication using Content Clustering
Yan Chen
EECS Department, UC Berkeley
-
Motivation
Internet has evolved to become a commercial infrastructure for service delivery: Web delivery, VoIP, streaming media, ...
Challenges for Internet-scale services:
Scalability: 600M users, 35M Web sites, 28 Tb/s
Efficiency: bandwidth, storage, management
Agility: dynamic clients/network/servers
Security, etc.
Focus on content delivery: Content Distribution Networks (CDNs)
4 billion Web pages in total, with daily growth of 7M pages
Annual growth of 200% predicted for the next 4 years
-
CDN and its Challenges
-
CDN and its Challenges
Inefficient replication
No coherence for dynamic content
Unscalable network monitoring: O(M*N)
-
SCAN: Scalable Content Access Network
CDN applications (e.g. streaming media)
Provision: cooperative clustering-based replication; user behavior/workload monitoring
Coherence: update multicast tree construction
Network performance monitoring: network distance/congestion/failure estimation
(red: my work, black: out of scope)
-
SCAN
Coherence for dynamic content
Cooperative clustering-based replication
(diagram: replica servers s1, s4, s5)
-
SCAN
Scalable network monitoring: O(M+N)
Cooperative clustering-based replication
Coherence for dynamic content
(diagram: replica servers s1, s4, s5)
-
Internet-scale Simulation
Network topology:
Pure-random, Waxman and transit-stub synthetic topologies
An AS-level topology from 7 widely-dispersed BGP peers
Web workload:
Aggregate MSNBC Web clients with BGP prefixes (BGP tables from a BBNPlanet router)
Aggregate NASA Web clients with domain names
Map the client groups onto the topology
-
Internet-scale Simulation: E2E Measurement
NLANR Active Measurement Project data set:
111 sites in America, Asia, Australia and Europe
Round-trip time (RTT) between every pair of hosts every minute; 17M daily measurements
Raw data: Jun.-Dec. 2001, Nov. 2002
Keynote measurement data:
Measures TCP performance from about 100 worldwide agents
Heterogeneous core network: various ISPs
Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
Targets: 40 most popular Web servers + 27 Internet Data Centers
Raw data: Nov.-Dec. 2001, Mar.-May 2002
-
Clustering Web Content for Efficient Replication
-
Overview
CDNs use non-cooperative replication, which is inefficient
Paradigm shift: cooperative push
Where to push: greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01]
But what content should be pushed, and at what granularity?
Clustering of objects for replication:
Close-to-optimal performance with small overhead
Incremental clustering:
Push before content is accessed: improves availability during flash crowds
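The greedy placement result cited above [JJKRS01, QPV01] can be sketched as follows. This is an illustrative reconstruction, not the papers' code: each iteration places the next replica at the node that most reduces the total client retrieval cost. The `dist` and `demand` inputs are hypothetical toy values.

```python
def greedy_placement(dist, demand, origin, k):
    """Greedy replica placement sketch.
    dist[i][j]: network distance between nodes i and j.
    demand[c]: request rate of the client group at node c.
    origin: node holding the original copy.
    k: number of additional replicas to place."""
    replicas = {origin}
    nodes = range(len(dist))

    def total_cost(reps):
        # each client group fetches from its nearest replica
        return sum(demand[c] * min(dist[c][r] for r in reps)
                   for c in nodes)

    for _ in range(k):
        best = min((n for n in nodes if n not in replicas),
                   key=lambda n: total_cost(replicas | {n}))
        replicas.add(best)
    return replicas

# Toy 4-node line topology 0 -- 1 -- 2 -- 3; node 3 is the hot client group
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
demand = [1, 1, 1, 10]
print(greedy_placement(dist, demand, origin=0, k=1))  # {0, 3}
```

As expected, the single extra replica lands at the node serving the heaviest demand.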
-
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future research
-
Conventional CDN: Non-cooperative Pull
(diagram: Client 1, Web content server, ISP 1, ISP 2)
Inefficient replication
-
SCAN: Cooperative Push
(diagram: CDN name server, Client 1, ISP 1, ISP 2)
Significantly reduces the # of replicas and the update cost
-
Comparison between Conventional CDNs and SCAN
-
Problem Formulation
Find a scalable, adaptive replication strategy to reduce:
Clients' average retrieval cost
Replica location computation cost
Amount of replica directory state to maintain
Subject to a certain total replication cost (e.g., # of URL replicas)
-
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future research
-
Per Web site (diagram: replica sites 1-4)
-
Replica Placement: Per Website vs. Per URL
60-70% average retrieval cost reduction for the per-URL scheme
But per-URL replication is too expensive to manage!
[Chart: average retrieval cost vs. average number of replicas per URL, comparing "Replicate per Website" against "Replicate per URL"; MSNBC 8/01 trace (1000 objects, 1000 subnets) with the NASA 7/01 case on the right. The spreadsheet residue here also carries series labels for later charts: "Replicate with access frequency clustering", "Stability analysis of MSNBC traces", and the static/re-clustering/offline-incremental-clustering comparisons over the 8/3-10/1 trace dates.]
-
Overhead Comparison

Replication scheme | State to maintain | Computation cost
Per Website        | O(R)              | O(R)
Per URL            | O(R*M)            | O(R*M)

where R = # of replicas per URL and M = # of URLs. Computing an average of 10 replicas/URL for just the top 1000 URLs takes several days on a normal server!
-
Overhead Comparison

Replication scheme | State to maintain | Computation cost
Per Website        | O(R)              | O(R)
Per Cluster        | O(R*K + M)        | O(R*K)
Per URL            | O(R*M)            | O(R*M)

where R = # of replicas per URL, K = # of clusters, and M = # of URLs (M >> K).
-
Clustering Web Content
General clustering framework:
Define the correlation distance between URLs
Cluster diameter: the maximum distance between any two members (the worst correlation within a cluster)
Generic clustering: minimize the maximum diameter of all clusters
Correlation distance defined based on:
Spatial locality
Temporal locality
Popularity
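The generic framework above can be sketched as a simple greedy pass. This is our illustration under a toy distance function, not the thesis algorithm: each URL joins the first cluster whose diameter (worst pairwise correlation distance) stays within a chosen bound, otherwise it starts a new cluster.

```python
def cluster_by_diameter(items, distance, max_diameter):
    """Greedy diameter-bounded clustering sketch.
    distance(u, v): correlation distance between two URLs."""
    clusters = []
    for u in items:
        for c in clusters:
            # diameter check: u must stay close to every member of c
            if all(distance(u, v) <= max_diameter for v in c):
                c.append(u)
                break
        else:
            clusters.append([u])   # no cluster fits: start a new one
    return clusters

# Toy example: 1-D "URLs" with absolute difference as the distance
urls = [0.0, 0.1, 0.9, 1.0, 0.05]
out = cluster_by_diameter(urls, lambda a, b: abs(a - b), 0.2)
print(out)   # [[0.0, 0.1, 0.05], [0.9, 1.0]]
```

Greedy assignment does not minimize the maximum diameter exactly, but it respects the bound, which is the property the replication scheme needs.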
-
Spatial Clustering
Correlation distance between two URLs defined as:
Euclidean distance
Vector similarity (cosine)
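Assuming each URL is represented by an access vector (requests per client group), the two correlation distances named above might be computed like this; a sketch with hypothetical vector contents:

```python
import math

def euclidean(a, b):
    # straight-line distance between two access vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 0 when the vectors point the same way, up to 2 when opposite
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Two URLs requested by the same client groups in the same proportions:
u1 = [10, 0, 5]
u2 = [20, 0, 10]
print(cosine_distance(u1, u2))   # 0.0: spatially identical access pattern
print(euclidean(u1, u2))         # nonzero: absolute volumes differ
```

The contrast shows why the two metrics behave differently: cosine similarity ignores overall popularity and keeps only the spatial pattern, while Euclidean distance mixes both.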
-
Clustering Web Content (cont'd)
Popularity-based clustering:
OR, even simpler, sort the URLs by popularity and put the first N/K elements into the first cluster, etc. (binary correlation)
Temporal clustering:
Divide traces into multiple individual access sessions [ABQ01]
In each session, compute the pairwise URL correlation distance
Average over multiple sessions in one day
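The "even simpler" popularity variant above (sort by popularity, then cut the ranking into K equal groups) can be sketched as follows; the URL names and counts are hypothetical:

```python
def popularity_clusters(popularity, k):
    """popularity: dict mapping URL -> access count; k: # of clusters."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    size = -(-len(ranked) // k)            # ceil(len / k) URLs per cluster
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

pops = {"/news": 900, "/sports": 850, "/weather": 40, "/local": 30}
print(popularity_clusters(pops, 2))
# [['/news', '/sports'], ['/weather', '/local']]
```

URLs of similar popularity end up in the same cluster, so each cluster can be replicated to a degree matching its demand.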
-
Performance of Cluster-based Replication
Spatial clustering with Euclidean distance and popularity-based clustering perform the best
A small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance, with much less overhead
(MSNBC, 8/2/1999, 5 replicas/URL)
[Chart: average retrieval cost vs. number of clusters for spatial clustering (Euclidean distance and cosine similarity), temporal clustering, and popularity/access-frequency clustering; MSNBC 8/2/1999 with 5 replicas/URL above, NASA below. A companion chart plots computational cost (hours) vs. number of clusters. The residue also carries series labels for the incremental-clustering comparison: no replication of new URLs, random replication of new URLs, online incremental clustering & replication, and complete re-clustering & re-replication (oracle).]
-
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future research
-
Static Clustering and Replication
Two daily traces: a training trace and a new trace
Static clustering performs poorly beyond a week
[Chart: average retrieval cost of static clustering (two variants) vs. re-clustering/re-replication (optimal) and offline incremental clustering (step 1 and complete) on new MSNBC traces dated 8/3-10/1, plus replication cost normalized by that of the optimal case. The residue also includes the raw per-replica-count cost tables for per-Web-site, spatial, and temporal clustering, and per-date bookkeeping (total new URLs, orphan URLs, new clusters, replicas reclaimed).]
-
Incremental Clustering
Generic framework:
If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
Else, create new clusters and replicate them
Two types of incremental clustering:
Online (without any access logs): high availability
Offline (with access logs): close-to-optimal performance
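A minimal sketch of the generic framework above; the helper names and the toy matching rule are our assumptions, not the thesis code:

```python
def incremental_add(url, clusters, matches, replicate):
    """clusters: list of (members, replica_sites) pairs.
    matches(url, members): True if url fits the cluster.
    replicate(url, sites): push url to each replica site."""
    for members, sites in clusters:
        if matches(url, members):
            members.append(url)
            replicate(url, sites)      # reuse the cluster's replicas
            return
    clusters.append(([url], []))       # orphan URL: start a new cluster

# Toy matching rule: same top-level path segment
same_section = lambda u, members: u.split("/")[1] == members[0].split("/")[1]
pushed = []
push = lambda u, sites: pushed.extend((u, s) for s in sites)

clusters = [(["/news/a"], ["s1", "s4"])]
incremental_add("/news/b", clusters, same_section, push)    # joins /news cluster
incremental_add("/sports/c", clusters, same_section, push)  # starts a new cluster
print(pushed)         # [('/news/b', 's1'), ('/news/b', 's4')]
print(len(clusters))  # 2
```

The key property is that a matched URL is pushed only to the replica sites its cluster already has, so no new placement computation is needed.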
-
Online Incremental Clustering
Predict access patterns based on semantics; simplify to popularity prediction
Groups of URLs with similar popularity? Use hyperlink structures!
Groups of siblings
Groups of the same hyperlink depth (smallest # of links from the root)
-
Online Popularity Prediction
Experiments:
Crawl http://www.msnbc.com on 5/3/2002 to hyperlink depth 4, then group the URLs
Use the corresponding access logs to analyze the correlation
Measure the divergence of URL popularity within a group
Groups of siblings have the best correlation
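The slide's divergence formula is not reproduced in the transcript. As one plausible stand-in, the coefficient of variation of per-URL hit counts captures the same idea: a group predicts popularity well when its members' hit counts are tightly spread. The hit counts below are hypothetical.

```python
import statistics

def popularity_divergence(hits):
    # coefficient of variation: stdev relative to the mean hit count
    return statistics.pstdev(hits) / statistics.fmean(hits)

siblings   = [100, 110, 95, 105]   # hypothetical sibling group
same_depth = [100, 10, 400, 2]     # hypothetical same-depth group
print(popularity_divergence(siblings) < popularity_divergence(same_depth))  # True
```

A lower divergence for sibling groups is exactly the experimental finding above: siblings are the better predictor of a new URL's popularity.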
-
Semantics-based Incremental Clustering
Put a new URL into the existing cluster with the largest # of siblings
In case of a tie, choose the cluster with more replicas
Simulation on the 5/3/2002 MSNBC trace:
8-10am trace: static popularity clustering + replication
At 10am: 16 new URLs, handled with online incremental clustering + replication
Evaluation with the 10am-12pm trace: the 16 URLs received 33K requests
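The assignment rule above can be sketched as follows; the data shapes (sets of member URLs, per-cluster replica counts) are our assumption:

```python
def assign_new_url(siblings, clusters, replica_count):
    """Pick the cluster holding the most siblings of the new URL;
    break ties in favor of the cluster with more replicas.
    siblings: set of URLs sharing the new URL's parent page.
    clusters: dict cluster_id -> set of member URLs.
    replica_count: dict cluster_id -> current # of replicas."""
    return max(clusters,
               key=lambda c: (len(clusters[c] & siblings), replica_count[c]))

clusters = {"c1": {"/a", "/b"}, "c2": {"/c", "/d", "/e"}}
replicas = {"c1": 3, "c2": 5}
# One sibling in each cluster, so the tie goes to c2 (more replicas):
print(assign_new_url({"/a", "/c"}, clusters, replicas))  # c2
```

Using the tuple key `(sibling count, replica count)` encodes both the primary rule and the tie-breaker in a single comparison.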
-
Online Incremental Clustering and Replication Results
1/8 the retrieval cost of no replication, and 1/5 that of random replication
[Chart: average retrieval cost of 457 for no replication of new URLs, 259 for random replication of new URLs, and 56 for online incremental clustering & replication.]
-
Online Incremental Clustering and Replication Results
Double the optimal retrieval cost, but only 4% of the optimal's replication cost
[Chart: the same comparison with the oracle added: average retrieval cost of 457 (no replication of new URLs), 259 (random replication), 56 (online incremental clustering & replication), and 26.2 (complete re-clustering & re-replication, oracle).]
-
Conclusions
Cooperative, clustering-based replication:
Cooperative push: only 4-5% of the replication/update cost of existing CDNs
URL clustering reduces the management/computational overhead by two orders of magnitude
Spatial clustering and popularity-based clustering are recommended
Incremental clustering to adapt to emerging URLs:
Hyperlink-based online incremental clustering for high availability and performance improvement
Self-organize replicas into an application-level multicast tree for update dissemination
Scalable overlay network monitoring:
O(M+N) instead of O(M*N), given M client groups and N servers
-
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future research
-
Future Research (I)
Measurement-based Internet study and protocol/architecture design:
Use inference techniques to develop Internet behavior models (network operators are reluctant to reveal internal network configurations)
Root cause analysis: large, heterogeneous data mining
Leverage graphics/visualization for interactive mining
Apply a deeper understanding of Internet behavior to the reassessment/design of protocols and architecture
E.g., are Internet bottlenecks at peering links? How and why? Implications?
-
Future Research (II)
Network traffic anomaly characterization, identification and detection:
Many unknown flow-level anomalies revealed by real router traffic analysis (AT&T)
Profile traffic patterns of new applications (e.g. P2P) as benign anomalies
Understand the cause, pattern and prevalence of other unknown anomalies
Identify malicious patterns for intrusion detection
E.g., fighting the Sapphire/Slammer worm
-
Backup Materials
-
SCAN
Coherence for dynamic content
Cooperative clustering-based replication
Scalable network monitoring: O(M+N)
(diagram: replica servers s1, s4, s5)
-
Problem Formulation
Find a scalable, adaptive replication strategy to reduce the average access cost
Subject to a certain total replication cost (e.g., # of URL replicas)
-
Simulation Methodology
Network topology:
Pure-random, Waxman and transit-stub synthetic topologies
An AS-level topology from 7 widely-dispersed BGP peers
Web workload:
Aggregate MSNBC Web clients with BGP prefixes (BGP tables from a BBNPlanet router)
Aggregate NASA Web clients with domain names
Map the client groups onto the topology
-
Online Incremental Clustering
Predict access patterns based on semantics; simplify to popularity prediction
Groups of URLs with similar popularity? Use hyperlink structures!
Groups of siblings
Groups of the same hyperlink depth: smallest # of links from the root
-
Challenges for CDN
Over-provisioning for replication:
Provide good QoS to clients (e.g., latency bound, coherence)
Small # of replicas with small delay and bandwidth consumption for updates
Replica management:
Scalability: billions of replicas if replicating per URL (O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks)
Adaptation to the dynamics of content providers and customers
Monitoring:
User workload monitoring
End-to-end network distance/congestion/failure monitoring
Measurement scalability
Inference accuracy and stability
-
SCAN Architecture
Leverages Decentralized Object Location and Routing (DOLR) - Tapestry - for:
Distributed, scalable location with guaranteed success
Search with locality
Soft-state maintenance of the dissemination tree (for each object)
(diagram: data plane and network plane; data source, Web server, SCAN server; request location; dynamic replication/update and content management)
-
Wide-area Network Measurement and Monitoring System (WNMMS)
Select a subset of SCAN servers to be monitors
E2E estimation of distance, congestion and failures
(diagram: network plane with clusters A, B, C; clients, monitors, SCAN edge servers)
-
Dynamic Provisioning
Dynamic replica placement:
Meet clients' latency and servers' capacity constraints
Close-to-minimal # of replicas
Self-organize replicas into an application-level multicast tree:
Small delay and bandwidth consumption for update multicast
Each node only maintains state for its parent and direct children
Evaluated based on simulation of:
Synthetic traces with various sensitivity analyses
Real traces from NASA and MSNBC
Publications: IPTPS 2002, Pervasive Computing 2002
-
Effects of the Non-Uniform Size of URLs
Replication cost constraint: bytes
Similar trends exist:
Per-URL replication dramatically outperforms per-Website replication
Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective
-
Diagram of Internet Iso-bar (end hosts and landmarks)
-
Diagram of Internet Iso-bar (clusters A, B, C; end hosts, monitors, landmarks)
-
Real Internet Measurement Data
NLANR Active Measurement Project data set:
119 sites in the US (106 after filtering out mostly-offline sites)
Round-trip time (RTT) between every pair of hosts every minute
Raw data: 6/24/00 - 12/3/01
Keynote measurement data:
Measures TCP performance from about 100 agents
Heterogeneous core network: various ISPs
Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
Targets:
Web site perspective: 40 most popular Web servers
27 Internet Data Centers (IDCs)
-
Related Work
Internet content delivery systems:
Web caching: client-initiated; server-initiated
Pull-based Content Delivery Networks (CDNs)
Push-based CDNs
Update dissemination:
IP multicast
Application-level multicast
Network E2E distance monitoring systems
-
Web Proxy Caching (diagram: client, ISP 1, ISP 2)
-
Pull-based CDN (diagram: client, ISP 1, ISP 2)
-
Push-based CDN (diagram: client, ISP 1, ISP 2)
-
Internet Content Delivery Systems

Properties (columns): Web caching (client initiated) / Web caching (server initiated) / Pull-based CDNs (Akamai) / Push-based CDNs / SCAN

Efficiency (# of caches or replicas): no cache sharing among proxies / cache sharing / no replica sharing among edge servers / replica sharing / replica sharing
Scalability for request redirection: pre-configured in browser / use Bloom filters to exchange replica locations / centralized CDN name server / centralized CDN name server / decentralized P2P location
Network-awareness: no / no / yes, unscalable monitoring system / no / yes, scalable monitoring system
Coherence support: no / no / yes / no / yes
-
Previous Work: Update Dissemination
No inter-domain IP multicast
Application-level multicast (ALM) is unscalable:
Root maintains state for all children (Narada, Overcast, ALMI, RMX)
Root handles all join requests (Bayeux)
Root splitting is a common solution, but suffers consistency overhead
-
Design Principles
Scalability:
No centralized point of control: P2P location services (Tapestry)
Reduce management state: minimize the # of replicas, object clustering
Distributed load balancing: capacity constraints
Adaptation to client dynamics:
Dynamic distribution/deletion of replicas with regard to clients' QoS constraints
Incremental clustering
Network-awareness and fault-tolerance (WNMMS):
Distance estimation: Internet Iso-bar
Anomaly detection and diagnostics
-
Comparison of Content Delivery Systems (cont'd)

Properties (columns): Web caching (client initiated) / Web caching (server initiated) / Pull-based CDNs (Akamai) / Push-based CDNs / SCAN

Distributed load balancing: no / yes / yes / no / yes
Dynamic replica placement: yes / yes / yes / no / yes
Network-awareness: no / no / yes, unscalable monitoring system / no / yes, scalable monitoring system
No global network topology assumption: yes / yes / yes / no / yes
-
Network-awareness (cont'd)
Loss/congestion prediction: maximize true positives and minimize false positives
Orthogonal loss/congestion path discovery without the underlying topology
How stable is such orthogonality?
Degradation of orthogonality over time
Reactive and proactive adaptation for SCAN
# of Internet users: http://www.usabilitynews.com/news/article637.asp (will reach one billion by 2005). Total Internet traffic: http://www.cs.columbia.edu/~hgs/internet/traffic.html
Efficiency: design systems with growth potential
Amazing growth in WWW traffic: daily growth of roughly 7M Web pages; annual growth of 200% predicted for the next 4 years.
1M page growth: Scientific American, June 1999 issue.
7M page growth rate: http://cyberatlas.internet.com/big_picture/traffic_patterns/article/0,,5931_413691,00.html (4 billion pages in total).
The convergence in the digital world of voice, data and video is expected to lead to a compound annual growth rate of 200% in Web traffic over the next four years -- http://www.skybridgesatellite.com/l21_mark/cont_22.htm
Define the term "replica".
There are large and popular Web servers, such as CNN.com and msnbc.com, which need to improve performance and scalability. Let's imagine there is a Web server in Cambridge, London, and one client at Northwestern tries to access some content. Later on, a nearby client from U of Chicago tries to access similar content and is served directly from the CDN server.
So the CDN reduces latency for the client, reduces bandwidth for Web servers, and improves the scalability and availability of the Web content server. It also helps the Internet as a whole by reducing long-haul traffic. Inefficient replication has two effects: 1. it wastes a lot of replication bandwidth and, consequently, update bandwidth; 2. the working set includes all the Web objects: they are cached in, but constantly replaced before serving more clients.
Questions on consistent hashing: it is another type of hash table for the directory scheme, with high probability of query success. It mainly supports a fixed # of replicas for each URL and is not very flexible for hot URLs; one has to continuously change the hash functions. Secondly, it doesn't really record the locations of replicas, so it can't update them when changes occur.
CDN applications: not addressed in the thesis.
SCAN puts objects into clusters. In each cluster, the objects are likely to be accessed by clients that are topologically close, for example, the red and yellow URLs. Given millions of URLs, clustering-based replication can dramatically reduce the amount of replica location state, so we can afford to build replica directories. Further, SCAN pushes the cluster replicas to certain strategic locations, then uses the replica directories to forward clients' requests to these replicas. We call this cooperative push, which can significantly reduce the # of replicas and, consequently, the update cost.
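The replica-directory idea described in the notes can be sketched in a few lines: URLs map to clusters, the directory tracks replica sites per cluster rather than per URL, and requests are redirected to the closest replica. This is a minimal illustrative sketch; the URL names, cluster IDs, server names, and distances are all assumptions, not data from the thesis.

```python
# Hedged sketch: replica location state kept per cluster, not per URL,
# shrinking the directory from O(#URLs) entries to O(#clusters).
cluster_of = {"/news/a": "c1", "/news/b": "c1", "/video/x": "c2"}
replicas_of = {"c1": ["server-east", "server-west"],
               "c2": ["server-west"]}

def lookup(url, dist_to):
    """Redirect a request to the closest replica of the URL's cluster.
    dist_to: measured distance from the client to each server."""
    sites = replicas_of[cluster_of[url]]
    return min(sites, key=dist_to.get)

# a client whose measured distances favor the east-coast server
dist = {"server-east": 10, "server-west": 80}
print(lookup("/news/a", dist))  # server-east
```

Because client 2's request consults the directory, it is served by the closest existing replica instead of triggering a fresh pull, which is the cooperative-push behavior the notes contrast with uncooperative CDNs.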
Define the term "replica".
M client groups and N servers: O(M+N) instead of O(M*N).
SCAN has full control of the replication cost and can also improve the caching effect.
Define the term "replica".
How do we evaluate an architecture/algorithms designed for Internet-scale services? We can't deploy them on thousands of nodes to test out the scalability or performance.
One way is analytical: back-of-the-envelope calculation. The other is realistic simulation: use real-world traces and measurements to understand Internet behavior, and apply them to evaluate the algorithms/architecture.
In my thesis, I developed wide collaboration with industry and research labs to obtain these real-world traces and measurements. They include the network topology, Web workload, and end-to-end network distance measurements.
Throughout our study, we use measurement-based simulation and analysis. The method has two parts: topology and workload. MSNBC is consistently ranked among the top news sites, with highly popular and dynamic content. In comparison, the NASA '95 traces are more static and much less accessed. We use various network topologies and Web workloads for simulation. All the results we got are based on this methodology. The traces are applied throughout our experiments, except for the online incremental clustering, which requires the full Web contents.
For MSNBC, 10K client groups are left; we choose the top 10%, covering >70% of requests.
Emphasize here that it is an abstract cost: it could be latency or # of hops, depending on the simulation. We collect several months of data to study not only the performance, but also the stability of that performance.
First, given a certain replication bandwidth cost constraint, choosing the optimal replica locations is an NP-complete problem. The greedy algorithm has proved efficient and can achieve close-to-optimal performance.
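The greedy placement idea can be sketched as follows: repeatedly add the replica site that most reduces the total request-weighted access cost, until the replica budget is exhausted. This is a minimal sketch under stated assumptions, not the thesis implementation; the request counts and distance matrix are illustrative, and "cost" stands in for the abstract cost (latency or hops) mentioned above.

```python
def greedy_placement(requests, dist, budget):
    """Greedy replica placement sketch.
    requests[c]: # of requests from client group c
    dist[c][s]:  cost from client group c to candidate site s
    budget:      max number of replica sites to choose"""
    sites = range(len(dist[0]))
    chosen = []
    # best cost seen so far per client group (infinite before any replica)
    best = [float("inf")] * len(requests)
    for _ in range(budget):
        # total weighted cost if site s were added next
        def cost_with(s):
            return sum(r * min(b, dist[c][s])
                       for c, (r, b) in enumerate(zip(requests, best)))
        s = min((s for s in sites if s not in chosen), key=cost_with)
        chosen.append(s)
        best = [min(b, dist[c][s]) for c, b in enumerate(best)]
    return chosen

requests = [100, 10, 50]          # three client groups
dist = [[1, 5, 9],
        [4, 1, 7],
        [8, 6, 2]]                # three candidate sites
print(greedy_placement(requests, dist, 2))  # [0, 2]
```

Each iteration is O(#sites x #groups), so the whole run is cheap compared with exhaustively searching the NP-complete placement space.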
Previous work used per-Web-site replication. By looking at a finer granularity, we found that replication on a per-URL basis can significantly reduce the clients' average latency. However, it suffers a big management overhead. As a solution, we propose to cluster the URLs and replicate in units of clusters.
At what granularity? Per Web site or per URL? Here "URL" refers to the Web object that the URL address points to. The tradeoff is replication performance, in terms of users' access latency, vs. the management overhead.
Of course, we don't have to replicate all Web content, especially the particularly dynamic pieces. According to Anja Feldmann's Infocom '99 paper, about 40% of requests are dynamically generated. But we can still push the remaining 60%.
Give some numbers for the scalability problem.
Pushing replicas with greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01], but it is expensive for replica location management.
Remove the citations for greedy algorithms.
Architecture: will compare uncooperative pull-based vs. cooperative push-based replication.
There are many ISPs, each … There are also CDN name servers and the Web content server, which hosts the Web content denoted by the green box. When client 1 has a request for the green URL, the hostname resolution has to go through the CDN name server. The CDN name server returns the IP address of the local CDN server of ISP 1. Then client 1 sends the request to its local CDN server, which essentially acts as a cache. The problem is that the CDN name servers don't track where the content has been replicated. So when client 2 requests the green URL, the CDN name server still replies with the IP address of the CDN server of ISP 2, although the content has been replicated in CDN server 1, which could be quite close to client 2.
4%: explain that the comparison is under similar average latency for the two schemes.
Then the problem is how to do cooperative push efficiently.
Define the term "replica".
IDCs charge CDNs by the amount of bandwidth used, so it is translated into the bytes replicated. Here we simplify it as the # of URL replicas, as we assume the URLs are of the same size. As we will show later, the non-uniform sizes don't really affect the results.
Talk about using greedy algorithms for the per-Web-site and per-URL schemes. For the per-URL scheme, it takes 102 hours to come up with 10 replicas/URL for the MSNBC traces on a PII-400 machine.
We show that with K as only 1-2% of M, we can achieve performance close to M. Clustering costs will be addressed separately, since they vary between static clustering and incremental clustering. For the per-URL scheme, it takes 102 hours to come up with 10 replicas/URL for the MSNBC traces on a PII-400 machine.
We show that with K as only 1-2% of M, we can achieve performance close to M. Clustering costs will be addressed separately, since they vary between static clustering and incremental clustering.
Although the greedy algorithm has proved to be the most cost-effective one for solving the NP-complete replica location problem, for the per-URL scheme it takes 102 hours to come up with 10 replicas/URL for the MSNBC traces on a PII-400 machine.
We show that with K as only 1-2% of M, we can achieve performance close to M. Clustering costs will be addressed separately, since they vary between static clustering and incremental clustering. Semantics: contents in the same directory do not necessarily have good correlation in terms of user access.
We use the K-split algorithm by Gonzalez, with cost O(NK). Intuition: each vector uniquely represents a URL in the high-dimensional space. Vector similarity is the cosine of the angle between two vectors, regardless of their lengths. In other words, it only characterizes the relative ratio of the # of accesses from each client cluster, rather than the absolute values.
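The K-split idea above can be sketched with cosine similarity and Gonzalez's farthest-point heuristic: repeatedly make the vector least similar to any existing center a new center, then assign every vector to its most similar center. This is an illustrative sketch, not the thesis code; the access vectors below are made-up examples of per-client-cluster request counts.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two access vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def k_split(vectors, k):
    """Gonzalez farthest-point clustering, O(N*K):
    grow k centers, then label each vector with its closest center."""
    centers = [0]                       # start from an arbitrary vector
    while len(centers) < k:
        # next center = vector with the lowest similarity to all centers
        far = min(range(len(vectors)),
                  key=lambda i: max(cosine(vectors[i], vectors[c])
                                    for c in centers))
        centers.append(far)
    return [max(centers, key=lambda c: cosine(vectors[i], vectors[c]))
            for i in range(len(vectors))]

# two URLs accessed mostly by client cluster 0, two by client cluster 1
vecs = [[9, 1], [8, 2], [1, 9], [2, 8]]
print(k_split(vecs, 2))  # [0, 0, 2, 2]
```

Because cosine similarity ignores vector length, `[9, 1]` and `[90, 10]` land in the same cluster, matching the note that only the relative access ratios matter.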
Mention semantics-based clustering. Define the term "replica".
Here the spatial access vector essentially captures the aggregated access pattern for that URL. Temporal clustering tries to characterize how frequently two URLs are accessed together in the same session, normalized by the total # of accesses.
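The temporal correlation just described can be sketched as co-occurrence counting over sessions. The exact normalization in the thesis is not spelled out here, so this sketch assumes "co-occurrences divided by the sum of each URL's accesses"; the session data is illustrative.

```python
def temporal_correlation(url_a, url_b, sessions):
    """How often two URLs are accessed in the same session,
    normalized by their total accesses (assumed normalization).
    sessions: list of sets of URLs accessed together."""
    co = sum(1 for s in sessions if url_a in s and url_b in s)
    total_a = sum(1 for s in sessions if url_a in s)
    total_b = sum(1 for s in sessions if url_b in s)
    if total_a + total_b == 0:
        return 0.0
    return co / (total_a + total_b)

sessions = [{"/a", "/b"}, {"/a", "/b", "/c"}, {"/c"}, {"/a"}]
print(temporal_correlation("/a", "/b", sessions))  # 2 / (3 + 2) = 0.4
```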
Popularity-based clustering uses the pure access # (no matter where the accesses are from) for clustering. It performs very well, which may imply that the popular URLs are globally popular. Tested over various topologies and traces.
Popularity-based clustering also performs very well -> implies that popular URLs are globally popular, and URLs with similar popularity may have similar aggregated access patterns. We'll use this conjecture in our incremental clustering later.
1-2% of URLs, say 10 clusters
Architecture: will compare uncooperative pull-based vs. cooperative push-based replication.
The major reason for dynamic clustering is the emergence of new content, which needs to be appropriately clustered and replicated, especially when there is no access history. The goal is twofold: 1. minimum perturbation to the existing clusters; 2. cluster and replicate to reduce the retrieval cost.
After the framework: 2 options and different implementations.
Replicate the URLs before they are accessed. The challenge is how to predict the access patterns from semantic info only. From the previous study, we know that popularity-based clustering has very good performance, i.e., URLs with similar popularity have similar aggregated access patterns. How do we find groups of URLs with similar popularity, so that we can infer a new URL's popularity from the popularity of old URLs in the same group? We explored two simple options using hyperlink structures: one is groups of siblings, the other is groups at the same hyperlink depth. To compare the two methods, we measure the divergence of … The result is that groups of siblings have the best correlation.
Then talk about the graph!
We carried out the following experiments. First, we use popularity-based clustering and replication to distribute the URLs in the 8-10am 2-hour traces. Then, at 10am, 16 new URLs were created and needed to be replicated. We use the online incremental clustering algorithms above to cluster and replicate them.
Then we observe their performance over the next 2 hours and compare with some other schemes. First, compared with static clustering and no replication of the 16 URLs, our scheme reduces the average retrieval cost to only 1/8. Second, compared with random replication of the 16 URLs using the same # of replicas, our scheme reduces the average retrieval cost to about 20%. Finally, we compared with the oracle case, which assumes knowledge of the next 2 hours of accesses in advance and uses expensive static clustering and replication; our scheme doubles that cost. But that is the oracle case, which can never be achieved.
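The sibling-based prediction used for new URLs with no access history can be sketched as follows: a new URL inherits the average popularity of existing URLs that share its hyperlink parent. This is a minimal sketch; the URL paths, parent map, averaging rule, and counts are illustrative assumptions, not data from the traces.

```python
def predict_popularity(new_url, parent_of, access_counts):
    """Estimate a new URL's popularity from its hyperlink siblings
    (URLs with the same parent page) that already have access history."""
    parent = parent_of[new_url]
    siblings = [u for u, p in parent_of.items()
                if p == parent and u != new_url and u in access_counts]
    if not siblings:
        return 0.0                     # no history to infer from
    return sum(access_counts[u] for u in siblings) / len(siblings)

parent_of = {"/news/a": "/news", "/news/b": "/news",
             "/news/new": "/news", "/sports/x": "/sports"}
access_counts = {"/news/a": 120, "/news/b": 80, "/sports/x": 10}
print(predict_popularity("/news/new", parent_of, access_counts))  # 100.0
```

With the predicted popularity in hand, the new URL can be placed into the matching popularity-based cluster and pushed to that cluster's replicas before any client requests it.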
1/8 and 1/5 compared with no replication and random replication; double the optimal.
Change the term for "static"!!! Change it to two graphs.
Compared with per-URL based replication, …
Architecture: will compare uncooperative pull-based vs. cooperative push-based replication.
Mention the wide collaboration in the beginning.
Define the term "replica".
M client groups and N servers: O(M+N) instead of O(M*N).
For MSNBC, 10K client groups are left; we choose the top 10%, covering >70% of requests.
Emphasize here that it is an abstract cost: it could be latency or # of hops, depending on the simulation.
Every line starts with "How to".
World Cup log peak time: 209K req/min, 580MB transferred per minute.
MSNBC SIGCOMM 2000 paper: numbers for the order of # of URLs and clients.
The CDN operators set up the latency constraints, based on:
Different classes of clients
Human perception of latency
What if there is no Tapestry? With a DHT, from the operator's point of view, we divide the problems; the other problem is solved.
Why choose Tapestry?
Introduce replicas, caches, etc.
Updates: Tapestry, who is involved and the functionalities.
Introduce the dissemination tree and soft-state maintenance.
Ongoing: better clustering methods, dynamic service model.
One reason is that the sizes don't differ much for the top 1000 URLs (from several hundred bytes to tens of thousands of bytes).
Another iteration.
The scalability issue with summary cache is that they target O(100) proxies and O(1M) pages, which is not as good as we need. The key problem is that proxy servers are usually installed by ISPs, which may not accept the cooperative model.
Distributed load balancing is implemented through request redirection mechanisms, such as server-initiated Web caching, pull-based CDNs, and SCAN.
Orthogonality(AC, BC) = 1 - co-occurrences of loss(AC, BC) / (loss occurrences(AC) + loss occurrences(BC))
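The orthogonality formula above can be computed directly from per-interval loss observations on two overlay paths: paths are orthogonal (value near 1) when their losses rarely co-occur. A minimal sketch; the loss sequences are illustrative assumptions.

```python
def orthogonality(loss_a, loss_b):
    """Orthogonality(AC, BC) = 1 - co-occurrences / (losses_AC + losses_BC).
    loss_a, loss_b: per-interval 0/1 loss indicators for the two paths."""
    co = sum(1 for a, b in zip(loss_a, loss_b) if a and b)
    total = sum(loss_a) + sum(loss_b)
    if total == 0:
        return 1.0                     # no losses at all: trivially orthogonal
    return 1 - co / total

a = [1, 0, 1, 0, 0, 1]   # path AC loses in intervals 0, 2, 5
b = [0, 0, 1, 1, 0, 0]   # path BC loses in intervals 2, 3
print(orthogonality(a, b))  # 1 - 1/5 = 0.8
```

Tracking this value over successive windows is one way to observe the degradation of orthogonality over time that the slides raise as a stability question.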