Efficient and Adaptive Replication using Content Clustering
Yan Chen, EECS Department, UC Berkeley


TRANSCRIPT

  • Efficient and Adaptive Replication using Content Clustering
    Yan Chen, EECS Department, UC Berkeley

  • Motivation
    The Internet has evolved into a commercial infrastructure for service delivery: Web delivery, VoIP, streaming media, ...
    Challenges for Internet-scale services:
    - Scalability: 600M users, 35M Web sites, 28 Tb/s
    - Efficiency: bandwidth, storage, management
    - Agility: dynamic clients/network/servers
    - Security, etc.
    Focus on content delivery via Content Distribution Networks (CDNs): 4 billion Web pages in total, daily growth of 7M pages, projected annual growth of 200% for the next 4 years

  • CDN and its Challenges

  • CDN and its Challenges
    - Inefficient replication
    - No coherence for dynamic content
    - Unscalable network monitoring: O(M*N)

  • SCAN: Scalable Content Access Network
    CDN applications (e.g., streaming media)
    - Provision: cooperative clustering-based replication; user behavior/workload monitoring
    - Coherence: update multicast tree construction
    - Network performance monitoring: network distance/congestion/failure estimation
    (red: my work; black: out of scope)

  • SCAN
    - Coherence for dynamic content
    - Cooperative clustering-based replication (diagram: replicas s1, s4, s5)

  • SCAN
    - Scalable network monitoring: O(M+N)
    - Cooperative clustering-based replication (diagram: replicas s1, s4, s5)
    - Coherence for dynamic content

  • Internet-scale Simulation
    Network topology:
    - Pure-random, Waxman & transit-stub synthetic topologies
    - An AS-level topology from 7 widely-dispersed BGP peers
    Web workload:
    - Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
    - Aggregate NASA Web clients by domain names
    - Map the client groups onto the topology

  • Internet-scale Simulation: E2E Measurement
    NLANR Active Measurement Project data set:
    - 111 sites in America, Asia, Australia and Europe
    - Round-trip time (RTT) between every pair of hosts every minute; 17M daily measurements
    - Raw data: Jun.–Dec. 2001, Nov. 2002
    Keynote measurement data:
    - Measures TCP performance from about 100 worldwide agents
    - Heterogeneous core network: various ISPs
    - Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
    - Targets: 40 most popular Web servers + 27 Internet Data Centers
    - Raw data: Nov.–Dec. 2001, Mar.–May 2002

  • Clustering Web Content for Efficient Replication

  • Overview
    - CDNs use non-cooperative replication, which is inefficient
    - Paradigm shift: cooperative push
    - Where to push: greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01]
    - But what content should be pushed, and at what granularity?
    - Clustering of objects for replication: close-to-optimal performance with small overhead
    - Incremental clustering: push before content is accessed, improving availability during flash crowds

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Conventional CDN: Non-cooperative Pull
    (diagram: Client 1, Web content server, ISP 1, ISP 2)
    Inefficient replication

  • SCAN: Cooperative Push
    (diagram: CDN name server, Client 1, ISP 1, ISP 2)
    Significantly reduces the # of replicas and the update cost

  • Comparison between Conventional CDNs and SCAN

  • Problem Formulation
    Find a scalable, adaptive replication strategy that reduces:
    - Clients' average retrieval cost
    - Replica location computation cost
    - Amount of replica directory state to maintain
    subject to a total replication cost constraint (e.g., # of URL replicas)
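    The "where to push" side of this formulation is the greedy placement the overview slide cites [JJKRS01, QPV01]: repeatedly place the replica that most reduces the clients' total retrieval cost until the budget is spent. A minimal sketch, assuming a toy cost model where each client group fetches from its nearest replica (all names and data shapes here are illustrative, not the talk's implementation):

```python
def greedy_placement(dist, clients, candidates, budget):
    """dist[c][s]: retrieval cost from client group c to candidate server s."""
    placed = set()

    def total_cost(replicas):
        # Each client group fetches from its nearest replica.
        return sum(min(dist[c][s] for s in replicas) for c in clients)

    for _ in range(budget):
        # Place the candidate that yields the lowest total cost so far.
        best = min(
            (s for s in candidates if s not in placed),
            key=lambda s: total_cost(placed | {s}),
        )
        placed.add(best)
    return placed

# Toy example: 3 client groups, 3 candidate servers, budget of 2 replicas.
dist = {
    "c1": {"s1": 1, "s2": 5, "s3": 9},
    "c2": {"s1": 6, "s2": 1, "s3": 9},
    "c3": {"s1": 9, "s2": 8, "s3": 1},
}
print(greedy_placement(dist, ["c1", "c2", "c3"], ["s1", "s2", "s3"], 2))
```

    Each iteration is a full cost evaluation per remaining candidate, which is why per-URL granularity becomes so expensive at scale.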

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Per Web site (diagram)

  • Replica Placement: Per Website vs. Per URL
    - 60–70% average retrieval cost reduction for the per-URL scheme
    - But per-URL replication is too expensive to manage!

    [Charts: average retrieval cost vs. average number of replicas per URL, replicate per Website vs. per URL (MSNBC 8/1 trace, 1000 objects, 1000 subnets; NASA 7/1 trace); stability analysis of MSNBC traces: average retrieval cost by date (8/3–10/1) for static clustering with old replication, static clustering with re-replication, re-clustering with re-replication (optimal), and offline incremental clustering]

  • Overhead Comparison
    where R = # of replicas per URL, M = # of URLs

    Replication Scheme | State to Maintain | Computation Cost
    Per Website        | O(R)              | O(R)
    Per URL            | O(R·M)            | O(R·M)

    Computing on average 10 replicas/URL for the top 1000 URLs takes several days on a normal server!

  • Overhead Comparison
    where R = # of replicas per URL, K = # of clusters, M = # of URLs (M >> K)

    Replication Scheme | State to Maintain | Computation Cost
    Per Website        | O(R)              | O(R)
    Per Cluster        | O(R·K + M)        | O(R·K)
    Per URL            | O(R·M)            | O(R·M)

  • Clustering Web Content
    General clustering framework:
    - Define the correlation distance between URLs
    - Cluster diameter: the max distance between any two members, i.e., the worst correlation within a cluster
    - Generic clustering: minimize the max diameter over all clusters
    Correlation distance definitions based on:
    - Spatial locality
    - Temporal locality
    - Popularity
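    The framework above can be sketched with a simple greedy rule: a URL joins the first cluster whose diameter (max pairwise correlation distance) stays within a limit, otherwise it starts a new cluster. The threshold-based policy and all names here are illustrative assumptions, not the talk's exact algorithm:

```python
def cluster_by_diameter(urls, distance, max_diameter):
    clusters = []
    for u in urls:
        for c in clusters:
            # Joining c must keep every pairwise distance within the diameter bound.
            if all(distance(u, v) <= max_diameter for v in c):
                c.append(u)
                break
        else:
            clusters.append([u])  # no cluster can absorb u: start a new one
    return clusters

# Toy example: 1-D "access vectors", absolute difference as the distance.
points = [0.0, 0.1, 0.9, 1.0, 0.05]
clusters = cluster_by_diameter(points, lambda a, b: abs(a - b), 0.2)
print(clusters)  # two clusters: one near 0, one near 1
```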

  • Spatial Clustering
    Correlation distance between two URLs defined as:
    - Euclidean distance
    - Vector (cosine) similarity
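    The two distances named on the slide, sketched over URL "access vectors" (accesses per client group). These are the standard definitions; whether the talk normalizes the vectors first is an assumption:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

u1 = [10, 0, 5]   # accesses to URL 1 from three client groups
u2 = [20, 0, 10]  # same access pattern, twice the volume

print(euclidean(u1, u2))          # sensitive to magnitude
print(cosine_similarity(u1, u2))  # 1.0: identical direction
```

    The contrast explains why the two metrics cluster differently: Euclidean distance separates URLs by traffic volume, while cosine similarity groups URLs whose accesses come from the same client populations regardless of volume.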

  • Clustering Web Content (cont'd)
    Popularity-based clustering:
    - Or, even simpler, sort the URLs by popularity and put the first N/K elements into the first cluster, etc. (binary correlation)
    Temporal clustering:
    - Divide traces into multiple individual access sessions [ABQ01]
    - Measure URL correlations within each session
    - Average over the multiple sessions in one day
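    The "even simpler" popularity scheme above can be sketched in a few lines: rank URLs by access frequency and cut the ranked list into K equal-size clusters. Variable names are illustrative:

```python
def popularity_clusters(freq, k):
    """freq: URL -> access count. Returns k clusters of N/k URLs each."""
    ranked = sorted(freq, key=freq.get, reverse=True)  # most popular first
    size = len(ranked) // k
    return [ranked[i * size:(i + 1) * size] for i in range(k)]

freq = {"/a": 900, "/b": 850, "/c": 40, "/d": 35, "/e": 2, "/f": 1}
print(popularity_clusters(freq, 3))
```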

  • Performance of Cluster-based Replication
    - Spatial clustering with Euclidean distance and popularity-based clustering perform the best
    - A small # of clusters (only 1–2% of the # of URLs) achieves close to per-URL performance with much less overhead
    (MSNBC, 8/2/1999, 5 replicas/URL)

    [Charts: average retrieval cost vs. number of clusters for spatial clustering (Euclidean distance, cosine similarity), temporal clustering, and popularity-based clustering, MSNBC above and NASA below; computational cost (hours) vs. number of clusters]

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Static Clustering and Replication
    - Two daily traces: a training trace and a new trace
    - Static clustering performs poorly beyond a week

    [Charts: average retrieval cost on new traces (8/3–10/1) for static clustering 1, static clustering 2, and re-clustering with re-replication (optimal); offline incremental clustering (step 1 and complete) against the optimal; replication cost normalized by that of the optimal case]

  • Incremental Clustering
    Generic framework:
    - If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
    - Else create new clusters and replicate them
    Two types of incremental clustering:
    - Online (without any access logs): high availability
    - Offline (with access logs): close-to-optimal performance
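    The generic framework above can be sketched as follows. The match predicate and the cluster data shape are assumptions; the key point is that a matching URL inherits the cluster's existing replica set, while a non-matching one seeds a fresh cluster:

```python
def incremental_add(url, clusters, matches):
    """clusters: list of dicts {'urls': [...], 'replicas': set(...)}."""
    for c in clusters:
        if matches(url, c):
            c["urls"].append(url)   # reuse the cluster...
            return c["replicas"]    # ...and push the URL to its replicas
    fresh = {"urls": [url], "replicas": set()}
    clusters.append(fresh)          # otherwise start a new cluster
    return fresh["replicas"]

# Toy match rule: same top-level path segment (an illustrative stand-in
# for the real correlation-distance test).
clusters = [{"urls": ["/news/1"], "replicas": {"s1", "s4"}}]
matches = lambda url, c: url.split("/")[1] == c["urls"][0].split("/")[1]

print(incremental_add("/news/2", clusters, matches))    # inherits {s1, s4}
incremental_add("/sports/1", clusters, matches)
print(len(clusters))                                    # a new cluster appeared
```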

  • Online Incremental Clustering
    - Predict access patterns based on semantics; simplify to popularity prediction
    - Groups of URLs with similar popularity? Use hyperlink structures!
    - Candidate groups: siblings, and URLs at the same hyperlink depth (smallest # of links from the root)

  • Online Popularity Prediction
    Experiments:
    - Crawl http://www.msnbc.com on 5/3/2002 to hyperlink depth 4, then group the URLs
    - Use the corresponding access logs to analyze the correlation
    - Measure the divergence of URL popularity within a group
    - Groups of siblings have the best correlation
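    One plausible way to "measure the divergence of URL popularity within a group" is the coefficient of variation of the members' access counts; the slide does not give the exact metric, so this formula and the numbers are assumptions for illustration:

```python
import statistics

def popularity_divergence(access_counts):
    """Coefficient of variation: population stddev over mean (0 if no accesses)."""
    mean = statistics.mean(access_counts)
    return statistics.pstdev(access_counts) / mean if mean else 0.0

siblings = [120, 100, 110]    # a sibling group: similar popularity
same_depth = [1000, 12, 300]  # a same-depth group: widely divergent

# The sibling group should show much lower divergence.
print(popularity_divergence(siblings) < popularity_divergence(same_depth))
```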

  • Semantics-based Incremental Clustering
    - Put a new URL into the existing cluster with the largest # of siblings; in case of a tie, choose the cluster with more replicas
    Simulation on the 5/3/2002 MSNBC trace:
    - 8–10am trace: static popularity clustering + replication
    - At 10am: 16 new URLs, handled by online incremental clustering + replication
    - Evaluation with the 10am–12pm trace: the 16 URLs received 33K requests
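    The sibling rule above reduces to a two-level ranking: sibling overlap first, replica count as the tie-breaker. A minimal sketch, with data shapes assumed for illustration:

```python
def place_new_url(siblings, clusters):
    """Pick the cluster holding the most siblings; break ties by replica count.

    siblings: set of sibling URLs of the new URL.
    clusters: list of dicts {'urls': set(...), 'replicas': int}.
    """
    return max(
        clusters,
        key=lambda c: (len(siblings & c["urls"]), c["replicas"]),
    )

clusters = [
    {"urls": {"/news/1", "/news/2"}, "replicas": 3},
    {"urls": {"/sports/1"}, "replicas": 7},
]
chosen = place_new_url({"/news/1", "/news/2", "/news/9"}, clusters)
print(chosen["replicas"])  # the /news cluster wins: it holds two siblings
```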

  • Online Incremental Clustering and Replication Results
    - Retrieval cost is 1/8 of that with no replication of new URLs, and 1/5 of that with random replication

    [Chart: average retrieval cost for no replication of new URLs (457), random replication of new URLs (259), and online incremental clustering & replication (56)]

  • Online Incremental Clustering and Replication Results
    - Double the optimal retrieval cost, but only 4% of its replication cost

    [Chart: average retrieval cost for no replication of new URLs (457), random replication (259), online incremental clustering & replication (56), and complete re-clustering & re-replication, the oracle (26.2)]

  • Conclusions
    Cooperative, clustering-based replication:
    - Cooperative push: only 4–5% of the replication/update cost of existing CDNs
    - URL clustering reduces the management/computational overhead by two orders of magnitude; spatial clustering and popularity-based clustering recommended
    - Incremental clustering adapts to emerging URLs; hyperlink-based online incremental clustering gives high availability and performance improvement
    - Replicas self-organize into an application-level multicast tree for update dissemination
    - Scalable overlay network monitoring: O(M+N) instead of O(M*N), given M client groups and N servers

  • Outline
    - Architecture
    - Problem formulation
    - Granularity of replication
    - Incremental clustering and replication
    - Conclusions
    - Future research

  • Future Research (I)
    Measurement-based Internet study and protocol/architecture design:
    - Use inference techniques to develop Internet behavior models (network operators are reluctant to reveal internal network configurations)
    - Root cause analysis: mining large, heterogeneous data sets; leverage graphics/visualization for interactive mining
    - Apply the deeper understanding of Internet behavior to reassess and redesign protocols/architecture
    - E.g., are Internet bottlenecks at peering links? How and why? Implications?

  • Future Research (II)
    Network traffic anomaly characterization, identification and detection:
    - Many unknown flow-level anomalies revealed by real router traffic analysis (AT&T)
    - Profile traffic patterns of new applications (e.g., P2P) to separate out benign anomalies
    - Understand the cause, pattern and prevalence of other unknown anomalies
    - Identify malicious patterns for intrusion detection, e.g., fighting the Sapphire/Slammer worm

  • Backup Materials

  • SCAN
    - Coherence for dynamic content
    - Cooperative clustering-based replication
    - Scalable network monitoring: O(M+N) (diagram: replicas s1, s4, s5)

  • Problem Formulation
    Find a scalable, adaptive replication strategy that reduces the average access cost, subject to a total replication cost constraint (e.g., # of URL replicas)

  • Simulation Methodology
    Network topology:
    - Pure-random, Waxman & transit-stub synthetic topologies
    - An AS-level topology from 7 widely-dispersed BGP peers
    Web workload:
    - Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
    - Aggregate NASA Web clients by domain names
    - Map the client groups onto the topology

  • Online Incremental Clustering
    - Predict access patterns based on semantics; simplify to popularity prediction
    - Groups of URLs with similar popularity? Use hyperlink structures!
    - Candidate groups: siblings, and URLs at the same hyperlink depth (smallest # of links from the root)

  • Challenges for CDN
    Over-provisioning for replication:
    - Provide good QoS to clients (e.g., latency bound, coherence)
    - Small # of replicas, with small delay and bandwidth consumption for updates
    Replica management:
    - Scalability: billions of replicas if replicating per URL; O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks
    - Adaptation to the dynamics of content providers and customers
    Monitoring:
    - User workload monitoring
    - End-to-end network distance/congestion/failure monitoring
    - Measurement scalability; inference accuracy and stability

  • SCAN Architecture
    Leverages Decentralized Object Location and Routing (DOLR), i.e., Tapestry, for:
    - Distributed, scalable location with guaranteed success
    - Search with locality
    - Soft-state maintenance of the dissemination tree (for each object)
    (diagram: data plane and network plane; data source, Web server, SCAN server; request location; dynamic replication/update and content management)

  • Wide-area Network Measurement and Monitoring System (WNMMS)
    - Select a subset of SCAN servers to be monitors
    - E2E estimation of distance, congestion, and failures on the network plane
    (diagram: clusters A, B, C; clients, monitors, SCAN edge servers)

  • Dynamic Provisioning
    Dynamic replica placement:
    - Meets clients' latency and servers' capacity constraints
    - Close-to-minimal # of replicas
    Replicas self-organize into an application-level multicast tree:
    - Small delay and bandwidth consumption for update multicast
    - Each node only maintains state for its parent & direct children
    Evaluated via simulation of:
    - Synthetic traces with various sensitivity analyses
    - Real traces from NASA and MSNBC
    Publications: IPTPS 2002, Pervasive Computing 2002
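    The per-node state claim above can be sketched as follows: each replica in the update multicast tree tracks only its parent and direct children, so dissemination proceeds hop by hop with O(children) state per node rather than any global view. Class and method names are assumed for illustration:

```python
class ReplicaNode:
    def __init__(self, name):
        self.name = name
        self.parent = None    # single parent link
        self.children = []    # direct children only: no global tree state

    def attach(self, child):
        child.parent = self
        self.children.append(child)

    def disseminate(self, update):
        # Forward an update down the tree; each hop consults local state only.
        # Returns the names of nodes reached (payload delivery elided).
        reached = [self.name]
        for c in self.children:
            reached += c.disseminate(update)
        return reached

root = ReplicaNode("s1")
for name in ("s4", "s5"):
    root.attach(ReplicaNode(name))
print(root.disseminate("object v2"))  # ['s1', 's4', 's5']
```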

  • Effects of the Non-Uniform Size of URLs
    - Replication cost constraint: bytes
    - Similar trends exist
    - Per-URL replication outperforms per-Website replication dramatically
    - Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective

  • Diagram of Internet Iso-bar
    (diagram: end hosts and landmarks; clusters A, B, C, each with a monitor)

  • Real Internet Measurement Data
    NLANR Active Measurement Project data set:
    - 119 sites in the US (106 after filtering out mostly-offline sites)
    - Round-trip time (RTT) between every pair of hosts every minute
    - Raw data: 6/24/00–12/3/01
    Keynote measurement data:
    - Measures TCP performance from about 100 agents
    - Heterogeneous core network: various ISPs
    - Heterogeneous access network: dial-up 56K, DSL and high-bandwidth business connections
    - Targets: Web site perspective (40 most popular Web servers) and 27 Internet Data Centers (IDCs)

  • Related Work
    Internet content delivery systems:
    - Web caching: client-initiated, server-initiated
    - Pull-based Content Delivery Networks (CDNs)
    - Push-based CDNs
    Update dissemination:
    - IP multicast, application-level multicast
    Network E2E distance monitoring systems

  • Web Proxy Caching
    [Diagram: client with proxy caches at ISP 1 and ISP 2]

  • Pull-based CDN
    [Diagram: client with CDN edge servers at ISP 1 and ISP 2]

  • Push-based CDN
    [Diagram: client with CDN edge servers at ISP 1 and ISP 2]

  • Internet Content Delivery Systems

    Properties                           | Web caching (client-initiated) | Web caching (server-initiated)                  | Pull-based CDNs (Akamai)              | Push-based CDNs             | SCAN
    Scalability for request redirection  | Pre-configured in browser      | Use Bloom filters to exchange replica locations | Centralized CDN name server           | Centralized CDN name server | Decentralized P2P location
    Efficiency (# of caches or replicas) | No cache sharing among proxies | Cache sharing                                   | No replica sharing among edge servers | Replica sharing             | Replica sharing
    Network-awareness                    | No                             | No                                              | Yes, unscalable monitoring system     | No                          | Yes, scalable monitoring system
    Coherence support                    | No                             | No                                              | Yes                                   | No                          | Yes

  • Previous Work: Update Dissemination
    - No inter-domain IP multicast
    - Application-level multicast (ALM) is unscalable:
      - Root maintains states for all children (Narada, Overcast, ALMI, RMX)
      - Root handles all join requests (Bayeux)
      - Root split is a common solution, but suffers consistency overhead

  • Design Principles
    - Scalability
      - No centralized point of control: P2P location services, Tapestry
      - Reduce management states: minimize # of replicas, object clustering
      - Distributed load balancing: capacity constraints
    - Adaptation to client dynamics
      - Dynamic distribution/deletion of replicas with regard to clients' QoS constraints
      - Incremental clustering
    - Network-awareness and fault-tolerance (WNMMS)
      - Distance estimation: Internet Iso-bar
      - Anomaly detection and diagnostics

  • Comparison of Content Delivery Systems (cont'd)

    Properties                            | Web caching (client-initiated) | Web caching (server-initiated) | Pull-based CDNs (Akamai)          | Push-based CDNs | SCAN
    Distributed load balancing            | No                             | Yes                            | Yes                               | No              | Yes
    Dynamic replica placement             | Yes                            | Yes                            | Yes                               | No              | Yes
    Network-awareness                     | No                             | No                             | Yes, unscalable monitoring system | No              | Yes, scalable monitoring system
    No global network topology assumption | Yes                            | Yes                            | Yes                               | No              | Yes

  • Network-awareness (cont'd)
    - Loss/congestion prediction
      - Maximize true positives and minimize false positives
    - Orthogonal loss/congestion path discovery
      - Without knowing the underlying topology
    - How stable is such orthogonality?
      - Degradation of orthogonality over time
      - Reactive and proactive adaptation for SCAN

    # of Internet users: http://www.usabilitynews.com/news/article637.asp - will reach one billion by 2005. Total Internet traffic: http://www.cs.columbia.edu/~hgs/internet/traffic.html

    Efficiency: design systems with growth potential

    Amazing growth in WWW traffic: daily growth of roughly 7M Web pages; annual growth of 200% predicted for the next 4 years.

    1M page growth: Scientific American, June 1999 issue.

    7M page growth rate: http://cyberatlas.internet.com/big_picture/traffic_patterns/article/0,,5931_413691,00.html - 4 billion pages in total.

    The convergence in the digital world of voice, data and video is expected to lead to a compound annual growth rate of 200% of Web traffic over the next four years -- http://www.skybridgesatellite.com/l21_mark/cont_22.htm

    Define the term "replica". There are large and popular Web servers, such as CNN.com and msnbc.com, which need to improve performance and scalability. Let's imagine there is a Web server in Cambridge, London, and one client at Northwestern tries to access some content. Later on, a nearby client from U of Chicago tries to access similar content and is served directly from the CDN server.

    So the CDN reduces latency for the client, reduces bandwidth for Web servers, and improves the scalability and availability of the Web content server. It also helps the Internet as a whole by reducing long-haul traffic. Inefficient replication has two effects: 1. it wastes a lot of replication bandwidth and, consequently, update bandwidth; 2. the working set includes all the Web objects, which are cached in but constantly replaced before serving more clients.

    Questions on consistent caching: it is another type of hash table for the directory scheme, with a high probability of query success. It mainly supports a fixed # of replicas for each URL and is not very flexible for hot URLs; one has to continuously change the hash functions. Secondly, it doesn't really record the locations of replicas, so it can't update them when a change occurs. CDN applications are not addressed in the thesis. SCAN puts objects into clusters; in each cluster, the objects are likely to be accessed by clients that are topologically close - for example, the red and yellow URLs. Given millions of URLs, clustering-based replication can dramatically reduce the amount of replica location state, making it affordable to build replica directories. Further, SCAN pushes the cluster replicas to certain strategic locations, then uses replica directories to forward clients' requests to these replicas. We call this cooperative push; it significantly reduces the # of replicas and, consequently, the update cost.

    With M client groups and N servers: O(M+N) instead of O(M*N). SCAN has full control of the replication cost and can also improve the caching effect.

    How do we evaluate an architecture and algorithms designed for Internet-scale services? We can't deploy them on thousands of nodes to test scalability or performance.

    Two ways: analytical back-of-the-envelope calculation, and realistic simulation - using real-world traces and measurements to understand Internet behavior, then applying them to evaluate the algorithms and architecture.

    In my thesis, I developed wide collaborations with industry and research labs to obtain these real-world traces and measurements. They include the network topology, Web workload, and E2E network distance measurements.

    Throughout our study, we use measurement-based simulation and analysis. The method has two parts: topology and workload. MSNBC is consistently ranked among the top news sites, with highly popular and dynamic content; in comparison, the NASA '95 traces are more static and much less accessed. We use various network topologies and Web workloads for simulation. All our results are based on this methodology. The traces are applied throughout our experiments, except for the online incremental clustering, which requires the full Web contents.

    For MSNBC, 10K client groups are left; we choose the top 10%, covering >70% of requests.

    Emphasize here that this is an abstract cost; it could be latency or # of hops, depending on the simulation. We collect several months of data to study not only the performance, but also its stability.

    First, given a replication bandwidth cost constraint, choosing the optimal replica locations is an NP-complete problem. The greedy algorithm has been shown to be efficient and can achieve close-to-optimal performance.
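The greedy placement mentioned above can be sketched as follows: at each step, place a replica on the candidate server that most reduces the total access cost, where each client group is served by its nearest chosen replica. This is a minimal sketch with a made-up latency matrix, not the dissertation's code.

```python
def greedy_placement(cost, k):
    """Greedy replica placement.

    cost[c][s] is the access cost from client group c to candidate
    server s.  Repeats k times: pick the unchosen server that minimizes
    the total cost over all client groups, each group being served by
    its cheapest already-chosen replica."""
    clients = list(cost)
    servers = {s for row in cost.values() for s in row}
    chosen = []
    best = {c: float("inf") for c in clients}  # cheapest replica so far
    for _ in range(k):
        def total_with(s):
            return sum(min(best[c], cost[c][s]) for c in clients)
        s = min(servers - set(chosen), key=total_with)
        chosen.append(s)
        for c in clients:
            best[c] = min(best[c], cost[c][s])
    return chosen

# Hypothetical client-group -> server latency matrix (ms).
cost = {
    "asia":   {"s1": 30,  "s2": 200, "s3": 180},
    "europe": {"s1": 190, "s2": 25,  "s3": 60},
    "us":     {"s1": 210, "s2": 70,  "s3": 20},
}
print(greedy_placement(cost, k=2))
# -> ['s3', 's1']  (s3 first minimizes total cost; s1 then covers Asia)
```

The greedy heuristic is not optimal in general, but it is the standard baseline this work builds on, and its per-step cost is what makes per-URL placement so expensive at Web scale.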

    Previous work uses per-Website replication. By looking at a finer granularity, we found that replication on a per-URL basis can significantly reduce clients' average latency; however, it suffers from large management overhead. As a solution, we propose clustering the URLs and replicating in units of clusters.

    At what granularity - per Website or per URL? Here "URL" refers to the Web object that the URL address points to. The tradeoff is replication performance, in terms of users' access latency, vs. management overhead.

    Of course, we don't have to replicate all Web content, especially the particularly dynamic pieces. According to Anja Feldmann's Infocom '99 paper, about 40% of requests are dynamically generated, but we can still push the remaining 60%.

    Pushing replicas with greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01], but it is expensive for replica location management.

    Architecture: we will compare uncooperative pull-based vs. cooperative push-based replication. There are many ISPs, as well as CDN name servers and the Web content server, which hosts the Web contents denoted by the green box. When client 1 requests the green URL, the hostname resolution has to go through the CDN name server, which returns the IP address of the local CDN server of ISP 1. Client 1 then sends its request to that local CDN server, which essentially acts as a cache. The problem is that the CDN name servers don't track where the content has been replicated: when client 2 requests the green URL, the CDN name server still replies with the IP address of the CDN server of ISP 2, although the content has already been replicated on CDN server 1, which could be quite close to client 2. The 4% figure: the comparison is under similar average latency for the two schemes. The problem, then, is how to do cooperative push efficiently. IDCs charge CDNs by the amount of bandwidth used, which translates to the bytes replicated; here we simplify this to the # of URL replicas, assuming the URLs are of the same size. As we show later, the non-uniform sizes don't really affect the results. We use greedy algorithms for both the per-Website and per-URL schemes. For the per-URL scheme, it takes 102 hours to compute 10 replicas/URL for the MSNBC traces on a PII-400 machine.

    We show that with K only 1-2% of M, we can achieve performance close to that of M. Clustering costs will be addressed separately, since they differ for static clustering and incremental clustering.


    Although the greedy algorithm has been shown to be the most cost-effective heuristic for the NP-complete replica location problem, it is computationally expensive at per-URL granularity.

    Semantics: contents in the same directory do not necessarily correlate well in terms of user access.

    We use the K-split algorithm by Gonzalez, with cost O(NK). Intuition: each vector uniquely represents a URL in the high-dimensional space. Vector similarity is the cosine of the angle between two vectors, regardless of their lengths; in other words, it characterizes only the relative ratio of the # of accesses from each client cluster, rather than the absolute values.
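The note above combines two pieces: cosine distance over spatial access vectors (one count per client cluster) and Gonzalez's farthest-point K-split. A minimal sketch under those assumptions, with made-up access vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle).  Insensitive to vector length, so it compares the
    *ratio* of accesses from each client cluster, not absolute counts."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def k_split(vectors, k):
    """Gonzalez's farthest-point clustering, cost O(N*K): repeatedly
    pick the point farthest from the chosen centers, then assign every
    point to its nearest center."""
    names = list(vectors)
    centers = [names[0]]
    while len(centers) < k:
        far = max(names, key=lambda n: min(
            cosine_distance(vectors[n], vectors[c]) for c in centers))
        centers.append(far)
    return {n: min(centers, key=lambda c:
                   cosine_distance(vectors[n], vectors[c])) for n in names}

# Hypothetical spatial access vectors: accesses per client cluster.
access = {
    "/news/a":   (100, 10, 5),
    "/news/b":   (50, 5, 2),    # same access *ratio* as /news/a
    "/sports/c": (3, 80, 90),
    "/sports/d": (1, 40, 50),
}
print(k_split(access, k=2))  # the two /news URLs land in one cluster
```

Note that /news/a and /news/b end up together despite very different absolute counts, which is exactly the length-invariance the cosine measure provides.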

    Mention semantics-based clustering.

    Here the spatial access vector essentially captures the aggregated access pattern of that URL. Temporal clustering tries to characterize how frequently two URLs are accessed together in the same session, normalized by the total # of accesses.

    Popularity-based clustering uses pure access counts (no matter where they come from) for clustering. It performs very well, which may imply that popular URLs are globally popular. Tested over various topologies and traces.

    Popularity-based clustering also performs very well, implying that popular URLs are globally popular and that URLs with similar popularity may have similar aggregated access patterns. We use this conjecture in our incremental clustering later.

    1-2% of URLs, say 10 clusters

    The major reason for dynamic clustering is the emergence of new content, which needs to be appropriately clustered and replicated, especially when there is no access history. The goal is two-fold: 1. minimal perturbation of the existing clusters; 2. clustering and replication that reduce retrieval cost. After the framework: two options and different implementations. We replicate the URLs before they are accessed; the challenge is how to predict access patterns from semantic information alone. From the previous study, we know that popularity-based clustering performs very well, i.e., URLs with similar popularity have similar aggregated access patterns. How do we find groups of URLs with similar popularity, so that we can infer a new URL's popularity from the popularity of old URLs in the same group? We explored two simple options using hyperlink structure: one is groups of siblings, the other is groups at the same hyperlink depth. To compare the two methods, we measure the divergence within each group; groups of siblings have the best correlation. Then talk about the graph! We carried out the following experiments. First, we use popularity-based clustering and replication to distribute the URLs in the 8-10am two-hour traces. Then, at 10am, 16 new URLs are created and need to be replicated; we use the online incremental clustering algorithm above to cluster and replicate them.
    We then observe their performance over the next two hours and compare with other schemes. First, compared with static clustering and no replication for the 16 URLs, our scheme reduces the average retrieval cost to only 1/8. Second, compared with random replication of the 16 URLs with the same # of replicas, our scheme reduces the average retrieval cost to about 20%. Finally, compared with the oracle case (which assumes knowledge of the next two hours' accesses in advance and uses expensive static clustering and replication), our scheme doubles the cost. But the oracle case can never be achieved.
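The sibling-group heuristic described in the notes above - infer a brand-new URL's popularity from old URLs linked by the same parent page - can be sketched in a few lines. Data, URL names, and the plain mean are all illustrative assumptions; the real system works on the full Web content and traces.

```python
# Predict a brand-new URL's popularity from its hyperlink siblings:
# URLs linked from the same parent page tend to have similar popularity,
# so a new URL inherits the mean access count of its observed siblings.

siblings_of = {                       # parent page -> URLs it links to
    "/index.html":   ["/top1.html", "/top2.html", "/breaking.html"],
    "/archive.html": ["/old1.html", "/old2.html", "/stale.html"],
}
accesses = {                          # observed counts for old URLs
    "/top1.html": 9000, "/top2.html": 11000,
    "/old1.html": 40, "/old2.html": 60,
}

def predict_popularity(new_url):
    """Mean access count of the new URL's already-observed siblings."""
    for children in siblings_of.values():
        if new_url in children:
            seen = [accesses[u] for u in children if u in accesses]
            if seen:
                return sum(seen) / len(seen)
    return None  # no sibling history: fall back to a default policy

print(predict_popularity("/breaking.html"))  # -> 10000.0
print(predict_popularity("/stale.html"))     # -> 50.0
```

The predicted popularity then decides which existing cluster (and hence which replica set) the new URL joins, before a single request for it has arrived.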

    1/8 and 1/5 compared with no replication and random replication, respectively; double the optimal. Compared w/ per-URL based replication,



    World Cup log peak time: 209K requests/min, 580MB transferred per minute. MSNBC SIGCOMM 2000 paper: numbers for the order of # of URLs and clients.

    The CDN operators set up the latency constraints based on the different classes of clients and on human perception of latency.

    What if there is no Tapestry? With a DHT, from the operator's point of view, we divide the problems; the other problem is solved. Why choose Tapestry? Introduce replicas, caches, etc., then updates. Tapestry: who is involved and the functionalities. Introduce the dissemination tree and its soft-state maintenance. Ongoing: better clustering methods, a dynamic service model. One reason is that size doesn't differ much across the top 1000 URLs (from several hundred bytes to tens of thousands of bytes). The scalability issue with Summary Cache is that it targets O(100) proxies and O(1M) pages, which is not as good as we need. The key problem is that proxy servers are usually installed by ISPs, which may not accept the cooperative model. Distributed load balancing is implemented through request redirection mechanisms, as in server-initiated Web caching, pull-based CDNs, and SCAN. Orthogonality(AC, BC) = 1 - co-occurrences of loss(AC, BC) / (loss occurrences(AC) + loss occurrences(BC)).
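The orthogonality formula at the end can be computed directly from per-path loss time series. A minimal sketch, with invented 8-interval loss indicators for paths AC and BC:

```python
def orthogonality(loss_ac, loss_bc):
    """Orthogonality(AC, BC) =
       1 - co-occurrences of loss / (losses on AC + losses on BC).

    loss_ac / loss_bc are boolean series: True if the path saw loss in
    that measurement interval.  Paths that never lose together score 1;
    paths whose losses co-occur score lower."""
    co = sum(1 for a, b in zip(loss_ac, loss_bc) if a and b)
    total = sum(loss_ac) + sum(loss_bc)
    if total == 0:
        return 1.0  # no losses at all: trivially orthogonal
    return 1.0 - co / total

# Hypothetical loss indicators over 8 measurement intervals.
ac = [True, False, False, True, False, False, False, True]
bc = [False, False, True, False, False, True, False, False]
print(orthogonality(ac, bc))  # no co-occurring losses -> 1.0
```

Two paths with high orthogonality make good independent vantage points for SCAN's loss/congestion prediction, which is why the stability of this score over time matters.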