Post on 14-Oct-2014
High Performance P2P Web Caching
Erik Garrison, Jared Friedman
CS264 Presentation, May 2, 2006
SETI@Home
● Basic Idea: people donate computer time to look for aliens
● Delivered more than 9 million CPU-years
● Guinness Book of World Records – largest computation ever
● Many other successful projects (BOINC, Google Compute)
● The point: many people are willing to donate computer resources for a good cause
Wikipedia
● About 200 servers required to keep the site live
● Hosting and hardware costs over $1M per year
● All revenue from donations
● Hard to make ends meet
● Other not-for-profit websites are in a similar situation
HelpWikipedia@Home
● What if people could donate idle computer resources to help host not-for-profit websites?
● They probably would!
● This is the goal of our project
Prior Work
● This doesn't exist
● But some things are similar
Content Distribution Networks (Akamai)
● Distributed web hosting for big companies
CoralCDN/CoDeeN
● P2P web caching, like our idea
● But a very different design
● Both have some problems
Akamai, the opportunity
● Internet traffic is 'bursty'
● Expensive to build infrastructure to handle flash crowds
● International audience, local servers
    ● Sites run slowly in other countries
Akamai, how it works
● Akamai put >10,000 servers around the globe
● Companies subscribe as Akamai clients
● Client content (mostly images, other media) is cached on Akamai's servers
● Tricks with DNS make viewers download content from nearby Akamai servers
● Result: website runs fast everywhere, no worries about flash crowds
● But VERY expensive!
CoralCDN
● P2P web caching
● Probably the closest system to our goal
● Currently in late-stage testing on PlanetLab
● Uses an overlay and a 'distributed sloppy hash table'
● Very easy to use – just append '.nyud.net' to a URL and Coral handles it
● Unfortunately ...
Coral: Problems
● Currently very slow
    ● This might improve in later versions
    ● Or it might be due to the overlay structure
● Security: volunteer nodes can respond with fake data
● Any site can use Coral to help reduce load
    ● Just append .nyud.net to their internal links
● Decentralization makes optimization hard
    ● More on this later
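The '.nyud.net' rewrite mentioned above is simple enough to sketch. The following is a minimal illustration, not Coral's own tooling: the function name `coralize` and the choice to leave already-rewritten URLs untouched are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

CORAL_SUFFIX = ".nyud.net"  # suffix named in the slides


def coralize(url: str) -> str:
    """Append the Coral suffix to the host part of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Leave relative URLs and already-rewritten URLs unchanged.
    if not host or host.endswith(CORAL_SUFFIX):
        return url
    netloc = host + CORAL_SUFFIX
    # Keep an explicit port if present; any user:password@ prefix is dropped.
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

A site would apply something like this to its internal links so that readers fetch pages through Coral rather than from the origin server.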
Our Design Goals
● Fast: Akamai-level performance
● Secure: pages served are always genuine
● Fast updates possible
● Must greatly reduce demands on main site
    ● But this cannot compromise the first 3
Our Design
● Node/supernode structure
    ● Take advantage of extremely heterogeneous performance characteristics
● Custom DNS server redirects incoming requests to nearby supernode
● Supernode forwards request to nearby ordinary node
● Node replies to user
Our Design
User goes to wikipedia.org
DNS server resolves wikipedia.org to a supernode
Supernode forwards request to the ordinary node that has the requested document
Node retrieves document and sends it to user
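The request path on this slide can be modelled in a few lines. This is only a sketch under stated assumptions: the class names, the in-memory `documents` dictionary, and the use of squared coordinate distance as a stand-in for "nearby" are all invented for illustration and are not taken from the authors' design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    documents: Dict[str, bytes] = field(default_factory=dict)

    def serve(self, path: str) -> bytes:
        # Final step: the ordinary node replies to the user.
        return self.documents[path]


@dataclass
class Supernode:
    name: str
    location: Tuple[float, float]  # assumed (latitude, longitude)
    nodes: List[Node] = field(default_factory=list)

    def forward(self, path: str) -> Node:
        # Hand the request to an ordinary node that holds the document.
        for node in self.nodes:
            if path in node.documents:
                return node
        # How a miss is handled is not covered on the slide.
        raise LookupError(f"no node under {self.name} holds {path}")


def resolve(supernodes: List[Supernode], user_location: Tuple[float, float]) -> Supernode:
    # Stand-in for the custom DNS server: pick the closest supernode,
    # using squared coordinate distance as a crude proximity measure.
    def dist2(s: Supernode) -> float:
        return (s.location[0] - user_location[0]) ** 2 + (s.location[1] - user_location[1]) ** 2
    return min(supernodes, key=dist2)


def handle_request(supernodes: List[Supernode], user_location: Tuple[float, float], path: str) -> bytes:
    supernode = resolve(supernodes, user_location)  # DNS step
    node = supernode.forward(path)                  # supernode step
    return node.serve(path)                         # node replies to user
```

The user only ever talks to a supernode (via DNS) and then to one ordinary node.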
Performance
● Requests are answered in only 2 hops
● DNS server resolves to a geographically close supernode
● Supernode avoids sending requests to slow or overloaded nodes
● All parts of a page (e.g., HTML and images) should be served by a single node
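One way a supernode might skip slow or overloaded nodes and keep a whole page on one node is sketched below. The field names, the 200 ms latency cutoff, and the "least relative load, then lowest latency" ordering are assumptions for illustration; the slides do not specify a selection rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeStats:
    name: str
    avg_latency_ms: float   # measured by the supernode (assumed metric)
    active_requests: int
    capacity: int           # max concurrent requests the node should get

    def is_healthy(self, max_latency_ms: float) -> bool:
        return self.active_requests < self.capacity and self.avg_latency_ms <= max_latency_ms


def pick_node(nodes: List[NodeStats], max_latency_ms: float = 200.0) -> Optional[NodeStats]:
    # Drop slow or overloaded nodes, then prefer the least-loaded one,
    # breaking ties by latency.
    healthy = [n for n in nodes if n.is_healthy(max_latency_ms)]
    if not healthy:
        return None
    return min(healthy, key=lambda n: (n.active_requests / n.capacity, n.avg_latency_ms))


def route_page(resources: List[str], nodes: List[NodeStats],
               max_latency_ms: float = 200.0) -> Dict[str, str]:
    # Send every part of the page (HTML, images, ...) to the same node.
    node = pick_node(nodes, max_latency_ms)
    if node is None:
        return {}
    return {resource: node.name for resource in resources}
```

Routing the whole page to one node means the user's browser contacts a single peer for all of the page's resources.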
Security
● Have to check nodes' accuracy
● First line of defense: encrypt local content
● May delay attacks, but won't stop them
Security
● More serious defense: let users check the volunteer nodes!
● Add a JavaScript wrapper to the website that requests the pages using AJAX
● With some probability, the AJAX script will compute the MD5 of the page it got and send it to a trusted central node
● Central node kicks out nodes that frequently return pages with invalid MD5 sums
● Offload processing not just to nodes, but to users, with zero-install
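The slides describe this check as running in the user's browser via JavaScript; the Python sketch below models only the logic, not the AJAX wrapper. The class and function names, the 10% check probability, and the "ban after 3 bad reports" threshold are assumptions chosen for the example.

```python
import hashlib
import random
from collections import defaultdict


class TrustedCentralNode:
    """Holds the genuine MD5 of each page and tracks misbehaving nodes."""

    def __init__(self, trusted_pages, max_failures=3):
        self.trusted_md5 = {path: hashlib.md5(body).hexdigest()
                            for path, body in trusted_pages.items()}
        self.failures = defaultdict(int)
        self.banned = set()
        self.max_failures = max_failures  # assumed threshold

    def report(self, node_name, path, md5_hex):
        # A mismatch (or an unknown path) counts against the volunteer node.
        if md5_hex != self.trusted_md5.get(path):
            self.failures[node_name] += 1
            if self.failures[node_name] >= self.max_failures:
                self.banned.add(node_name)


def client_check(central, node_name, path, body, check_probability=0.1, rng=random):
    """With some probability, hash the received page and report it.

    Returns True if a report was sent.
    """
    if rng.random() < check_probability:
        central.report(node_name, path, hashlib.md5(body).hexdigest())
        return True
    return False
```

Because each user checks only occasionally, the per-user cost is small, while a node that serves fake pages to many users is still likely to be reported. MD5 is used here because the slides name it; a deployment today would probably pick a stronger hash such as SHA-256.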
A Tricky Part
● Supernodes get requests, have to decide what node should answer what requests
● Have to load-balance nodes – no overloading
● Popular documents should be replicated across many nodes
● But don't want to replicate unpopular documents much – conserve storage space
● Lots of conflicting goals!
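To make the trade-off concrete, here is one simple replication rule: give each document a share of storage slots proportional to its share of requests, with at least one copy and at most one copy per node. This is an illustrative heuristic only, not the placement algorithm the authors evaluated; `replica_counts` and its parameters are hypothetical.

```python
def replica_counts(request_counts, num_nodes, total_slots):
    """Map each document to a replica count based on its popularity.

    request_counts: dict of document -> observed number of requests
    num_nodes:      number of ordinary nodes (upper bound on replicas)
    total_slots:    rough storage budget, in document copies
    """
    total = sum(request_counts.values())
    counts = {}
    for doc, hits in request_counts.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        proportional = hits * total_slots // total if total else 0
        # At least one copy so the document stays available,
        # at most one copy per node.
        counts[doc] = max(1, min(num_nodes, proportional))
    return counts
```

The one-copy minimum means the result can slightly exceed `total_slots` when there are many rarely requested documents, which is one small instance of the conflicting goals listed above.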
On the plus side...
● Unlike Coral & CoDeeN, supernodes know a lot of nodes (maybe 100-1000?)
● They can track performance characteristics of each node
● Make object placement decisions from a central point
● Lots of opportunity to make really intelligent decisions
    ● Better use of resources
    ● Higher total system capacity
    ● Faster response times
Object Placement Problem
● This kind of problem is known as an object placement problem
    ● “What nodes do we put what files on?”
● Also related to the request routing problem
    ● “Given the files currently on the nodes, what node do we send this particular request to?”
● These problems are basically unsolved for our scenario
● Analytical solutions have been done for very simplified, somewhat different cases
● We suspect a useful analytic solution is impossible here
Simulation
● Too hard to solve analytically, so do a simulation
● Goal is to explore different object placement algorithms under realistic scenarios
● Also want to model the performance of the whole system
    ● What cache hit ratios can we get?
    ● How does number/quality of peers affect cache hit ratios?
    ● How is user latency affected?
● Built a pretty involved simulation in Erlang
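For a feel of the kind of question such a simulation answers, here is a toy cache-hit-ratio experiment. It is not the authors' Erlang simulator: the Zipf-like popularity model, the per-node LRU caches, and the "document id modulo node count" routing are all assumptions made to keep the example short.

```python
import random
from collections import OrderedDict


def zipf_weights(num_docs, exponent=1.0):
    # Popularity of the document at rank r is proportional to 1 / r**exponent.
    return [1.0 / (rank ** exponent) for rank in range(1, num_docs + 1)]


def simulate(num_docs=1000, num_nodes=10, cache_size=20,
             num_requests=20000, seed=0):
    """Return the fraction of requests answered from a node's cache."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    weights = zipf_weights(num_docs)
    caches = [OrderedDict() for _ in range(num_nodes)]  # one LRU cache per node
    hits = 0
    for doc in rng.choices(range(num_docs), weights=weights, k=num_requests):
        cache = caches[doc % num_nodes]  # trivial routing rule (assumption)
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)        # mark as most recently used
        else:
            cache[doc] = True             # miss: fetch from the main site and store
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / num_requests
```

Calling `simulate` with different `cache_size` or `num_nodes` values shows how the hit ratio responds to the amount of donated storage, which is the sort of sensitivity the bullets above ask about.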
Simulation Results
● So far, encouraging!
● Main results using a heuristic object placement algorithm
● Can load-balance without creating hotspots up to about 90% of theoretical capacity
● Documents rarely requested more than once from central server
● Close to theoretical optimum
Next Steps
● Add more detail to simulation
    ● Node churn
    ● Better internet topology
● Explore update strategies
● Obviously, an actual implementation would be nice, but not likely to happen this week
● What do you think?