Network Computing Laboratory
FeedEx: Collaborative Exchange of News Feeds
Seung Jun, Mustaque Ahamad
Georgia Institute of Technology
WWW 2006
Korea Advanced Institute of Science and Technology
Network Computing Laboratory | 2
Outline
One line comment
Motivation/Problem
Approach
Analysis of feed publishing
Challenges
Experiments
Critique
One line comment
Disseminate web feeds in a distributed (P2P) manner to increase scalability of web servers
RSS reveals visitors to content providers
RSS decouples the fetch operation from reading
[Figure: traditional method vs. P2P method of delivering an RSS feed to readers A and B]
Motivation & Problem
RSS/Atom feeds have become increasingly popular
Published by most traditional media and blogs
Feeding mechanism
Server (nyt.com): updates the feed page (http://nyt.com/../feed.xml) as contents are added
RSS reader: polls the server with HTTP requests to check for updates
Problem: scalability, since every reader polls the same server
Approach
P2P overlay + gossip-based protocol
P2P: resources grow with service demand
Gossip: scalable and robust to join & leave
Feature of this overlay
Does not have to guarantee delivery or delay
Challenges
Overlay construction
Fetching interval determination
Data dissemination
Free riding prevention
Content searching
Analysis of Feed Publishing
Methodology
245 popular feeds monitored for 10 days
Most popular feeds: information from Gmail’s web clips, Bloglines
Feeds fetched every 2 minutes
Measured
Publishing rate
Entry count in a feed
Entry lifetime
Publishing Rate by Rank
Great difference between publishers
Partly follows a Zipf distribution
[Figure: publishing rate by feed rank]
Entry Count
Does a high publishing rate mean a higher entry count? No
Entry lifetimes are short, so entries can be lost with infrequent requests
Publishing Rate by Time
4 types of publishing patterns
[Figure: entries published per hour vs. time (day 0 to 7, Sat and Sun marked) for Reuters, Yahoo(M), Motley Fool, and NPR]
Challenges: Overlay Construction (1/2)
Goal: Minimize network management overhead
Join
1. Contact a well-known host OR previous neighbors
2. Share subscription set info
3. Update subscription set info to the network
Leave
Soft-state
Update subscription set periodically
[Figure: nodes joining through a gateway; each node keeps a neighbor list and a subscription set of (dest, hop) entries, e.g. (CNN, 0), (YAHOO, 0), (HANI, 1), (CNN, 1)]
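A minimal Python sketch of how a node might fold a neighbor's advertised subscription set into its own dest/hop table. The merge rule (neighbor's hop plus one, keep the smaller hop, drop entries past a hop limit) and the names `merge_subscription_set` and `max_hops` are assumptions for illustration; the slides only show the dest/hop tables themselves.

```python
def merge_subscription_set(own, advertised, max_hops=2):
    """Return a new dest -> hop table after hearing a neighbor's table.

    own:        dict mapping feed name to hop count (0 = subscribed by this node)
    advertised: the neighbor's dest -> hop table
    max_hops:   hypothetical limit on how far subscription info propagates
    """
    merged = dict(own)
    for dest, hop in advertised.items():
        via_neighbor = hop + 1  # one hop farther when seen through the neighbor
        if via_neighbor > max_hops:
            continue
        if dest not in merged or via_neighbor < merged[dest]:
            merged[dest] = via_neighbor
    return merged


own = {"YAHOO": 0}
neighbor = {"CNN": 0, "HANI": 1}
table = merge_subscription_set(own, neighbor)
print(table)
```

Under this rule a feed that a neighbor subscribes to directly (hop 0) shows up with hop 1 in the local table, which matches the CNN 0 and CNN 1 entries in the figure.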
Challenges: Overlay Construction (2/2)
Neighbor selection
Many neighbors may incur overhead
Need to adapt to my resource status: select neighbors that are “useful” to me
Useful neighbors are those whose subscription set is similar to mine
[Figure: node A's table is HANI 0, CNN 0, YAHOO 0, DAUM 0; node B's table is NCLAB 0, CNN 0, HANI 1, DAUM 2; the overlap is 1 direct, 1 one-hop, 1 two-hop]
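The overlap count in the figure can be reproduced with a short Python sketch. `overlap_by_hop` only counts matches per hop; the `usefulness` score and its `decay` parameter are hypothetical, added to show one way a node could rank candidate neighbors.

```python
from collections import Counter


def overlap_by_hop(my_feeds, neighbor_table):
    """Count my subscribed feeds found in a neighbor's table, grouped by hop."""
    return Counter(hop for dest, hop in neighbor_table.items() if dest in my_feeds)


def usefulness(my_feeds, neighbor_table, decay=0.5):
    """Hypothetical score: closer matches count more (decay ** hop)."""
    counts = overlap_by_hop(my_feeds, neighbor_table)
    return sum(n * decay ** hop for hop, n in counts.items())


# Tables taken from the figure: A's direct subscriptions and B's dest/hop table.
a_feeds = {"HANI", "CNN", "YAHOO", "DAUM"}
b_table = {"NCLAB": 0, "CNN": 0, "HANI": 1, "DAUM": 2}

counts = overlap_by_hop(a_feeds, b_table)
score = usefulness(a_feeds, b_table)
print(dict(counts), score)
```

A node with limited resources could keep only the highest-scoring neighbors and drop the rest.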
Challenges: Fetching interval determination
Adaptive fetching
Problem: few hints about the publishing rate or entry lifetime
Frequent polling: overloads servers, consumes clients’ network bandwidth
Lazy polling: increases delay or misses entries
Adaptive algorithm
Intuition: frequent fetching yields few new entries per fetch
Freshness rate: fraction of new entries in the fetched document
If freshness rate < target freshness: halve the fetching rate
If freshness rate > target freshness: double the fetching rate
[Figure: fetching the HANI feed; entries in a feed: 1. Report 1, 2. Report 2, 3. Report 3, 4. …]
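The adaptive rule above can be sketched in Python as follows. The sketch works on the fetching interval rather than the rate, so halving the rate becomes doubling the interval. The default `target`, the clamping bounds `min_interval` and `max_interval`, and the function names are assumptions for illustration.

```python
def freshness_rate(fetched_ids, seen_ids):
    """Fraction of entries in the fetched document that were not seen before."""
    if not fetched_ids:
        return 0.0
    new = [e for e in fetched_ids if e not in seen_ids]
    return len(new) / len(fetched_ids)


def next_interval(interval, freshness, target=0.5,
                  min_interval=60.0, max_interval=86400.0):
    """Adapt the fetching interval (seconds) from the last freshness rate.

    target, min_interval and max_interval are illustrative assumptions.
    """
    if freshness < target:
        interval *= 2      # too few new entries: fetch half as often
    elif freshness > target:
        interval /= 2      # many new entries: fetch twice as often
    return max(min_interval, min(max_interval, interval))


seen = {"report-1", "report-2", "report-3"}
fetched = ["report-1", "report-2", "report-3", "report-4"]
rate = freshness_rate(fetched, seen)
print(rate, next_interval(600.0, rate))
```

Here only one of four fetched entries is new, so the freshness rate falls below the target and the next fetch is scheduled twice as far away.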
Challenges: Data dissemination
Goal: Minimize bandwidth consumption
1. Limit the boundary of delivery
Forward only to matching neighbors (subscription set, hop_count)
Reduces forwarding overhead
2. Reduce the unit of delivery
Unit of delivery: entry bundle
A set of new entries (old entries are filtered out)
Reduces redundant content delivery
3. Check before forwarding
Exchange the ID of an entry bundle (ID: SHA-1 digest of the bundle)
If the bundle is undelivered, deliver it
[Figure: a node fetches HANI; other nodes list HANI at hops 0, 0, 1, and 2; max subset hops = 1]
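The three steps can be sketched as an in-memory Python simulation. The procedure names `check_did` and `put_entries` appear in the communication cost results, but their signatures here are assumed, as are the JSON serialization used before hashing and the `Node` and `forward` helpers.

```python
import hashlib
import json


def bundle_id(entries):
    """SHA-1 digest of an entry bundle (JSON serialization is an assumption)."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class Node:
    def __init__(self, name, table):
        self.name = name
        self.table = table      # dest -> hop subscription set
        self.delivered = {}     # bundle id -> entries

    def check_did(self, did):
        """True if this node already holds the bundle with this id."""
        return did in self.delivered

    def put_entries(self, entries):
        """Store a delivered bundle under its digest."""
        self.delivered[bundle_id(entries)] = entries


def forward(feed, entries, neighbors, max_hops=1):
    """Send a bundle only to matching neighbors that do not have it yet."""
    did = bundle_id(entries)
    sent_to = []
    for n in neighbors:
        hop = n.table.get(feed)
        if hop is None or hop > max_hops:
            continue            # outside the delivery boundary
        if n.check_did(did):
            continue            # already delivered: skip the payload
        n.put_entries(entries)
        sent_to.append(n.name)
    return sent_to


nodes = [Node("n0", {"HANI": 0}), Node("n1", {"HANI": 1}), Node("n2", {"HANI": 2})]
bundle = [{"id": "1", "title": "Report 1"}]
first = forward("HANI", bundle, nodes)
second = forward("HANI", bundle, nodes)
print(first, second)
```

With max subset hops set to 1, the node that lists HANI at hop 2 never receives the bundle, and the second attempt sends nothing because every matching neighbor already reports the digest.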
Challenges: Free riding prevention
Nodes may manifest selfish behavior
Only receive, without forwarding
Lie about the subscription set to become a preferred neighbor
Solution: Provide a neighbor evaluation method
Contribution metric
Credits nodes that forward feeds that I subscribe to, or that my near neighbors subscribe to
Level of contribution: direct subscription, 1-hop subscription, 2-hop subscription, …
$cm_{i,j} \leftarrow cm_{i,j} + w_f^{-h_f}$
Cut out unhelpful neighbors: I helped it, but it did not help me
$d_{i,j} = cm_{i,j} - cm_{j,i}$
Features
Uses local information only
Easy to implement and enforce
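A minimal Python sketch of the bookkeeping one node could keep for neighbor evaluation. It assumes the per-bundle credit shrinks with hop count as `weight ** -hop`, reads the deficit as what I gave a neighbor minus what it gave me, and uses a hypothetical `threshold` for cutting neighbors. The class and method names are illustrative.

```python
from collections import defaultdict


class ContributionLedger:
    """Per-neighbor contribution bookkeeping kept locally by one node."""

    def __init__(self, weight=2.0):
        self.weight = weight                # hypothetical base, greater than 1
        self.given = defaultdict(float)     # my contribution to each neighbor
        self.received = defaultdict(float)  # each neighbor's contribution to me

    def credit(self, hop):
        """Assumed increment: weight ** -hop, so direct subscriptions count most."""
        return self.weight ** -hop

    def on_received(self, neighbor, hop):
        self.received[neighbor] += self.credit(hop)

    def on_forwarded(self, neighbor, hop):
        self.given[neighbor] += self.credit(hop)

    def deficit(self, neighbor):
        """What I gave the neighbor minus what it gave me."""
        return self.given[neighbor] - self.received[neighbor]

    def unhelpful(self, threshold=1.0):
        """Neighbors whose deficit exceeds a hypothetical cut-off."""
        names = set(self.given) | set(self.received)
        return sorted(n for n in names if self.deficit(n) > threshold)


ledger = ContributionLedger()
for _ in range(3):
    ledger.on_forwarded("B", hop=0)   # I forwarded three direct-subscription bundles to B
ledger.on_received("B", hop=1)        # B sent back one bundle for a one-hop feed
ledger.on_forwarded("C", hop=0)
ledger.on_received("C", hop=0)        # C reciprocates fully
print(ledger.deficit("B"), ledger.unhelpful())
```

Everything in the ledger comes from events the node observes itself, which is consistent with the "local information only" feature.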
Challenges: Entry searching
Overlay as a distributed storage
Iterative searching
Strong points: searching latency, query traffic
Recursive searching (flooding)
Strong points: low overhead for the requester, caching of popular queries, can be reflected in neighbor evaluation
Benefits of FeedEx
1. Scalability
2. Archivability
Storage of entries
3. Controllability
Compared to web-based readers, e.g. fetch interval
4. Filtering and recommendation
Share opinions on entries (e.g. voting)
Feed recommendation
5. Privacy
Users can fetch documents for others, which anonymizes actual users
Architecture of FeedEx
[Figure: FeedEx architecture; labels: Connector, Feed Fetch Scheduler, Neighbor, Server, RPC; links: to news feed servers, to neighbors, from neighbors, to list server]
Prototype: Python
Networking: Twisted
Protocol : XML-RPC
Interoperability, fast-prototyping
Entry Storage: SQLite (Lightweight RDB)
RSS parser : feedparser.org
Experimental Setup
Two modes
Stand-alone mode (SLN)
FeedEx mode (XCH)
Metrics
Time lag
Missing entries
Communication cost
Experiments
Use 189 PlanetLab nodes
Run 22 hours on a weekday
Primary factor: 6 fetching intervals
Let each node subscribe to 20 out of 70 feeds
Results: Time Lag
Average time lag
Average of node averages
Without applying the adaptive fetching algorithm
Regardless of the fetching interval, contents are delivered soon
[Figure: time lag (hours) vs. fetching interval (hours); annotation: 15.8 times]
Results: Missing Entries
Rate of missing entries
# of entries in a node / # of entries in a reference node
Low missing rate despite problems (DNS errors or routing errors) in the network
Sometimes better than the reference node
[Figure: missing entries (%) vs. fetching interval (0.5, 1, 2, 4, 8, 16 hours); one series is labeled "XCH miss"]
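One way to read this metric, sketched in Python. It assumes the plotted percentage is the shortfall of a node's entry count relative to the reference node's count; the function name and this count-based reading are assumptions, not taken from the slides.

```python
def missing_rate_percent(node_entries, reference_entries):
    """Shortfall of a node's entry count relative to a reference node, in percent.

    A negative value means the node collected more entries than the reference.
    """
    if reference_entries <= 0:
        raise ValueError("reference node must have collected at least one entry")
    return (reference_entries - node_entries) / reference_entries * 100.0


print(missing_rate_percent(75, 100))   # node collected fewer entries
print(missing_rate_percent(125, 100))  # node collected more than the reference
```

Under this reading, a node that does "better than the reference node" shows up as a negative missing rate.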
Results: Communication Cost
Two most frequently called procedures: check_did, put_entries
check_did call: single IP packet
put_entries: 2 calls / minute, delivering 2.67 entries / call
Low communication cost
[Figure: received calls per minute vs. fetching interval (0.5, 1, 2, 4, 8, 16 hours); one series is labeled check_did]
Critique
Strong points
Made a new problem from an old domain, “web caching”
Free from delay / failure of nodes
Draws out possible benefits/extensions
Simple! Practically deployable
Tried to find a mechanism that is good for both servers and clients
Critique
Weak points
Overload due to RSS feed delivery?
Only a small text file is delivered
Should have considered podcasting (multimedia RSS)
Will the clients donate their resources?
Is “short delay” a strong incentive?
Is “low bandwidth consumption” a strong incentive?
Will the subscription sets of people really overlap a lot?
Not effective for SPs providing diverse RSS feeds
e.g. Naver blog, egloos..
Is it really robust to frequent leave and join?
Lack of server-side evaluation
Server load & network resources
Delivering critical data (e.g. timely news) using RSS?
Supplementary slides
Entry Lifetime
Generally CNN,
Publishers have policies (probably)
[Figure: cumulative probability vs. entry lifetime (hours, 0 to 80) for CNN, FOX News, Techbargains.com, and Beta News]
New idea
Topic-based feed pub/sub system
Why should we register the address of a feed?
Need to find addresses providing contents I want
A feed may contain contents that I don’t want
[Figure: web content providers send feeds to a P2P-based, topic-based feed pub/sub system; a user gives a topic of interest (maybe tags?) and receives contents related to the topic]
New idea
Topic-based feeding services are already launched
Baebo: creates new feeds by keywords from the Amazon, Yahoo, eBay feeds
Say4: extracts entries containing sentences in the Bible from the BBC feed
But a centralized server runs the service
Limitation in the number of input feeds
Hard to add input feeds dynamically compared to the P2P approach