© 2008 AT&T Intellectual Property. All rights reserved. XTreeNet: A Framework for Flexible Large Scale Information Dissemination & Retrieval TaeWon Cho, Divesh Srivastava, K. K. Ramakrishnan, Yin Zhang and many others AT&T Labs Research, NJ USA August 2011



Network as the Vehicle for Information Dissemination

• The ‘network’ will become (and already has become) increasingly information-centric

– Information of all types is becoming electronic and network accessible

– Information is accessed based on content of interest, instead of location

• Information overload and scale: producers and consumers both face challenges

– Large number of producers (publishers; data sources)

– Even larger number of consumers (subscribers, users querying/looking for content)

o The tremendous number of information producers makes it difficult for a consumer to know where to find relevant information

– Significant challenge: “whom and what to ask” & “whom and what to tell”

• XTreeNet looks at the various problems related to a network-based Information Dissemination and Retrieval environment

– Obtain “information” of interest by asking the network to find it

– Tell the network to deliver “information” of interest

– Ask the network what “information” I should be interested in


Role of the Network in Information Dissemination

• Success of information aggregators (search engines etc.) is unquestionable

– Information aggregators do play a key role

• Limitation:

– Disintermediates producers: constrains the business model of producers

• Timeliness and coverage are also key criteria for information dissemination

– Timeliness: information (including real-time information) needs to be available right away

o E.g., for a consumer to access real-time media content

o Ability for the content to be withdrawn is also desirable

– Coverage: availability of information depends on the set of information that is made available to the consumer by intermediaries, like an aggregator

o Information providers can be “dynamic”/transient. Complete coverage by an aggregator may be difficult

o Desirable to enable information producers themselves to make it available on an as-needed basis

• Publish-subscribe based access has become somewhat popular

– (E.g., news groups, RSS feeds)

• Information dissemination and query-response for information retrieval in a scalable manner is essential

– Inherently N-to-N communication

– We seek to exploit XML-tagging of information


XML Routing: Overlay Services based on XML

• An XML network: an overlay network of XML switches/routers

• XTreeNet project: investigate the design for a large-scale integrated publish/subscribe + query/response application

• How can we partition functions between the overlay and underlay?

Figure labels: IP Network Infrastructure; XML Overlay Network; XML router; Database; Publisher; Subscriber for alerts; Subscriber for information; Data query generation.


XTreeNet Overview

• Publishers and Subscribers submit Content Descriptors (CD’s) to the network

• As soon as a CD (from producer or consumer) hits the network, it is mapped into a single hash-id at the first overlay router

– Subsequent routers forward based on the hash-id downstream

– Much more efficient than matching against aggregated query filters

• XTreeNet builds a common core-based tree (CBT) on a per-“CD” basis, integrating both producers and consumers of information

– Dynamically create CBT on first arrival of CD from producer

• Groups (overlay multicast) formed on an as-needed basis for each CD

– Very fine grained distribution tree connecting producers & consumers

– Branches to subscribers for disseminating published content & branches to publishers for forwarding queries

– Different cores for different CDs – reduce likelihood of traffic concentration
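
The mapping from a CD to a hash-id, and from the hash-id to a per-CD core, could look roughly like the sketch below. The slides do not say which hash function, id width, or core-selection rule XTreeNet uses, so SHA-256, the 64-bit truncation, the modulo rule, and the names `cd_hash_id` and `core_for` are all assumptions made for illustration.

```python
import hashlib


def cd_hash_id(cd: str) -> int:
    """Map a content descriptor to a fixed-size id (64 bits here, an assumption)."""
    digest = hashlib.sha256(cd.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def core_for(cd: str, routers: list[str]) -> str:
    """Pick the core router for this CD's tree from its hash-id (hypothetical rule)."""
    return routers[cd_hash_id(cd) % len(routers)]


# Hypothetical overlay routers; a real deployment would have its own naming.
routers = ["r0", "r1", "r2", "r3"]

# A producer and a consumer that submit the same CD get the same hash-id,
# and therefore meet on the same core-based tree.
publisher_cd = "/rss/channel/item/title/ReutersNews"
subscriber_cd = "/rss/channel/item/title/ReutersNews"
same_tree = core_for(publisher_cd, routers) == core_for(subscriber_cd, routers)
```

Because the core depends only on the CD, different CDs tend to land on different cores, which is the load-spreading effect the last bullet describes.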


Content Descriptors

• A CD can be an element of a topic hierarchy; multiple hierarchies may be supported (e.g., topics, geographic location)

– An XML schema path (root-to-leaf path) may also be used as the basis of a hierarchically structured domain for constructing CDs

o Disambiguate between multiple XML documents using string values at leaves

Example XML document:

<rss>
  <channel>
    <editor> Jupiter </editor>
    <item>
      <title> ReutersNews </title>
      <link> reuters.com </link>
    </item>
    <description> abc </description>
  </channel>
</rss>

Corresponding tree (leaf values after the colon):

rss
  channel
    editor: Jupiter
    item
      title: ReutersNews
      link: reuters.com
    description: abc

• Content Descriptors (CDs) act like “indexes” in a distributed database environment

– Each data item generated by a producer and each consumer query filter are independently mapped to a set of CDs

– A data item matches a query when the respective sets of CDs have at least one CD in common

• CDs decouple producers from the consumers

– Can support heterogeneous producer schemas
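
The matching rule above (a data item matches a query when their CD sets share at least one CD) can be sketched as a set intersection, with an inverted index from CD to subscribers playing the role of the “index”. The class name `CDIndex`, its methods, and the sample CDs are hypothetical and not taken from the XTreeNet implementation.

```python
from collections import defaultdict


def matches(item_cds: set[str], query_cds: set[str]) -> bool:
    """A data item matches a query when the two CD sets share at least one CD."""
    return not item_cds.isdisjoint(query_cds)


class CDIndex:
    """Inverted index from CD to interested subscribers (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(set)

    def subscribe(self, subscriber: str, query_cds: set[str]) -> None:
        for cd in query_cds:
            self._subscribers[cd].add(subscriber)

    def recipients(self, item_cds: set[str]) -> set[str]:
        """Everyone whose query shares at least one CD with the item."""
        found = set()
        for cd in item_cds:
            found |= self._subscribers.get(cd, set())
        return found


index = CDIndex()
index.subscribe("alice", {"/title/ReutersNews"})
index.subscribe("bob", {"/title/OtherNews"})

item = {"/title/ReutersNews", "/editor/Jupiter"}
delivered_to = index.recipients(item)
```

Producers and consumers never refer to each other here, only to CDs, which is the decoupling the slide points to.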


Scalability of CDs

• Publisher guidance

o Information publisher provides guidance on which XML tags are of potential interest

• Strategies

o Fullpath: /rss/channel/item/title/ReutersNews

o Last Tag: /title/ReutersNews

o Keyword: ReutersNews

• Estimated by extracting CDs from the XML version of Wikipedia

Figure: Unique CDs generated by Wikipedia articles. X-axis: # of Wikipedia articles; y-axis: # of unique CDs (0 to 8,000,000). Series: Fullpath, Last Tag, Keyword, Last Tag + Keyword.

• ~5M CDs for about 1M articles; the count grows slowly because CDs are duplicated across documents
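
The three extraction strategies can be illustrated on the RSS example from the Content Descriptors slide. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `extract_cds` and the strategy keys are assumptions, and the slides do not describe how the actual extraction was implemented.

```python
import xml.etree.ElementTree as ET

RSS = """<rss><channel><editor> Jupiter </editor>
<item><title> ReutersNews </title><link> reuters.com </link></item>
<description> abc </description></channel></rss>"""

STRATEGIES = {"fullpath", "lasttag", "keyword"}


def extract_cds(xml_text: str, strategy: str) -> set[str]:
    """Build one CD per leaf element that carries a string value."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    cds = set()

    def walk(elem, path):
        path = path + [elem.tag]
        value = (elem.text or "").strip()
        if len(elem) == 0 and value:  # leaf element with a string value
            if strategy == "fullpath":
                cds.add("/" + "/".join(path) + "/" + value)
            elif strategy == "lasttag":
                cds.add("/" + elem.tag + "/" + value)
            else:  # keyword
                cds.add(value)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return cds


fullpath_cds = extract_cds(RSS, "fullpath")
lasttag_cds = extract_cds(RSS, "lasttag")
keyword_cds = extract_cds(RSS, "keyword")
```

Collecting the results in a set mirrors the point of the last bullet: repeated CDs across documents collapse to one entry, so the number of unique CDs grows more slowly than the number of documents.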


Scalable Multicast: Multicast Architecture with Adaptive Dual-state

• Multicast is key to efficient information dissemination

• Requirements for information-centric multicast:

– Scalability in group membership

o Fine granularity of access: support for a large number of groups

– Persistent access to group

o Network should be responsible for maintaining group membership unless users explicitly un-subscribe from group

– Minimize loss of information

– Keep control traffic scalable

• Limitations of existing IP / overlay multicast

o Forwarding state grows linearly with number of groups

– State overhead (at multiple routers)

o Soft-state needs to be refreshed

– Control overhead

o Hence, limits scalability and has inadequate persistence

• How to achieve scalable and persistent multicast?

• MAD seeks to solve issues of scale and persistence with multicast


Group Membership Lifetime & Activity Level

• Group activity can vary widely

– Analyzed publishing activity of RSS feeds

o Only 5% of RSS feeds publish more than 100 updates/month

o Median rate is 10 updates/month

– The 10% most active feeds contribute 75% of updates

• IP multicast: an inactive group is usually treated the same as an active group

o But can’t afford loss of information

Figure: RSS publishing rate (# updates/month)

Figure: Subscription count to YouTube channels

• Membership (e.g., in a pub-sub environment) is likely to be long-lived

• Users subscribe, and remain interested in receiving information even when publishers distribute infrequently

• Only 2.3% of groups see a reduction

• Long-lived membership results in:

– Network state grows for the group; increased group size


Using an IP-Multicast Style Approach

Figure: example topology with routers 00 to 15. Legend: First-hop router (FH); Forwarder; Router not participating; User.

• A lot of routers maintain forwarding state:

– 6 intermediate routers keep state that has to be constantly refreshed

– 4 first-hop routers also keep state

• Every intermediate router has to maintain state

o Forwarding state grows linearly with number of groups

– State overhead (at multiple routers)

o Soft-state needs to be refreshed

– Control overhead


The MAD environment

• MAD multicast service overlay consists of a set of logical overlay routers

• Each logical router serves as a single aggregated local subscriber for all users attached to it

• Subscription manager is responsible for all the users’ subscription management: it maintains subscriptions for users connected to the site


Differentiate the Roles of Multicast State

• Membership State vs. Forwarding State

• Group membership can be separated from forwarding state

– Group membership must be stored scalably and persistently

o Especially for groups that have low frequency of information flow

– Forwarding state: efficient forwarding of active groups

o Can be re-generated when a group becomes active

• Active and inactive groups can be treated differently

– Small percent of (active) groups generate data at a high rate: forward efficiently

– Large percent of (inactive) groups generate low traffic volume


The MAD Solution

• Group membership is separated from forwarding state: Multicast with Adaptive Dual State

• Use Membership Tree (MT) for scalable state maintenance

– Store group membership information in MT

o Minimize number of intermediate routers keeping group state

– Impose static virtual hierarchy => no control overhead

o But, static hierarchy may not result in optimal delivery path

• Use Dissemination Tree (DT) for forwarding efficiency

– Use DT for active groups

o Can use any “state-of-art” multicast protocol

• MAD may begin as an overlay multicast service

– Use IP multicast to improve forwarding efficiency for DT

– MT may also eventually evolve to being supported by the underlay

• MAD achieves best of both worlds: scalability and forwarding efficiency
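
The dual-state idea can be sketched as a table that always keeps membership but holds forwarding state only while a group is active, regenerating it from membership when activity picks up. The slides do not specify how MAD decides that a group is active, so the per-window update counter, the `ACTIVE_THRESHOLD` value, and the names `GroupState` and `DualStateTable` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

ACTIVE_THRESHOLD = 100  # hypothetical: updates per window that make a group "active"


@dataclass
class GroupState:
    members: set = field(default_factory=set)  # membership state, always kept
    recent_updates: int = 0                    # activity in the current window
    forwarding: Optional[frozenset] = None     # forwarding state, only while active


class DualStateTable:
    def __init__(self) -> None:
        self.groups = {}

    def join(self, group_id: str, first_hop: str) -> None:
        group = self.groups.setdefault(group_id, GroupState())
        group.members.add(first_hop)
        if group.forwarding is not None:  # keep active forwarding state in sync
            group.forwarding = frozenset(group.members)

    def publish(self, group_id: str):
        """Record one update and return (mode, recipients)."""
        group = self.groups.setdefault(group_id, GroupState())
        group.recent_updates += 1
        if group.forwarding is None and group.recent_updates >= ACTIVE_THRESHOLD:
            # Group became active: regenerate forwarding state from membership.
            group.forwarding = frozenset(group.members)
        if group.forwarding is not None:
            return "DT", group.forwarding
        return "MT", frozenset(group.members)

    def end_window(self) -> None:
        """Drop forwarding state of groups that went quiet; membership persists."""
        for group in self.groups.values():
            if group.recent_updates < ACTIVE_THRESHOLD:
                group.forwarding = None
            group.recent_updates = 0
```

In this sketch an inactive group costs only its membership entry, while an active group pays for forwarding state in exchange for efficient delivery.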


MAD Membership Tree protocol overview

• Goal of Membership Tree: reduce # routers keeping multicast group state

• MT selects the core (root) based on hash of group ID– Define a single base tree at this root (static)

– All groups selecting this root use the base tree to construct MT

• Subscriber join is forwarded up on the base tree until it reaches the first on-tree node for this group’s MT

– When a subtree rooted at an en-route router has more than a minimum number of first-hop routers with attached subscribers, the parent node on the MT requires that the en-route router join the MT

• MAD protocol provides for seamless transition to switch from DT to MT as level of group activity changes (reduces) over time
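
The join procedure can be sketched as follows, under several simplifying assumptions that are not in the slides: the base tree is given as a child-to-parent map, first-hop routers are leaves of that tree, and the names `select_root`, `MembershipTree`, and the sample topology are hypothetical. The real MAD protocol exchanges messages between routers; this sketch only tracks which routers end up holding state.

```python
import hashlib
from collections import defaultdict


def select_root(group_id: str, candidate_roots: list[str]) -> str:
    """Choose the MT core (root) from a hash of the group ID (SHA-256 is an assumption)."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    return candidate_roots[int.from_bytes(digest[:8], "big") % len(candidate_roots)]


class MembershipTree:
    def __init__(self, parent: dict[str, str], root: str, threshold: int):
        self.parent = parent        # static base tree: child -> parent
        self.threshold = threshold  # aggregation threshold
        # on-tree router -> {en-route child -> downstream nodes tracked via it}
        self.state = {root: defaultdict(set)}

    def _child_toward(self, ancestor: str, node: str) -> str:
        """Child of `ancestor` that lies on the base-tree path down to `node`."""
        while self.parent[node] != ancestor:
            node = self.parent[node]
        return node

    def join(self, first_hop: str) -> None:
        node = self.parent[first_hop]
        while node not in self.state:  # forward the join up the base tree
            node = self.parent[node]
        via = self._child_toward(node, first_hop)
        self.state[node][via].add(first_hop)
        self._maybe_promote(node, via)

    def _maybe_promote(self, on_tree: str, via: str) -> None:
        members = self.state[on_tree][via]
        # Nothing to aggregate if `via` is itself the tracked member
        # or the subtree has not exceeded the threshold.
        if via in members or len(members) <= self.threshold:
            return
        self.state[on_tree][via] = {via}  # parent now tracks one aggregated child
        self.state[via] = defaultdict(set)
        for member in members:            # en-route router takes over the entries
            self.state[via][self._child_toward(via, member)].add(member)
        for child in list(self.state[via]):
            self._maybe_promote(via, child)


# Hypothetical base tree: 00 is the root; 03, 04, 05 and 06 are first-hop routers.
parent = {"01": "00", "02": "00", "03": "01", "04": "01", "05": "01", "06": "02"}
mt = MembershipTree(parent, root="00", threshold=2)
for fh in ["03", "04", "06", "05"]:
    mt.join(fh)
on_tree_routers = set(mt.state)
```

With a threshold of 2, router 01 joins the MT only once a third first-hop router below it subscribes, while router 02 never holds state because the root tracks its single first-hop router directly.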


Routers Maintaining State in MAD

• Fewer routers maintain state:

– 2 intermediate routers and 4 FH routers

• Forwarding by multicast/unicast – not necessarily efficient

• MT reduces number of routers keeping Multicast State by aggregating subscriber state in a virtual sub-tree

Figure panels: Membership Tree (4 first-hops, 5 users) on the example topology with routers 00 to 15; Virtual membership tree (fan-out 8, aggregation threshold 2); Base Tree.


Scalability of Multicast with MAD

Figure (two plots): (1) x-axis: Number of First-Hop Routers in a Group; y-axis: Number of Groups (Trillions). (2) x-axis: Number of First-Hop Routers in a Group; y-axis: Total Delay (msec).

• State efficiency with MAD is significantly better than IP multicast-like approaches (DT)

• Forwarding efficiency with MAD is as good as IP multicast (DT)

• Evaluation using simulation and measurements with implementation

– Implementation measured on Emulab with about 100 routers

– Simulation with 16,000 routers; Power-law topology

• MAD achieves both efficient state maintenance and efficient forwarding


Summary

• XTreeNet: project we have been working on, primarily focused on the meta-data plane

– XTreeNet Architecture: complex processing at the edges; efficient forwarding in the core

– MAD: Scalable Multicast – Large # groups; Large # subscribers

– QDTs: Query Distribution Trees for Distribution of Complex Queries – Load Balancing, Privacy preservation, Censorship Resistant

– Recommendation Systems: Scalable, Privacy Preserving

• More recent work: “COPSS: An Efficient Content-Oriented Publish/Subscribe System” in collaboration with folks from University of Goettingen, Germany