
HKU CSIS DB Seminar: Efficient Filtering of XML Documents for Selective Dissemination of Information

Mehmet Altinel, Michael J. Franklin, VLDB 2000

Speaker: Eric Lo

Introduction

The increasing volume of data available in electronic form and the proliferation of the Internet have accelerated the development of SDI (Selective Dissemination of Information).

The goal of selective dissemination of information is to avoid sending users/subscribers unnecessary information.

SDI applications:
- collect new data in a timely manner, such as stock quotes, traffic news, sports tickers and music
- filter the data against subscriber profiles
- deliver relevant data to interested subscribers

Introduction

Current SDI systems:
- are based on simple keyword matching and typical IR techniques
- e.g. a subscriber profile with the keyword “NBA” matches every news item in which the keyword “NBA” appears

HOWEVER, they still suffer from typical problems:
- Subscribers also receive irrelevant information, such as news with the headline “Bill Gates loves to watch NBA”
- Current systems put much effort into improving effectiveness, but they neglect EFFICIENCY!

Introduction

One use of XML is as a standard information exchange mechanism.

XML allows structural information to be encoded within documents, which makes it possible to create more focused and accurate profiles of user interests.

“XFilter”, proposed in this paper, addresses the concerns mentioned above.

XML-based SDI Architecture

Subscribers specify their profiles through a GUI.

The underlying profile language is XPath, e.g. /sports/nba//news


XFilter Architecture

4 major components:
1. Event-based parser for XML documents
2. XPath parser for user profiles
3. Filter engine, which matches profiles against XML documents
4. Dissemination engine, which delivers the filtered data

In general, how does the system work?

New_incoming_document.xml:
<sports> <nba> <chicago>…</chicago> </nba></sports>

Profiles of 3 subscribers:
Q1: /sports/nba//news      [Q1-1] [Q1-2] [Q1-3]
Q2: //nba/*/news           [Q2-1] [Q2-2]
Q3: /stocks/quotes/PCCW    [Q3-1] [Q3-2] [Q3-3]

Query Index:
Element   Candidate List   Wait List
sports    Q1-1
nba       Q2-1             Q1-2
news                       Q1-3, Q2-2
stocks    Q3-1
quotes                     Q3-2
PCCW                       Q3-3

Labels also shown on the slide: Q1-1, Q1-2

Filter Engine of XFilter

XFilter converts each XPath query into a Finite State Machine (FSM).

A subscriber's XPath query (profile) MATCHES an XML document WHEN the FSM of the query reaches its final state.

A Query Index is built over the states of the XPath query FSMs.

Inside Filter Engine

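Path Nodes

- The XPath parser decomposes each XPath query into a set of path nodes
- Only element nodes become path nodes (attributes do not); each path node acts as a state of the FSM
- Wildcard (*) steps are ignored, i.e. they do not produce path nodes
- e.g. /sports/nba//news gives the path nodes: sports, nba, news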

Path Node Information

- Query ID
- Position: the sequence number of the path node within its query
- Relative Position:
  = 0 for the 1st node if it is not preceded by “//”
  = -1 for any node preceded by “//”
  else = 1 + (number of “*” nodes between the node and its predecessor node)
- Level:
  if the node is the 1st node and has an absolute distance from the root, then level = 1 + distance from root
  if Rel. Pos. is -1, then level is also -1
  else level = 0

Q1 = /sports/nba//news

                    Q1-1   Q1-2   Q1-3
Query ID            Q1     Q1     Q1
Position            1      2      3
Relative Position   0      1      -1
Level               1      0      -1

Q2 = //nba/*/news/Bulls

                    Q2-1   Q2-2   Q2-3
Query ID            Q2     Q2     Q2
Position            1      2      3
Relative Position   -1     2      1
Level               -1     0      0
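The rules for Relative Position and Level can be sketched in a few lines of Python. This is only an illustration of the rules on the slide, not the XFilter implementation: the names `PathNode` and `decompose` are hypothetical, predicates and attributes are not handled, and "preceded by //" is taken to mean that a "//" appears anywhere between a node and its predecessor.

```python
import re
from dataclasses import dataclass


@dataclass
class PathNode:
    query_id: str
    name: str
    position: int   # 1-based order of the node within its query
    rel_pos: int    # Relative Position, -1 means "after //"
    level: int      # expected document level, -1 unrestricted, 0 not yet known


def decompose(query_id, xpath):
    """Split an XPath query (elements, '*', '/', '//' only) into path nodes."""
    tokens = re.findall(r"//|/|[^/]+", xpath)
    nodes = []
    wildcards = 0        # '*' steps seen since the previous named node
    descendant = False   # '//' seen since the previous named node
    for tok in tokens:
        if tok == "//":
            descendant = True
        elif tok == "/":
            continue
        elif tok == "*":
            wildcards += 1          # wildcards produce no path node
        else:
            if descendant:
                rel_pos, level = -1, -1
            elif not nodes:         # first node at a fixed distance from root
                rel_pos, level = 0, 1 + wildcards
            else:
                rel_pos, level = 1 + wildcards, 0
            nodes.append(PathNode(query_id, tok, len(nodes) + 1, rel_pos, level))
            wildcards, descendant = 0, False
    return nodes


for node in decompose("Q1", "/sports/nba//news"):
    print(node)
for node in decompose("Q2", "//nba/*/news/Bulls"):
    print(node)
```

For Q1 and Q2 this reproduces the Relative Position and Level values listed for the path nodes Q1-1 to Q1-3 and Q2-1 to Q2-3.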

Query Index

- All path nodes are added to the Query Index (a hash table keyed on element names)
- Each unique element name is associated with two lists: a Candidate List (CL) and a Wait List (WL)
- The current node of each query is placed on the CL; its other nodes are kept on the WL
- The FSM moves to its next state when a path node is promoted from the WL to the CL

Query Index for Q1, Q2 and Q3:
Element   Candidate List   Wait List
sports    Q1-1
nba       Q2-1             Q1-2
news                       Q1-3, Q2-2
stocks    Q3-1
quotes                     Q3-2
PCCW                       Q3-3
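As a rough sketch of the Query Index, the hash table can be modelled as a Python dictionary from element name to a pair of lists. The class name `QueryIndex` and the `"Q1-2"` style string labels are assumptions made for illustration; the initial placement follows the rule that the first path node of each query starts on the CL and the rest on the WL.

```python
from collections import defaultdict


class QueryIndex:
    """Hash table: element name -> Candidate List (CL) and Wait List (WL)."""

    def __init__(self):
        self.table = defaultdict(lambda: {"CL": [], "WL": []})

    def add_query(self, query_id, element_names):
        # element_names are the path nodes of the query, wildcards already dropped
        for position, name in enumerate(element_names, start=1):
            which = "CL" if position == 1 else "WL"
            self.table[name][which].append(f"{query_id}-{position}")


index = QueryIndex()
index.add_query("Q1", ["sports", "nba", "news"])      # /sports/nba//news
index.add_query("Q2", ["nba", "news"])                # //nba/*/news
index.add_query("Q3", ["stocks", "quotes", "PCCW"])   # /stocks/quotes/PCCW

for name, lists in index.table.items():
    print(name, lists)
```

With these three profiles, the entry for "nba" holds Q2-1 on its CL and Q1-2 on its WL, and the entry for "news" holds Q1-3 and Q2-2 on its WL.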

XML Parsing and Filtering

When an XML document arrives, it is run through the SAX XML parser (event-driven), which checks the Query Index whenever it encounters:
- a begin element tag
- an end element tag
- data internal to an element

Input XML:
<?xml version="1.0"?>
<sports><news><ball_games><nba>Michael Jordan … </nba></ball_games></news></sports>

SAX API events:
Start document
Start element: sports
Start element: news
Start element: ball_games
Start element: nba
Characters: Michael Jordan
End element: nba
…
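The event stream can be reproduced with Python's standard `xml.sax` module. This is a small sketch, not part of XFilter: the handler class `EventLogger` and the function `sax_events` are hypothetical names, and the element is written as `ball_games` because XML element names cannot contain spaces. The handler also tracks the element level, which the filter engine needs for its level check.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX events as readable strings and tracks the element level."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0

    def startDocument(self):
        self.events.append("Start document")

    def startElement(self, name, attrs):
        self.depth += 1
        self.events.append(f"Start element: {name} (level {self.depth})")

    def characters(self, content):
        if content.strip():            # skip whitespace-only text
            self.events.append(f"Characters: {content.strip()}")

    def endElement(self, name):
        self.events.append(f"End element: {name}")
        self.depth -= 1


def sax_events(xml_text):
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


DOC = ("<sports><news><ball_games><nba>Michael Jordan</nba>"
       "</ball_games></news></sports>")
for event in sax_events(DOC):
    print(event)
```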

XML Parsing and Filtering (cont)

Start_Element_Handler(element_name, element_level, attribute_names, attribute_values) {
    Look up element_name in the Query Index, examine all path nodes in its CL,
    and perform the LEVEL CHECK and the ATTRIBUTE FILTER CHECK on each of them
}

Example path node Q1-1: Query ID = Q1, Position = 1, Relative Position = 0, Level = 1

Level Check and Attribute Check

The level check ensures that the level at which the element appears in the document matches the level expected by the user query.

Recall:
- if the level of a path node is -1, its relative position is -1, i.e. a “//” precedes this node, so its level is unrestricted
- otherwise the level of the path node must equal the level of the input element

The attribute filter check applies any simple predicates that reference the attributes of the element.
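Both checks are simple enough to write down directly. The sketch below is an assumption-laden illustration: the function names are hypothetical, and "simple predicates" are modelled only as attribute-equals-value tests, which may be narrower than what XFilter supports.

```python
def level_check(node_level, element_level):
    """A path node with level -1 (after '//') accepts any document level."""
    return node_level == -1 or node_level == element_level


def attribute_filter_check(predicates, attributes):
    """predicates and attributes both map attribute name -> value.

    Assumption: every predicate is an equality test on one attribute.
    """
    return all(attributes.get(name) == value for name, value in predicates.items())


# Q1-1 (sports, level 1) against <sports> seen at level 1, then at level 2
print(level_check(1, 1), level_check(1, 2))
# Q1-3 (news, level -1) is unrestricted
print(level_check(-1, 7))
# hypothetical predicate [@team="Bulls"] against an element's attributes
print(attribute_filter_check({"team": "Bulls"}, {"team": "Bulls", "year": "2000"}))
```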

Level Check and Attribute Check

If both the level check and the attribute check succeed, the node passes.
- If the node is the final path node (final state) of the query (e.g. Q1-3), the document matches the query.
- If the node is not the final path node, the query moves to its next state.

The state move is done by copying the next node of the query from the WL to the CL and updating its relative position and level accordingly.

End element handler and character handler

When the SAX parser encounters an end element, the path node of that element is deleted from the CL.

When the SAX parser encounters element data, the handler works like the start element handler, except that it performs a content check rather than an attribute check.
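Putting the start and end element handlers together, a much simplified filter loop might look as follows. This is a sketch under several assumptions rather than the XFilter code: path nodes are given by hand as `(name, relative position, level)` tuples, attribute and content checks are omitted, the Wait List is implicit in each query's node list, `xml.etree.ElementTree.iterparse` stands in for the SAX parser, and the end element handler is modelled as undoing the promotions made by the matching start element.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def filter_document(xml_text, queries):
    """queries: {query_id: [(name, rel_pos, level), ...]}; returns matched ids."""
    cl = defaultdict(list)   # element name -> [(query_id, node index, level)]
    for qid, nodes in queries.items():
        name, _, level = nodes[0]
        cl[name].append((qid, 0, level))        # first node starts on the CL
    matched, undo, depth = set(), [], 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            promoted = []
            for qid, idx, level in list(cl[elem.tag]):
                if level != -1 and level != depth:
                    continue                    # level check failed
                nodes = queries[qid]
                if idx == len(nodes) - 1:
                    matched.add(qid)            # final state reached
                    continue
                next_name, rel_pos, _ = nodes[idx + 1]
                next_level = -1 if rel_pos == -1 else depth + rel_pos
                entry = (qid, idx + 1, next_level)
                cl[next_name].append(entry)     # promote next node to the CL
                promoted.append((next_name, entry))
            undo.append(promoted)
        else:
            depth -= 1
            for name, entry in undo.pop():      # restore the CL on end element
                cl[name].remove(entry)
    return matched


QUERIES = {
    "Q1": [("sports", 0, 1), ("nba", 1, 0), ("news", -1, -1)],    # /sports/nba//news
    "Q2": [("nba", -1, -1), ("news", 2, 0)],                      # //nba/*/news
    "Q3": [("stocks", 0, 1), ("quotes", 1, 0), ("PCCW", 1, 0)],   # /stocks/quotes/PCCW
}
DOC = "<sports><nba><chicago><news>Bulls win</news></chicago></nba></sports>"
print(sorted(filter_document(DOC, QUERIES)))
```

In this sketch the level of a promoted node is computed as the current element level plus the node's relative position, which is one plausible reading of "updating its relative position and level".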

List Balancing

Recall: the first path node of each XPath query is placed on the CL and the remaining path nodes are placed on the WL.

- This is inefficient in many situations, because the 1st element usually has poor selectivity
- Some CLs become long while others stay short, so the lists are not balanced (e.g. the CL of element “news” is usually much longer than the CL of element “NBA”)

List Balancing

List balancing introduces a “pivot” node.

When a new query is added to the index, the element node of the query whose index entry has the shortest CL is chosen as the pivot and placed on the CL (instead of the 1st node).

E.g. when a new subscriber adds /sports/worldcup//news, if the CL of “worldcup” is shorter than the CLs of “sports” and “news”, then “worldcup” is the pivot and is added to the CL.

The prefix “sports” then becomes a precondition, which is held on a stack; filtering of the query stops if the precondition for the node fails.
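Pivot selection can be sketched as picking the element with the shortest Candidate List. The function name `choose_pivot` and the CL lengths below are made up for illustration; ties are resolved in favour of the earliest node, which is an assumption.

```python
def choose_pivot(element_names, cl_lengths):
    """Return (pivot, prefix, suffix) for a new query.

    element_names: path node names of the query, in order.
    cl_lengths: current Candidate List length per element name (0 if absent).
    """
    pivot_index = min(
        range(len(element_names)),
        key=lambda i: cl_lengths.get(element_names[i], 0),
    )
    pivot = element_names[pivot_index]
    prefix = element_names[:pivot_index]       # becomes the precondition stack
    suffix = element_names[pivot_index + 1:]   # stays on the Wait Lists
    return pivot, prefix, suffix


# hypothetical CL lengths at the time /sports/worldcup//news is added
lengths = {"sports": 120, "worldcup": 3, "news": 450}
print(choose_pivot(["sports", "worldcup", "news"], lengths))
```

Here "worldcup" becomes the pivot, "sports" is the prefix kept as a precondition, and "news" remains a later state of the query.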

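List Balancing

Q3 = /*/sports/news//bulls

Path nodes before list balancing:
                    Q3-1   Q3-2   Q3-3
Query ID            Q3     Q3     Q3
Position            1      2      3
Relative Position   0      1      -1
Level               1      0      -1

Path nodes after list balancing (assume the element “news” has the shortest CL among the 3 elements):
                    Q3-1   Q3-2
Query ID            Q3     Q3
Position            1      2
Relative Position   0      -1
Level               1      -1

Stack: “sports”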

Prefiltering

Prefiltering eliminates from consideration any query that contains an element name not present in the input document, in order to avoid unnecessary work.

It is done before the order and filter checking (thus every incoming XML document is parsed twice).

Prefiltering

- A “key” element is chosen for each query when the query is initially parsed; the key is chosen in the same way as the pivot in List Balancing
- When a document arrives, a hash table (called the occurrence table) is constructed, containing entries of the form <element name, QueryID1, …, QueryIDn>
- The queries referenced by the table are checked to see whether all of their element names exist in the document; only the successful queries go further

Prefiltering

Keys (shown in blue on the slide): per the occurrence table, nba for Q1 and Bulls for Q3 and Q4.
Q1: /sports/nba//news/scores
Q2: /sports/NHL//news
Q3: /sports/nba/Bulls//news
Q4: /sports//Bulls/ranking

Sports18012002.xml:
<sports>
<nba>
 <Lakers> <news>O’ Neal…</news> </Lakers>
 <Bulls> <news>Bulls beat Lakers</news> </Bulls>
</nba>
</sports>

Occurrence Table:
Element   Queries
sports
nba       Q1
Lakers
news
Bulls     Q3, Q4

Do all elements in the queries exist in the document? Only Q3.
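The prefiltering example can be traced with a short Python sketch. It is illustrative only: the function name `prefilter` is hypothetical, the key of Q2 is assumed to be NHL (the slide does not show it in the occurrence table), and the first pass over the document is done with `xml.etree.ElementTree` rather than a SAX parser.

```python
import xml.etree.ElementTree as ET


def prefilter(xml_text, queries, keys):
    """queries: {query_id: [element names]}, keys: {query_id: key element}.

    Returns (occurrence_table, surviving_query_ids).
    """
    occurrence = {}
    for elem in ET.fromstring(xml_text).iter():   # first pass over the document
        occurrence.setdefault(elem.tag, [])
    for qid, key in keys.items():
        if key in occurrence:                     # key element is present
            occurrence[key].append(qid)
    present = set(occurrence)
    survivors = [
        qid
        for qids in occurrence.values()
        for qid in qids
        if set(queries[qid]) <= present           # all element names present?
    ]
    return occurrence, survivors


QUERIES = {
    "Q1": ["sports", "nba", "news", "scores"],
    "Q2": ["sports", "NHL", "news"],
    "Q3": ["sports", "nba", "Bulls", "news"],
    "Q4": ["sports", "Bulls", "ranking"],
}
KEYS = {"Q1": "nba", "Q2": "NHL", "Q3": "Bulls", "Q4": "Bulls"}  # Q2's key is assumed
DOC = """<sports><nba>
  <Lakers><news>O'Neal ...</news></Lakers>
  <Bulls><news>Bulls beat Lakers</news></Bulls>
</nba></sports>"""

table, survivors = prefilter(DOC, QUERIES, KEYS)
print(table)
print(survivors)
```

Q2 never enters the table because its assumed key is absent, Q1 and Q4 are dropped because "scores" and "ranking" do not occur in the document, and only Q3 goes on to the filter engine.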

Performance evaluation

The performance is evaluated by varying:
- the number of subscriber profiles
- the depth of subscriber queries and incoming XML documents
- the probability of wildcards
- filter placement and selectivity

List Balancing with Prefiltering has the best performance.

Related Work

- Enhance XFilter by considering not only elements but also attributes
- Enhance XFilter by reordering the input profiles (XPath queries of subscribers) when building the index, so as to obtain better-balanced Candidate Lists

Refer to “Indexing Attributes and Reordering Profiles for XML Document Filtering and Information Delivery” by Wang Lian, David Cheung and S.M. Yiu, WAIM 2001
