Enhanced topic distillation using text, markup tags, and hyperlinks

Soumen Chakrabarti, Mukul Joshi, Vivek Tawde

www.cse.iitb.ac.in/~soumen

IIT Bombay 2

Topic distillation

Given a query or some example URLs

Collect a relevant subgraph (community) of the Web

Bipartite reinforcement between hubs and authorities

Prototypes:
• HITS and Clever
• Bharat and Henzinger

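[Figure: a keyword query goes to a search engine, which returns the root set; the root set is grown into the expanded set. Nodes $u_1, u_2, u_3$ link to $v$, and node $u$ links to $v_1, v_2, v_3$.]

$a(v) = h(u_1) + h(u_2) + h(u_3)$

$h(u) = a(v_1) + a(v_2) + a(v_3)$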
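The two update rules above can be iterated until the scores settle. The following is a minimal sketch of that reinforcement loop, not the authors' implementation: the function names, the sum-to-one normalization, the fixed iteration count, and the toy link list are all assumptions made for illustration.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()} if total else dict(scores)


def hits(edges, iterations=50):
    """Iterate the hub/authority updates over a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all links u -> v
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        auth = normalize(auth)
        # h(u) = sum of a(v) over all links u -> v
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        hub = normalize(hub)
    return hub, auth


# Toy graph (hypothetical): three pages cite v, and u1 also cites w.
links = [("u1", "v"), ("u2", "v"), ("u3", "v"), ("u1", "w")]
hub, auth = hits(links)
```

In this toy graph `v` ends up with a higher authority score than `w`, and `u1` with a higher hub score than `u2`, because `u1` points to both authorities.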

Challenges and limitations

Web authoring style in flux since 1996
• Complex pages generated from templates
• File or page boundary less meaningful
• “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges

Models are too simplistic
• Hub and authority symmetry is illusory
• Coarse-grain hub model ‘leaks’ authority
• Ad-hoc linear segmentation not content-aware

Deteriorating results of topic distillation

Clique attacks!

Relevant regions that lead to inclusion of page in base set

Irrelevant links form pseudo-community

Benign drift and generalization

This section specializes in ‘Shakespeare’

Remaining sections generalize and/or drift

A new fine-grained model

    <html>…<body>…
    <table …>
      <tr><td>
        <table …>
          <tr><td><a href="http://art.qaz.com">art</a></td></tr>
          <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
        </table>
      </td></tr>
      <tr><td>
        <ul>
          <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
          <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
          …
        </ul>…
      </td></tr>
    </table>…
    </body></html>

[Figure: Document Object Model (DOM) tree of the page above, running from html through head and body, table, tr and td down to a nested table and a ul, whose a leaves point to http://art.qaz.com/, http://ski.qaz.com/, http://www.fromages.com/ and http://www.teddingtoncheese.co.uk/. Annotations: relevant subtree, irrelevant subtree, frontier of differentiation.]
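To make the fine-grained view concrete, each hyperlink can be recorded together with its path in the DOM tree, so that links under different subtrees can later be scored separately. This is only a sketch using Python's standard `html.parser`; the class name `LinkPaths`, the slash-joined path format, and the simplified, fully closed sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkPaths(HTMLParser):
    """Collect (DOM path, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (path, href) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


# Simplified, well-formed version of the page on the slide.
page = (
    "<html><body><table>"
    '<tr><td><table><tr><td><a href="http://art.qaz.com">art</a></td></tr></table></td></tr>'
    '<tr><td><ul><li><a href="http://www.fromages.com">Fromages.com</a></li></ul></td></tr>'
    "</table></body></html>"
)
parser = LinkPaths()
parser.feed(page)
```

The two links share the prefix `html/body/table/tr/td` and diverge below it (`table/...` versus `ul/...`), which is the kind of boundary a frontier can exploit.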

Generative model for hub text

Global hub text distribution $\Theta_0$ relevant to given query

Authors use internal DOM nodes to specialize $\Theta_0$ into $\Theta_I$

At a certain frontier in the DOM tree, local distribution directly generates text in ‘hot’ and ‘cold’ subtrees

[Figure labels: global term distribution $\Theta_0$; progressive ‘distortion’; model frontier; $\Theta_I$; other pages.]

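A balanced cost measure

[Figure: a path in the DOM tree passing through node $u$ down to node $v$, with data $D_v$ in the subtree below $v$.]

Reference distribution $\Theta_0$

Cumulative distortion cost $= \mathrm{KL}(\Theta_0; \Theta_u) + \dots + \mathrm{KL}(\Theta_u; \Theta_v)$

Data encoding cost is roughly $-\sum_{d \in D_v} \log \Pr(d \mid \Theta_v)$

$\mathrm{KL}(\theta_1; \theta_2) = \log\frac{\theta_1}{\theta_2} + \frac{\theta_2}{\theta_1} - 1$ (for exponential distributions with rates $\theta_1, \theta_2$)

Goal: find the minimum-cost frontier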
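The closed form for the exponential case can be checked by direct integration. The short derivation below assumes the rate parameterization $p(x \mid \theta) = \theta e^{-\theta x}$ for $x \ge 0$; the slide does not say which parameterization is intended, so this is a sketch under that assumption.

```latex
\begin{align*}
\mathrm{KL}(\theta_1;\theta_2)
  &= \int_0^\infty \theta_1 e^{-\theta_1 x}
     \log\frac{\theta_1 e^{-\theta_1 x}}{\theta_2 e^{-\theta_2 x}}\,dx \\
  &= \log\frac{\theta_1}{\theta_2}
     + (\theta_2 - \theta_1)\,\mathbb{E}_{\theta_1}[x] \\
  &= \log\frac{\theta_1}{\theta_2} + \frac{\theta_2}{\theta_1} - 1,
  \qquad\text{since } \mathbb{E}_{\theta_1}[x] = \frac{1}{\theta_1}.
\end{align*}
```

The result is zero when $\theta_1 = \theta_2$ and grows as the two rates move apart, which is what makes it usable as a distortion cost.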

Marking ‘hot’ subtrees

Hard to solve exactly (knapsack)

$(1+\epsilon)$ dynamic programming solution

Too slow for 10 million DOM nodes

Greedy expansion approach: at each node $v$, compare the cost of
• Directly encoding $D_v$ w.r.t. model $\Theta_v$ at $v$
• First distorting $\Theta_v$ to $\Theta_w$ for each child $w$ of $v$, then encoding all $D_w$ w.r.t. respective $\Theta_w$

If the latter is smaller, expand $v$; else prune

Mark relevant subtrees as “must-prune”
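The greedy comparison can be sketched in a few lines. This is an illustrative reading of the slide rather than the authors' code: it assumes the data at each node are positive scores modelled by an exponential distribution with a maximum-likelihood rate, and the `Node` class, the helper names, and the toy trees are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    scores: list = field(default_factory=list)    # data stored at a leaf
    children: list = field(default_factory=list)
    must_prune: bool = False


def data(v):
    """All scores in the subtree rooted at v (the set D_v)."""
    if not v.children:
        return list(v.scores)
    return [d for w in v.children for d in data(w)]


def rate(ds):
    """Maximum-likelihood rate of an exponential fitted to positive scores."""
    return len(ds) / sum(ds)


def encode_cost(ds, theta):
    """-sum log Pr(d | theta) for the density theta * exp(-theta * d)."""
    return -sum(math.log(theta) - theta * d for d in ds)


def kl(t1, t2):
    """KL divergence between exponentials with rates t1 and t2."""
    return math.log(t1 / t2) + t2 / t1 - 1.0


def frontier(v):
    """Greedy top-down search: expand v only if that lowers the total cost."""
    if v.must_prune or not v.children:
        return [v]
    theta_v = rate(data(v))
    keep = encode_cost(data(v), theta_v)
    expand = 0.0
    for w in v.children:
        d_w = data(w)
        theta_w = rate(d_w)
        expand += kl(theta_v, theta_w) + encode_cost(d_w, theta_w)
    if expand < keep:
        return [f for w in v.children for f in frontier(w)]
    return [v]


# Plenty of evidence that the two subtrees differ: the root is expanded.
big = Node("root", children=[Node("hot", scores=[1.0] * 20),
                             Node("cold", scores=[0.01] * 20)])
# Same contrast but only three scores each: distortion cost wins, so prune.
small = Node("root", children=[Node("a", scores=[1.0, 1.1, 0.9]),
                               Node("b", scores=[0.01, 0.012, 0.009])])
```

The two toy trees show the balance the cost measure is meant to strike: splitting is only worthwhile when the saving in data encoding cost outweighs the distortion cost of the extra models.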

Exploiting co-citation in our model

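[Figure: a hub DOM tree shown in four steps, with leaf links pointing at target pages. Some targets are ‘known’ authorities; there is reason to believe the pages cited alongside them could be good too.]

Step 1: initial values of leaf hub scores = target auth scores (0.10, 0.20, 0.01, 0.06, 0.05, 0.13)

Step 2: must-prune nodes are marked

Step 3: frontier microhubs accumulate scores (0.10, 0.20, 0.12, 0.13)

Step 4: aggregate hub scores are copied back to leaves (0.10, 0.20, 0.12, 0.12, 0.12, 0.13)

Non-linear transform, unlike HITS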
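The accumulate-and-copy-back step can be illustrated with the numbers from the figure. In this sketch the grouping of the six leaves into four frontier microhubs is inferred from the scores shown (0.01 + 0.06 + 0.05 = 0.12), and the function name is an assumption.

```python
def smooth_hub_scores(groups):
    """For each frontier microhub, sum its leaf scores and copy the sum to every leaf.

    groups: one list of leaf scores per frontier microhub, in document order.
    """
    smoothed = []
    for leaves in groups:
        total = sum(leaves)                      # microhub accumulates scores
        smoothed.extend([total] * len(leaves))   # aggregate copied back to leaves
    return smoothed


# Leaf scores from the figure, grouped by (assumed) frontier microhub.
groups = [[0.10], [0.20], [0.01, 0.06, 0.05], [0.13]]
result = [round(s, 2) for s in smooth_hub_scores(groups)]
```

After smoothing, the two leaves that started at 0.06 and 0.05 carry the same score as their sibling, which is how co-citation within a microhub lifts pages that were not yet known authorities.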

Complete algorithm

Collect root set and base set

Pre-segment using text and mark relevant micro-hubs to be pruned

Set authority scores to 1 for root-set pages only

Iterate
• Transfer from authority to hub leaves
• Re-segment hub DOM trees using link + text
• Smooth and redistribute hub scores
• Transfer from hub leaves to authority roots

Report top authority and ‘hot’ microhubs

Experimental setup

Large data sets
• 28 queries from Clever, >20 topics from Dmoz
• Collect 2000…10000 pages per query/topic
• Several million DOM nodes and fine links

Find top authorities using various algorithms

For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space

For Dmoz, use an automatic classifier…
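The cosine measure for ad-hoc queries can be sketched as follows. The slide does not state the term weighting, so raw term frequencies over whitespace-split tokens are an assumption here, as are all function names and the toy texts.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector (assumed weighting) for a piece of text."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Component-wise mean of a list of term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_similarity(authority_texts, root_texts):
    """Average cosine similarity of the authorities to the root-set centroid."""
    center = centroid([term_vector(t) for t in root_texts])
    sims = [cosine(term_vector(t), center) for t in authority_texts]
    return sum(sims) / len(sims)


# Hypothetical page texts.
roots = ["cycling bike race", "bike tour cycling"]
on_topic = mean_similarity(["cycling bike"], roots)
off_topic = mean_similarity(["software download"], roots)
```

An authority list that has drifted off topic shares few terms with the root set, so its average similarity falls towards zero.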

Avoiding topic drift via micro-hubs

[Figure: two plots of #Prune and #Expand against iteration (0 to 10), one panel for Query 5 (y-axis 0 to 4000) and one for Query 1 (y-axis 0 to 1200).]

Query: cycling. No danger of topic drift

Query: affirmative action. Topic drift from software sites

Results for the Clever benchmark

Take top 40 authorities

Find average cosine similarity to root set centroid

HITS < DOM+Text < DOM similarity

DOM alone cannot prune well enough: most top authorities come from the root set

HITS drifts often

[Figure: two charts over query IDs 1 to 24 plus their average (x-axis “Qid”). First chart: scaled cosine to root set (y-axis 0 to 8) for HitsSim, DomTextSim and DomSim. Second chart: #RootSet (y-axis 0 to 45) for HitsRoot, DomTextRoot and DomRoot.]

Dmoz experiments and results

223 topics from http://dmoz.org

Sample root set URLs from a class $c$

Top authorities not in root set submitted to Rainbow classifier

$\sum_d \Pr(c \mid d)$ is the expected number of relevant documents

DOM+Text best

[Figure: flow diagram in which a sample of a DMoz class (e.g. Music) forms the root set, the root set is grown into the expanded set, and the top authorities are tested with a Rainbow classifier trained on DMoz.]

[Figure: chart of the sum of root class probabilities (y-axis 0 to 40) for the topics Music, VisualArts, HR and Security, comparing HITS, DomHITS and DomTextHITS.]
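The Dmoz measure follows from linearity of expectation: if each document $d$ is relevant with probability $\Pr(c \mid d)$, the expected count of relevant documents is the sum of those probabilities. The sketch below takes the classifier outputs as a plain dictionary; it does not call Rainbow, and the function name, URLs and probabilities are hypothetical.

```python
def expected_relevant(top_authorities, root_set, prob_in_class):
    """Sum Pr(c | d) over top authorities that are not already in the root set."""
    candidates = [d for d in top_authorities if d not in root_set]
    return sum(prob_in_class[d] for d in candidates)


# Hypothetical classifier outputs Pr(c | d) for three reported authorities.
prob_in_class = {"a.example": 0.75, "b.example": 0.5, "c.example": 1.0}
score = expected_relevant(
    top_authorities=["a.example", "b.example", "c.example"],
    root_set={"c.example"},          # excluded: it was part of the sample
    prob_in_class=prob_in_class,
)
```

Here `score` is 1.25, since only the two authorities outside the root set contribute.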

Anecdotes

“amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.

New algorithm reduces drift

Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi

Mixed hubs in top 50 for 13/28 queries

Conclusion and ongoing work

Hypertext shows complex idioms, missed by coarse-grained graph model

Enhanced fine-grained distillation
• Identifies content-bearing ‘hot’ micro-hubs
• Disaggregates hub scores
• Reduces topic drift via mixed hubs and pseudo-communities

Application: topic-based focused crawling

Need probabilistic combination of evidence from text and links
