Enhanced topic distillation using text, markup tags, and hyperlinks
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen
IIT Bombay
Topic distillation
Given a query or some example URLs
Collect a relevant subgraph (community) of the Web
Bipartite reinforcement between hubs and authorities
Prototypes:
• HITS and Clever
• Bharat and Henzinger
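[Figure: a keyword query is sent to a search engine to obtain the root set, which is then grown into the expanded set.]
[Figure: pages u_1, u_2, u_3 link to v; page u links to v_1, v_2, v_3.]
$a(v) = h(u_1) + h(u_2) + h(u_3)$
$h(u) = a(v_1) + a(v_2) + a(v_3)$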
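The two update rules above can be sketched as a short iteration. This is a minimal illustration of basic HITS only, not the enhanced algorithm of this talk; the function name `hits`, the L1 normalisation and the toy edge list (including the extra page "w") are assumptions made for the example.

```python
from collections import defaultdict


def hits(edges, iterations=50):
    """Run the basic HITS iteration on (hub, authority) link pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all pages u that link to v
        new_auth = defaultdict(float)
        for u, v in edges:
            new_auth[v] += hub[u]
        auth = {n: new_auth[n] for n in nodes}
        # h(u) = sum of a(v) over all pages v that u links to
        new_hub = defaultdict(float)
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = {n: new_hub[n] for n in nodes}
        # Normalise (L1 norm, an arbitrary choice) so scores stay bounded.
        a_total = sum(auth.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        auth = {n: s / a_total for n, s in auth.items()}
        hub = {n: s / h_total for n, s in hub.items()}
    return hub, auth


# Toy graph: u1, u2, u3 all cite v; u1 also cites a hypothetical page w.
edges = [("u1", "v"), ("u2", "v"), ("u3", "v"), ("u1", "w")]
hub, auth = hits(edges)
```

On this toy graph v ends up with a higher authority score than w, and u1 with a higher hub score than u2 or u3, because u1 cites both authorities.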
Challenges and limitations
Web authoring style in flux since 1996
• Complex pages generated from templates
• File or page boundary less meaningful
• “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
Models are too simplistic
• Hub and authority symmetry is illusory
• Coarse-grain hub model ‘leaks’ authority
• Ad-hoc linear segmentation not content-aware
Deteriorating results of topic distillation
Clique attacks!
[Figure callouts:]
• Relevant regions that lead to inclusion of page in base set
• Irrelevant links form pseudo-community
Benign drift and generalization
[Figure callouts:]
• This section specializes on ‘Shakespeare’
• Remaining sections generalize and/or drift
<html>…<body>…
<table …>
<tr><td>
  <table …>
    <tr><td><a href="http://art.qaz.com">art</a></td></tr>
    <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
    …
  </table>
</td></tr>
<tr><td>
  <ul>
    <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
    <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
    …
  </ul>
  …
</td></tr>
</table>
…</body></html>
A new fine-grained model
[Figure: Document Object Model (DOM) tree of the page: html → head, body; body → table → tr → td → table and ul → … → a. The anchors lead to art.qaz.com, ski.qaz.com, www.fromages.com and www.teddingtoncheese.co.uk. Callouts: relevant subtree, irrelevant subtree, frontier of differentiation.]
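To make the fine-grained view concrete, the sketch below parses a cleaned-up copy of the page fragment into a DOM tree and collects the links under each row of the outer table. It is an illustration only: the `Node` and `DomBuilder` classes and the `find_all` helper are hypothetical names, the elided parts ("…") of the slide's HTML are simply dropped, and the parser assumes well-formed input.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None, href=None):
        self.tag, self.parent, self.href = tag, parent, href
        self.children = []

    def links(self):
        """All hrefs found at or below this node, in document order."""
        found = [self.href] if self.href else []
        for child in self.children:
            found.extend(child.links())
        return found


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, dict(attrs).get("href"))
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag has a matching end tag.
        if self.current.parent is not None:
            self.current = self.current.parent


def find_all(node, tag):
    """Yield nodes with the given tag in pre-order."""
    if node.tag == tag:
        yield node
    for child in node.children:
        yield from find_all(child, tag)


PAGE = """
<html><body><table>
<tr><td><table>
  <tr><td><a href="http://art.qaz.com">art</a></td></tr>
  <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
</table></td></tr>
<tr><td><ul>
  <li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
  <li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
</ul></td></tr>
</table></body></html>
"""

builder = DomBuilder()
builder.feed(PAGE)
outer_table = next(find_all(builder.root, "table"))
# One list of link targets per row of the outer table.
row_links = [row.links() for row in outer_table.children]
```

Each row of the outer table is a candidate micro-hub: the first row holds the art and ski links, the second the two cheese links.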
Generative model for hub text
Global hub text distribution $\Theta_0$ relevant to given query
Authors use internal DOM nodes to specialize $\Theta_0$ into $\Theta_I$
At a certain frontier in the DOM tree, local distribution directly generates text in ‘hot’ and ‘cold’ subtrees
[Figure labels: global term distribution $\Theta_0$; progressive ‘distortion’; model frontier; other pages; $\Theta_I$.]
A balanced cost measure
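[Figure: DOM tree path through nodes u and v; $D_v$ is the data under v.]
Reference distribution $\Theta_0$
Cumulative distortion cost = $\mathrm{KL}(\Theta_0; \Theta_u) + \dots + \mathrm{KL}(\Theta_u; \Theta_v)$
Data encoding cost is roughly $-\sum_{d \in D_v} \log \Pr(d \mid \Theta_v)$
$\mathrm{KL}(\theta_1; \theta_2) = \log\frac{\theta_1}{\theta_2} + \frac{\theta_2}{\theta_1} - 1$ (for exponential distributions with rates $\theta_1$, $\theta_2$)
Goal: Find minimum cost frontier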
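The two cost terms can be written down directly. The sketch below assumes each model is an exponential distribution given by its rate (density rate * exp(-rate * x)), which is one reading of the slide; the function names and all numbers are hypothetical.

```python
import math


def kl_exponential(rate1, rate2):
    """KL divergence from Exp(rate1) to Exp(rate2); both rates must be > 0."""
    return math.log(rate1 / rate2) + rate2 / rate1 - 1.0


def encoding_cost(data, rate):
    """Negative log-likelihood (in nats) of the samples under Exp(rate)."""
    return -sum(math.log(rate) - rate * x for x in data)


def path_cost(rates_on_path, data):
    """Cumulative distortion along a root-to-node path of models,
    plus the cost of encoding the data with the last model."""
    distortion = sum(
        kl_exponential(a, b) for a, b in zip(rates_on_path, rates_on_path[1:])
    )
    return distortion + encoding_cost(data, rates_on_path[-1])


# Hypothetical numbers: a reference rate of 20 and three leaf scores under v.
scores = [0.01, 0.06, 0.05]
rate_v = 1.0 / (sum(scores) / len(scores))  # maximum-likelihood rate, 1 / mean
cost = path_cost([20.0, rate_v], scores)
```

Moving the model from the reference rate to `rate_v` adds a positive distortion term, which is the price paid for a model that fits the local data.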
Marking ‘hot’ subtrees
Hard to solve exactly (knapsack)
$(1+\epsilon)$ dynamic programming solution
Too slow for 10 million DOM nodes
Greedy expansion approach: at each node v, compare the cost of
• Directly encoding $D_v$ w.r.t. model $\Theta_v$ at v
• First distorting $\Theta_v$ to $\Theta_w$ for each child w of v, then encoding all $D_w$ w.r.t. respective $\Theta_w$
If latter is smaller expand v, else prune
Mark relevant subtrees as “must-prune”
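The greedy comparison above can be sketched as a recursive function. This is an illustration under stated assumptions rather than the authors' implementation: each node's model is an exponential distribution fitted by maximum likelihood to the leaf hub scores below it, ties go to pruning, and `DomNode`, `fit_rate` and `greedy_frontier` are hypothetical names.

```python
import math


class DomNode:
    def __init__(self, name, children=(), score=None, must_prune=False):
        self.name = name
        self.children = list(children)
        self.score = score            # hub score, set on leaves only
        self.must_prune = must_prune  # pre-marked relevant subtree

    def leaf_scores(self):
        if not self.children:
            return [self.score]
        return [s for c in self.children for s in c.leaf_scores()]


def fit_rate(scores):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean."""
    mean = sum(scores) / len(scores)
    return 1.0 / max(mean, 1e-12)  # guard against an all-zero subtree


def encoding_cost(scores, rate):
    return -sum(math.log(rate) - rate * x for x in scores)


def kl_exponential(rate1, rate2):
    return math.log(rate1 / rate2) + rate2 / rate1 - 1.0


def greedy_frontier(v, tol=1e-9):
    """Return the frontier nodes chosen at or below v."""
    if not v.children or v.must_prune:
        return [v]
    d_v = v.leaf_scores()
    rate_v = fit_rate(d_v)
    prune_cost = encoding_cost(d_v, rate_v)
    expand_cost = 0.0
    for w in v.children:
        d_w = w.leaf_scores()
        rate_w = fit_rate(d_w)
        expand_cost += kl_exponential(rate_v, rate_w) + encoding_cost(d_w, rate_w)
    if expand_cost < prune_cost - tol:  # ties go to pruning
        return [f for w in v.children for f in greedy_frontier(w, tol)]
    return [v]


def leaves(prefix, scores):
    return [DomNode(f"{prefix}{i}", score=s) for i, s in enumerate(scores)]


# Toy hub: subtree A has uniformly high scores, subtree B uniformly low ones.
mixed = DomNode("root", [DomNode("A", leaves("a", [0.10] * 5)),
                         DomNode("B", leaves("b", [0.01] * 5))])
frontier_names = [n.name for n in greedy_frontier(mixed)]
```

For this toy tree the root is expanded and the frontier settles at A and B: splitting the two score levels pays for the distortion, while splitting further inside A or B gains nothing.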
Exploiting co-citation in our model
[Figure: a hub DOM tree shown in four steps. Callouts: ‘Known’ authorities; “Have reason to believe these could be good too”.]
1. Initial values of leaf hub scores = target auth scores (0.10, 0.20, 0.01, 0.06, 0.05, 0.13)
2. Must-prune nodes are marked
3. Frontier microhubs accumulate scores (0.10, 0.20, 0.12, 0.13)
4. Aggregate hub scores are copied back to leaves (0.10, 0.20, 0.12, 0.12, 0.12, 0.13)
Non-linear transform, unlike HITS
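The accumulate-and-copy-back step can be reproduced with the numbers from the figure. In this sketch the grouping of leaves into frontier microhubs is an assumption inferred from those numbers (0.01 + 0.06 + 0.05 = 0.12), and `smooth_hub_scores` is a hypothetical name.

```python
def smooth_hub_scores(leaf_scores, frontier_groups):
    """Sum the leaf hub scores inside each frontier microhub and copy the
    aggregate back to every leaf of that microhub.

    frontier_groups: one list of leaf indices per frontier microhub.
    """
    smoothed = list(leaf_scores)
    for group in frontier_groups:
        total = sum(leaf_scores[i] for i in group)
        for i in group:
            smoothed[i] = total
    return smoothed


# Leaf hub scores start out as the authority scores of the link targets.
leaf_scores = [0.10, 0.20, 0.01, 0.06, 0.05, 0.13]
# Assumed frontier: three singleton microhubs and one spanning leaves 2 to 4.
groups = [[0], [1], [2, 3, 4], [5]]
smoothed = smooth_hub_scores(leaf_scores, groups)
```

Up to floating-point rounding, `smoothed` is 0.10, 0.20, 0.12, 0.12, 0.12, 0.13. The three leaves that share a microhub now carry the same score, which is how co-cited targets benefit from one another.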
Complete algorithm
Collect root set and base set
Pre-segment using text and mark relevant micro-hubs to be pruned
Initialize only root-set authority scores to 1
Iterate
• Transfer from authority to hub leaves
• Re-segment hub DOM trees using link + text
• Smooth and redistribute hub scores
• Transfer from hub leaves to authority roots
Report top authority and ‘hot’ microhubs
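The iteration above can be sketched on a toy graph. This is a deliberately simplified illustration: the micro-hub segmentation is fixed in advance, so the text-based pre-segmentation and the re-segmentation step are omitted, and the function name `distill`, the data layout and the page names are all hypothetical.

```python
def distill(hubs, root_set, iterations=10):
    """hubs maps a hub page to its microhubs; each microhub is a list of the
    authority pages its leaves link to. The segmentation is fixed here."""
    # Only root-set authorities start with score 1.
    auth = {}
    for groups in hubs.values():
        for group in groups:
            for target in group:
                auth[target] = 1.0 if target in root_set else 0.0
    microhub_scores = {}
    for _ in range(iterations):
        new_auth = dict.fromkeys(auth, 0.0)
        for hub, groups in hubs.items():
            for k, group in enumerate(groups):
                # Authority -> hub leaves, then aggregate within the microhub.
                total = sum(auth[t] for t in group)
                microhub_scores[(hub, k)] = total
                # Each leaf carries the smoothed score back to its target.
                for t in group:
                    new_auth[t] += total
        norm = sum(new_auth.values()) or 1.0
        auth = {t: s / norm for t, s in new_auth.items()}
    return auth, microhub_scores


# Hub h1 mixes a cheese microhub with an ads microhub; h2 is on topic.
hubs = {
    "h1": [["cheese1", "cheese2"], ["ads1", "ads2"]],
    "h2": [["cheese2", "cheese3"]],
}
auth, microhub_scores = distill(hubs, root_set={"cheese1"})
```

In this toy run the ads microhub never receives any score, so nothing leaks to ads1 or ads2 even though they share a page with relevant links, while cheese3 gains authority through co-citation with cheese2.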
Experimental setup
Large data sets
• 28 queries from Clever, >20 topics from Dmoz
• Collect 2000…10000 pages per query/topic
• Several million DOM nodes and fine links
Find top authorities using various algos
For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space
For Dmoz, use an automatic classifier…
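The ad-hoc query measure can be illustrated with a few lines of code. The sketch assumes raw term-frequency vectors and whitespace tokenisation, since the slides do not state the term weighting; the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector (an assumed weighting)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Component-wise mean of sparse term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up page texts for a root set and two candidate authorities.
root_set = [tf_vector("cycling bike race tour"), tf_vector("bike cycling gear shop")]
center = centroid(root_set)
on_topic = cosine(tf_vector("mountain bike cycling trails"), center)
off_topic = cosine(tf_vector("software download free trial"), center)
```

An authority that shares vocabulary with the root set scores higher than one that has drifted to another topic, which shares none and scores zero here.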
Avoiding topic drift via micro-hubs
[Figure: two plots of #Prune and #Expand against iteration (0 to 10), one for Query 5 (vertical axis 0 to 4000) and one for Query 1 (vertical axis 0 to 1200).]
Query: cycling. No danger of topic drift
Query: affirmative action. Topic drift from software sites
Results for the Clever benchmark
Take top 40 auths
Find average cosine similarity to root set centroid
HITS < DOM+Text < DOM similarity
DOM alone cannot prune well enough: most top auths from root set
HITS drifts often
[Figure: bar chart of scaled cosine to root set per query (Qid 1 to 24, plus average) for HitsSim, DomTextSim and DomSim.]
[Figure: bar chart of #RootSet per query (Qid 1 to 24, plus average) for HitsRoot, DomTextRoot and DomRoot.]
Dmoz experiments and results
223 topics from http://dmoz.org
Sample root set URLs from a class c
Top authorities not in root set submitted to Rainbow classifier
$\sum_d \Pr(c \mid d)$ is the expected number of relevant documents
DOM+Text best
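The evaluation measure is just a sum of classifier posteriors. The sketch below uses made-up probabilities standing in for classifier output; it does not call Rainbow, and `expected_relevant` is a hypothetical name.

```python
def expected_relevant(posteriors):
    """Expected number of relevant documents: the sum over top authorities d
    of Pr(c | d), the classifier's probability that d belongs to class c."""
    return sum(posteriors)


# Made-up Pr(c | d) values for three top authorities outside the root set.
posteriors = [0.9, 0.8, 0.1]
score = expected_relevant(posteriors)
```

With these hypothetical values the three authorities count as about 1.8 relevant documents in expectation.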
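[Figure: a root set is sampled from a DMoz class (e.g. Music) and grown into the expanded set; the Rainbow classifier is trained on DMoz and tested on the top authorities.]
[Figure: bar chart of the sum of root class probabilities (0 to 40) for Music, VisualArts, HR and Security under HITS, DomHITS and DomTextHITS.]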
Anecdotes
“amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
New algorithm reduces drift
Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
Mixed hubs in top 50 for 13/28 queries
Conclusion and ongoing work
Hypertext shows complex idioms, missed by coarse-grained graph model
Enhanced fine-grained distillation
• Identifies content-bearing ‘hot’ micro-hubs
• Disaggregates hub scores
• Reduces topic drift via mixed hubs and pseudo-communities
Application: topic-based focused crawling
Need probabilistic combination of evidence from text and links