title extraction from bodies of html documents and its application to web page retrieval yunhua hu...
Post on 18-Dec-2015
![Page 1: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/1.jpg)
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
Yunhua Hu¹, Guomao Xin², Ruihua Song, Guoping Hu³, Shuming Shi, Yunbo Cao, and Hang Li
Microsoft Research Asia
¹ Xi’an Jiaotong University
² Peking University
³ University of Science and Technology of China
![Page 2: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/2.jpg)
Outline
- Motivation
- Related work
- Problem description
- Our approach
- Experimental results
- Conclusions
![Page 3: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/3.jpg)
![Page 4: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/4.jpg)
Motivation
- The title of an HTML document should be defined in the title field
- Title fields of HTML documents are not reliable

| Data set | Num. of HTML docs | Empty title fields | Duplicated title fields |
|---|---|---|---|
| TREC | 1,053,111 | 5.8% | 26.9% |
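The two reliability figures above can be measured with a short script. The following is a minimal sketch, not the authors' tooling: it assumes that "duplicated" means the same non-empty title text (compared case-insensitively) appears on more than one page, and the names `title_field` and `title_field_stats` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TitleFieldParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_field(html):
    """Return the whitespace-normalized title field, or '' if there is none."""
    parser = TitleFieldParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_field_stats(pages):
    """Return (fraction with empty title field, fraction with duplicated title field).

    Assumption: a title is 'duplicated' when the same non-empty text,
    ignoring case, occurs on more than one page in the collection.
    """
    titles = [title_field(page) for page in pages]
    if not titles:
        return 0.0, 0.0
    counts = Counter(t.lower() for t in titles if t)
    empty = sum(1 for t in titles if not t)
    duplicated = sum(1 for t in titles if t and counts[t.lower()] > 1)
    return empty / len(titles), duplicated / len(titles)
```

Run over a collection of page sources, `title_field_stats` returns the two fractions in the same order as the last two table columns.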
![Page 5: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/5.jpg)
Can We Extract Title from Body of HTML?
![Page 6: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/6.jpg)
![Page 7: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/7.jpg)
Related Work: Web Information Extraction
- Information type: data record, news article, summary
- Data structure: DOM tree, block
- Approach: rule-based approach vs. machine-learning-based approach
- Domain-specific vs. domain-independent
- Not clear how to extract the title from the body
![Page 8: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/8.jpg)
Related Work: Web Information Retrieval
Title field, anchor text, and URL are useful for web page retrieval
Not clear whether extracted title is useful
![Page 9: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/9.jpg)
![Page 10: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/10.jpg)
Title Extraction Task
- Input: HTML document (web page)
- Output: title(s) from body of HTML document
- Condition: domain independent

Example extracted titles: “National Weather Service Oxnard” and “Los Angeles Marine Weather Statement”
![Page 11: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/11.jpg)
Spec on HTML Title
- Intuitively, the title is the ‘most conspicuous’ part
- Can have 0-2 titles
- Must be in the top region
- Font size, font weight, etc. are noticeable
- Can cross several lines, but usually in the same format
- Cannot be in bullets or lists
- Cannot be expressions like “under construction”, …
- Images are not considered
![Page 12: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/12.jpg)
Examples
![Page 13: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/13.jpg)
![Page 14: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/14.jpg)
Title Extraction Processing
- Title extraction as information extraction
- Using DOM tree
- Leaf node containing ‘text’ as unit (instance)
- Mainly using format information
![Page 15: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/15.jpg)
DOM Tree
HTML document DOM tree
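To make the notion of a unit concrete, the sketch below walks an HTML document with Python's standard `html.parser` and emits one unit per text-bearing leaf inside the body. It is an illustration only: the class name `UnitCollector`, the function `extract_units`, the dictionary layout, and the choice of tags to skip are assumptions, not the authors' implementation.

```python
from html.parser import HTMLParser

# Tags without a closing tag are never pushed onto the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}
# Text inside these elements is not visible page content.
SKIP_TAGS = {"script", "style", "title"}


class UnitCollector(HTMLParser):
    """Collects text leaves of the body together with their tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.units = []   # collected units, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and "body" in self.stack and not SKIP_TAGS & set(self.stack):
            self.units.append({
                "text": text,
                "path": tuple(self.stack),
                "position": len(self.units),  # order from beginning of body
            })


def extract_units(html):
    """Return the list of text units found in the body of an HTML document."""
    collector = UnitCollector()
    collector.feed(html)
    collector.close()
    return collector.units
```

Each unit keeps its tag path, so later steps can derive tag, position, and neighbor information from it.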
![Page 16: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/16.jpg)
General framework for Information Extraction
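- Learning tool: from training data $(x_1 x_2 \cdots x_n,\; y_1 y_2 \cdots y_n)$, learn the model $P(Y_1 \cdots Y_n \mid X_1 \cdots X_n)$
- Extraction tool: for a new input $x_1 \cdots x_n$, output $\arg\max\, P(y_1 \cdots y_n \mid x_1 \cdots x_n)$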
![Page 17: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/17.jpg)
HTML Title Extraction
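- Learning tool: from training data $(x_1 x_2 \cdots x_n,\; y_1 y_2 \cdots y_n)$, train a Perceptron classifier as the model $P(Y_i \mid X_1 \cdots X_n)$
- Extraction tool: for a new input $x_1 \cdots x_m$, output $\arg\max\, P(y_1 \cdots y_m \mid x_1 \cdots x_m)$
- $x$: unit; $Y$: title?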
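A learning tool and an extraction tool built around a perceptron can be sketched in a few lines. The slides do not state which perceptron variant or hyperparameters were used, so the plain perceptron update, the `epochs` default, and the function names below are all assumptions made for illustration.

```python
def unit_score(weights, bias, features):
    """Linear score of one unit; features is a dict of name -> numeric value."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


def train_perceptron(examples, epochs=20):
    """Learning tool: examples is a list of (features, label), label in {+1, -1}.

    Plain perceptron rule: on every mistake, move the weights toward the
    correct label. Returns (weights, bias).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            if label * unit_score(weights, bias, features) <= 0:
                for name, value in features.items():
                    weights[name] = weights.get(name, 0.0) + label * value
                bias += label
    return weights, bias


def extract_titles(units, weights, bias):
    """Extraction tool: keep the units whose score is positive."""
    return [u for u in units if unit_score(weights, bias, u["features"]) > 0]
```

A usage example with two hypothetical features:

```python
training = [
    ({"bold": 1, "font_size": 6}, +1),
    ({"bold": 0, "font_size": 3}, -1),
    ({"bold": 1, "font_size": 7}, +1),
    ({"bold": 0, "font_size": 2}, -1),
]
weights, bias = train_perceptron(training)
page = [
    {"text": "Marine Weather Statement", "features": {"bold": 1, "font_size": 6}},
    {"text": "Contact us", "features": {"bold": 0, "font_size": 3}},
]
titles = [u["text"] for u in extract_titles(page, weights, bias)]
```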
![Page 18: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/18.jpg)
Information Used in Features (1)
Rich format information
- Font size: 1~7 levels
- Font weight: bold face or not
- Font family: Times New Roman, Arial, etc.
- Font style: normal or italic
- Font color: #000000, #FF0000, etc.
- Background color: #FFFFFF, #FF0000, etc.
- Alignment: center, left, right, and justify

Tag information
- H1, H2, …, H6: levels as header
- LI: a listed item
- DIR: a directory list
- A: a link or anchor
- U: an underline
- BR: a line break
- HR: a horizontal rule
- IMG: an image
- Class name: ‘sectionheader’, ‘title’, ‘titling’, ‘header’, etc.
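As an illustration of how such format and tag information might be turned into classifier input, the sketch below maps one unit to a dictionary of numeric features. The unit layout (keys `path`, `font_size`, `bold`, `italic`, `align`, `classes`), the feature names, and the fallback font size of 3 are assumptions; the slides list the kinds of information used, not the exact encoding.

```python
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
TITLE_LIKE_CLASSES = {"sectionheader", "title", "titling", "header"}
ALIGNMENTS = ("center", "left", "right", "justify")


def format_features(unit):
    """Map one text unit (a hypothetical dict) to numeric format and tag features."""
    path = set(unit.get("path", ()))
    classes = {c.lower() for c in unit.get("classes", ())}
    features = {
        # Font size on the 1-7 scale; 3 is an assumed fallback when unknown.
        "font_size": float(unit.get("font_size", 3)),
        "bold": float(unit.get("bold", False)),
        "italic": float(unit.get("italic", False)),
        # Tag information taken from the unit's ancestors.
        "in_header_tag": float(bool(path & HEADER_TAGS)),
        "in_list_item": float("li" in path),
        "in_link": float("a" in path),
        "title_like_class": float(bool(classes & TITLE_LIKE_CLASSES)),
    }
    # One indicator per alignment value.
    for name in ALIGNMENTS:
        features["align_" + name] = float(unit.get("align") == name)
    return features
```

Encoding categorical values such as alignment as separate indicators keeps every feature numeric, which suits a linear classifier.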
![Page 19: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/19.jpg)
Information Used in Features (2)
Position information
- Position from beginning of body
- Width of unit in page

DOM tree information
- Number of sibling nodes in the DOM tree
- Relations with root node, parent node and sibling nodes in terms of font size change, etc.
- Relations with previous leaf node and next leaf node, in terms of font size change, etc.

Linguistic information
- Length of text: number of characters
- Length of real text: number of alphabetic letters
- Negative words: ‘by’, ‘date’, ‘phone’, ‘fax’, ‘email’, ‘author’, etc.
- Positive words: ‘abstract’, ‘introduction’, ‘summary’, ‘overview’, ‘subject’, ‘title’, etc.
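The linguistic information is the simplest to compute. The sketch below uses only the word lists quoted on the slide (which end in "etc.", so the real lists are longer); the function name, the feature names, and the tokenization into lowercase ASCII words are assumptions.

```python
import re

# Only the examples quoted on the slide; the full lists are not given.
NEGATIVE_WORDS = {"by", "date", "phone", "fax", "email", "author"}
POSITIVE_WORDS = {"abstract", "introduction", "summary", "overview", "subject", "title"}


def linguistic_features(text):
    """Return length and word-list features for the text of one unit."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "length": len(text),                              # number of characters
        "real_length": sum(ch.isalpha() for ch in text),  # alphabetic letters only
        "has_negative_word": float(bool(words & NEGATIVE_WORDS)),
        "has_positive_word": float(bool(words & POSITIVE_WORDS)),
    }
```

For a unit such as "Fax: 555-0100" the real-text length is much smaller than the character length and a negative word is present, both of which argue against the unit being a title.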
![Page 20: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/20.jpg)
Use of Extracted Title in Web Page Retrieval
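- Employing the BM25 framework
- BasicField: texts in body and title are used
- BasicField+Title
- BasicField+ExtTitle
- BasicField+CombTitle
- Each combined method interpolates two BM25 scores, $S_{\text{BasicField}}$ and the score of the added title field ($S_{\text{Title}}$, $S_{\text{ExtTitle}}$, or $S_{\text{CombTitle}}$), using the weights $\alpha$ and $(1-\alpha)$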
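The combination of a BasicField score with a title score can be sketched as a linear interpolation controlled by alpha. Which of the two scores is weighted by alpha and which by (1 - alpha) cannot be read from the slide text, so the assignment below (alpha on the BasicField score) is an assumption, as are the function names; the BM25 scores themselves are taken as given inputs.

```python
def combine_scores(basic_score, title_score, alpha):
    """Interpolate two field scores.

    Assumption: alpha weights the BasicField score and (1 - alpha) the
    title score; the slide may use the opposite assignment.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * basic_score + (1.0 - alpha) * title_score


def rank_documents(scores, alpha):
    """scores maps doc id -> (basic_score, title_score); returns ids, best first."""
    return sorted(scores,
                  key=lambda doc: combine_scores(*scores[doc], alpha),
                  reverse=True)
```

Sweeping alpha from 0 to 1 and re-ranking at each value is what produces a retrieval curve over alpha.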
![Page 21: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/21.jpg)
![Page 22: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/22.jpg)
Data for Title Extraction Experiments
| Name | Num. of HTML docs | Title labeled | Docs having titles |
|---|---|---|---|
| TREC | about 1 million | 4,258 | 78.3% |
| MS | about 1 million | 4,137 | 63.8% |
![Page 23: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/23.jpg)
Title Extraction Results (TREC, Cross-Validation)
| Approach | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| Largest font (baseline) | 0.528 | 0.643 | 0.580 | 0.523 |
| First unit | 0.327 (-38.1%) | 0.402 (-37.5%) | 0.360 (-37.8%) | 0.327 (-37.5%) |
| Title-field | 0.270 (-48.8%) | 0.324 (-49.6%) | 0.295 (-49.1%) | 0.261 (-50.0%) |
| Perceptron | 0.698 (+32.3%) | 0.703 (+9.3%) | 0.701 (+20.9%) | 0.698 (+33.5%) |
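The F1-score column and the percentages in parentheses follow from two standard formulas: F1 is the harmonic mean of precision and recall, and each percentage is the change relative to the largest-font baseline. The helper names below are hypothetical. Because the table reports rounded values, recomputing from them can differ from the printed figures in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_change(value, baseline):
    """Change of value relative to baseline, as a fraction (0.093 means +9.3%)."""
    return (value - baseline) / baseline


# Baseline row: precision 0.528 and recall 0.643 give the reported F1 of 0.580.
baseline_f1 = round(f1_score(0.528, 0.643), 3)

# Perceptron recall 0.703 against baseline recall 0.643.
recall_gain_percent = round(100 * relative_change(0.703, 0.643), 1)
```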
![Page 24: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/24.jpg)
Title Extraction Results (MS, Cross-Validation)

| Approach | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| Largest font (baseline) | 0.584 | 0.840 | 0.689 | 0.582 |
| First unit | 0.606 (+3.7%) | 0.875 (+4.1%) | 0.716 (+3.9%) | 0.606 (+4.1%) |
| Title-field | 0.656 (+12.3%) | 0.834 (-0.7%) | 0.735 (+6.6%) | 0.673 (+15.6%) |
| Perceptron | 0.910 (+55.7%) | 0.919 (+9.4%) | 0.914 (+32.6%) | 0.909 (+56.1%) |
![Page 25: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/25.jpg)
Title Extraction: Feature Contribution

F1-score for each feature subset (bar chart, axis from 0.00 to 1.00):

| Feature subset | TREC | CAMS |
|---|---|---|
| App_FontStyle | 0% | 0% |
| App_Background | 0% | 1% |
| App_Color | 0% | 3% |
| App_Alignment | 0% | 9% |
| App_FontFamily | 0% | 31% |
| App_FontWeight | 0% | 31% |
| Con | 0% | 69% |
| Pos | 41% | 78% |
| App_FontSize | 54% | 82% |
| Nei | 50% | 86% |
| App | 59% | 88% |
| All | 70% | 91% |
![Page 26: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/26.jpg)
Title Extraction: Domain Adaptation

| Training set | Test set | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| MS | TREC | 0.698 | 0.615 | 0.654 | 0.642 |
| TREC | MS | 0.852 | 0.883 | 0.867 | 0.871 |
| TREC | TREC | 0.698 | 0.703 | 0.701 | 0.698 |
| MS | MS | 0.910 | 0.919 | 0.914 | 0.909 |
![Page 27: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/27.jpg)
Query Data for Retrieval Experiments
| Year | Task | Num. of queries |
|---|---|---|
| 2002 | NP | 150 |
| 2003 | TD | 50 |
| 2003 | HP | 150 |
| 2003 | NP | 150 |
| 2004 | TD | 75 |
| 2004 | HP | 75 |
| 2004 | NP | 75 |
![Page 28: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/28.jpg)
Web Page Retrieval Results (TREC)
[Figure: TREC-2003 NP. Mean Average Precision (MAP, axis from 0.35 to 0.65) versus Alpha (0 to 1) for BaseFields+Title, BaseFields+ExtTitle, and BaseFields+CombTitles]
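Mean Average Precision, the measure plotted against alpha in the retrieval results, is the mean over queries of each query's average precision. The sketch below uses the standard definition; the function names and the data layout are assumptions, and the document ids in the usage example are invented.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def best_alpha(map_by_alpha):
    """Given {alpha: MAP}, return the alpha with the highest MAP."""
    return max(map_by_alpha, key=map_by_alpha.get)
```

For a query with a single relevant page, average precision reduces to the reciprocal of the rank at which that page is returned, for example:

```python
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),   # two relevant pages
    (["x", "y", "z"], {"y"}),             # one relevant page at rank 2
]
overall = mean_average_precision(runs)
```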
![Page 29: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/29.jpg)
Web Page Retrieval Results (TREC)
[Figure: TREC-2003 HP. Mean Average Precision (MAP, axis from 0.15 to 0.45) versus Alpha (0 to 1) for BaseFields+Title, BaseFields+ExtTitle, and BaseFields+CombTitles]
![Page 30: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/30.jpg)
Web Page Retrieval Results (TREC)
[Figure: 2003 TD. Mean Average Precision (MAP, axis from 0.08 to 0.15) versus Alpha (0 to 1) for BasicFields+Title, BasicFields+ExtTitle, and BasicFields+CombTitle]
![Page 31: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/31.jpg)
Average Precision for Each Method
| Year | Task | BasicField | +Title | +CombTitle |
|---|---|---|---|---|
| 2003 | TD | 0.528 | 0.606 | 0.650 (>>) (+23.1%) |
| 2003 | HP | 0.302 | 0.397 (>>) (+31.4%) | 0.435 (>>) (+44.0%) |
| 2003 | NP | 0.096 | 0.127 (+32.3%) | 0.145 (+51.0%) |
![Page 32: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/32.jpg)
![Page 33: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/33.jpg)
Conclusions
Title fields of HTML documents are not reliable
We propose conducting title extraction from bodies of HTML documents
Construct domain-independent model using machine learning and format features
Use of extracted titles can help improve precision of web page retrieval, particularly TREC name page finding
![Page 34: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649d255503460f949fc19b/html5/thumbnails/34.jpg)
Thanks!