Download - Content extraction via tag ratios
![Page 1: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/1.jpg)
Content Extraction via Tag RatiosAuthor: Tim Weninger, William H. Hsu, Jiawei Han
Publication: WWW’2010
Presenter: Jhih-Ming Chen
1
![Page 2: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/2.jpg)
Outline
• Introduction
• Tag Ratio
• Threshold Method
• Smoothing
• Selecting Content from the Threshold
• Selecting Content via Clustering
• 2D Modal
• Constructing 2D Tag Ratios
• Clustering Tag Ratios
• Implementation details
• Experiments
• Summary2
![Page 3: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/3.jpg)
Introduction
• Typical webpage contains:
1. A title banner (or something similar) towards the top of the
page.
2. A list of hyperlinks on the left or right side of the page with
advertisements interspersed.
3. Most usually the meaningful content of the page is located in
the middle.
• Of course, this layout is not standard among all webpages;
therefore a flexible, robust content extraction tool is necessary.
3
![Page 4: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/4.jpg)
Introduction
• Most current content extraction techniques make use of
particular HTML cues such as tables, fonts, sizes, lines, etc..
• since modern webpages no longer include these cues, many
content extraction algorithms have begun to perform poorly.
• One difference:
• This paper make no assumptions about the particular structure of a
given webpage.
4
![Page 5: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/5.jpg)
Introduction
• Content Extraction via Tag Ratios (CETR)
• This paper construct a tag ratio (TR) array with the contention that
for each line i in the array.
• The higher the tag ratio is for the ith line the more likely that i
represents a line of content-text within the HTML document.
• TR array can be represented as a histogram, wherein each
histogram bucket represents the tag ratio of a line in the document.
• The problem is reduced to a histogram clustering task.
• Appropriate clusters should discriminate between TR-lines which
correspond to webpage content and those TR-lines which do not.
5
![Page 6: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/6.jpg)
Introduction
• Three clustering approaches are investigated in this work.
• The first two (CETR-TM, CETR-KM) are preliminary versions of
the approach.
• The final version (CETR) is claimed to be the most advanced and
best performing.
6
![Page 7: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/7.jpg)
Tag Ratio
• Tag Ratios, essentially, are the ratios of the count of non-
HTML-tag characters to the count of HTML-tags per line.
• The count of HTML-tags on a particular line is 0 then the ratio is
set to the length of the line.
7
![Page 8: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/8.jpg)
Tag Ratio
• Example 1 shows a snippet of code from an article published
on The Hutchinson News’ website with the corresponding tag
ratios.
• Figure 2 shows the resulting TR-histogram T.
• the high tag ratio portion to be indicative of the webpage’s
content.
8
![Page 9: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/9.jpg)
Threshold Method
• Determine a threshold τ which discriminates TRs into content
and non-content sections.
• That is, any TR value greater than or equal to τ should be
labeled content, and conversely, any TR value less than τ
should be labeled not content.
• The problem becomes a task of finding the best value for τ.
9
![Page 10: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/10.jpg)
Smoothing
• Without smoothing, many important content lines might be
lost.
• i.e. the page title, the news article byline or dateline, short or one
sentence paragraphs.
• To resolve this problem, this paper apply a Gaussian
smoothing pass to T.
1. Construct a Gaussian kernel k with a radius of 1 standard
deviation 1σ, giving a total window size of 2(σ) + 1.
• The size of and values within k vary according to σ because as the
variance of T increases, smoothing necessity also increases. 10
).(20,2
2
2 iekj
j
i
![Page 11: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/11.jpg)
Smoothing
2. k is normalized to form k’.
3. The Gaussian kernel k is convolved with T in order to form a
smoothed histogram (T’)
11
).(2,
0
' ik
kk
j j
ii
.)(,'' TleniTkT ji
j
ji
![Page 12: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/12.jpg)
Smoothing
12
![Page 13: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/13.jpg)
Selecting Content from the Threshold
• Let C be the set of content lines such that Di ∈ C iff Ti’ ≥ τ
where Di ≌ Ti’ and τ ← λσ where λ is a parameter and σ is the
standard deviation.
• λ ← 1.
• λ is discussed further in the experiment result.
• This threshold method is hereafter referred to as CETR-TM.
13
![Page 14: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/14.jpg)
Selecting Content via Clustering
• Alternatively, this paper apply the k-means clustering method
to group content C and non-content N lines by using T’ as the
only similarity measure.
• k ← 3.
• The resulting k clusters S1, S2 . . . Sk are labeled by selecting
the cluster which has its centroid closest to the origin (i.e. zero
in 1-dimensional space) Smin and assigning N ← Smin. The
remaining clusters are assigned to C.
• This 1-dimensional k-means clustering method is hereafter
referred to as CETR-KM.
14
![Page 15: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/15.jpg)
2D Modal
• One shortcoming of the Threshold Clustering and k-Means:
• They view the TR histogram as a set of values rather than an
ordered sequence of values, and as a result they are not sensitive
to jumps in the TR histogram.
• Transforming the histogram data so that it may be represented
in 2-dimensions we can capture the ordered nature of the
histogram data and obtain more accurate results.
• Define the two dimensions:
1. A smoothed TR histogram (T’).
2. The absolute smoothed derivatives of the smoothed TR
histogram (G).
15
ˆ
![Page 16: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/16.jpg)
2D Modal
• To compute G:
• First smooth T.
• Find the derivatives for each element in the array
• subtract T i from the mean of the next α elements in order to
differentiate on the moving average instead of line-by-line.
• α = 3.
• Next, smooth G by the above way to get G’.
• Finally, compute G such that Gi = |Gi’| for each i in G’.
16
.)(0,)( ''0
'
'' TleniTT
GTf i
j ji
ii
ˆ ˆ
![Page 17: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/17.jpg)
2D Modal
17
![Page 18: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/18.jpg)
Constructing 2D Tag Ratios
• Combining the Smoothed Tag Ratios T’ from Figure 3and the
Absolute Smoothed Derivatives of Smoothed Tag Ratios G
from Figure 4.
• Manually label each point
• content (×)
• non-content (+)
18
ˆ
Giˆ
Ti’
![Page 19: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/19.jpg)
Clustering Tag Ratios
• Based on the k-Means algorithm:
• k ← 3.
• One cluster has a centroid which is always set to the origin.
• Define mji to be the centroid of Si at iteration j and then force
m1j = (0, 0).
• This approach is beneficial in 2 ways:
1. It forces the remaining clusters to migrate away from the origin
where the non-content points are located.
2. It provides an easy means for labeling the resulting clusters.
• i.e. N ← S1.
• i.e. C ← S2, . . . , Sk.
19
![Page 20: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/20.jpg)
Implementation details
• Without HTML tags, the method will fail.
• Assume that tagless webpages contain only content and then
return the entire text.
• There exist some webpages wherein the HTML markup is
written in a single line.
• Resolve this issue by detecting these instances and inserting line
breaks every 65 characters.
• If the 65th character is located within a tag, then the line break is
inserted at the next non-tag location.
20
![Page 21: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/21.jpg)
Experiments
• Data Set
1. News site data from Pasternak and Roth’s 2009 WWW paper on
maximum subsequence segmentation (MSS).
2. Training and test data sets from the CleanEval competition.
• CleanEval:
• The CleanEval project is a shared task for cleaning arbitrary
webpages.
• This corpus includes four divisions: a development/training set
and an evaluation set in both English and Chinese languages
which are all hand-labeled.
• Because our approach does not require training there is no need to
separate between training/development and evaluation documents.21
![Page 22: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/22.jpg)
Results(CETR-TM)
• The threshold τ is set to 1.0σ, i.e., λ ← 1.0.
• If λ is increased (e.g., 1.1σ, 1.2σ), then the selectivity of the
threshold would increase causing the precision to increase and the
recall to decrease.
• If λ is decreased (e.g., 0.9σ, 0.8σ), then the selectivity of the
threshold would decrease resulting in a lower precision and a
higher recall.
22
![Page 23: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/23.jpg)
Results (CETR-KM)
• The CETR-KM method typically achieves either a high recall
or a high precision but rarely both at the same time.
• These results show that CETR-KM typically outperforms
CETR-TM.
23
![Page 24: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/24.jpg)
Results (CETR)
• The relatively low precision reported by NY Post, Freep and
Techweb is likely due to the fact that these sources contain user
comments, feedback, etc. after each article.
• CETR typically does not include short comments as content
whereas the gold standard extractions include these comments.
24
![Page 25: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/25.jpg)
Methods Comparison
• Some of the MSS results are not listed because the original
work did not perform experiments on these data sources.
• The MSS algorithm performs highest on the Big 5 data set.
• comment effect!
25
![Page 26: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/26.jpg)
Discussion
1. Unlike MSS, CETR is a completely unsupervised algorithm
and therefore does not require labeled training examples.
2. Finding a good threshold value is difficult because λ must be
empirically found for each domain.
26
![Page 27: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/27.jpg)
Summary
• Results show that when compared to several other leading
content extraction methods CETR performs best on average.
• The complete CETR algorithm contains:
• no parameters to adjust (k ← 3 and α ← 3 works for most cases).
• no training to be done.
• no classifier models to build.
• all that is required is to give, as input, an HTML document and the
approximate content will be returned.
27
![Page 28: Content extraction via tag ratios](https://reader033.vdocuments.net/reader033/viewer/2022042614/55a25e141a28ab822b8b4869/html5/thumbnails/28.jpg)
Future Work
• The authors intend to incorporate this method into standard
search engines in order to see what effect, if any, it has on the
result relevance.
• For instance, many webpages include word strings and links in
order to boost their search engine rank, if we can filter the
irrelevant text from the page during indexing then it may be
possible to present more relevant search results.
• Another area for further investigation is the clustering
algorithm used in CETR.
• In fact, a linear max-margin clustering approach may give better
results.
28