Web page classification using visual layout analysis
Miloš Kovačević1, Michelangelo Diligenti2, Marco Gori2, Marco Maggini2, Veljko Milutinović3
1School of Civil Engineering, University of Belgrade, [email protected]
2Dipartimento di Ingegneria dell’Informazione, University of Siena, Italy, {diligmic, maggini, marco}@dii.unisi.it
3School of Electrical Engineering, University of Belgrade, [email protected]
Abstract
Automatic processing of Web documents is an important issue in the design of search engines, of
Web mining tools, and of applications for Web information extraction. Simple text-based
approaches are typically used in which most of the information provided by the page visual layout
is discarded. Only some visual features, such as the font face and size, are effectively used to weigh the
importance of the words in the page. In this paper, we propose to use a hierarchical representation,
which includes the visual screen coordinates for every HTML object in the page. The use of the
visual layout allows us to identify common page components such as the header, the navigation
bars, the left and right menus, the footer, and the informative parts of the page. The recognition of
the functional role of each object is performed by a set of heuristic rules. The experimental results
show that page areas are correctly classified in 73% of the cases. The identification of different
functional areas on the page allows the definition of a more accurate method for representing the
page text contents, which splits the text features into different subsets according to the area they
belong to. We show that this approach can improve the classification accuracy for page topic
categorization by more than 10% with respect to the use of a flat “bag-of-words” representation.
1. Introduction
The target of Web page design is to organize the information to make it attractive and usable by
humans. Web authors organize the page contents in order to facilitate the users’ navigation and to
give a nice and readable look to the page contents. The complexity of the Web page structure has
increased in the last few years with the diffusion of sophisticated Web authoring tools which make it
easy to use different styles, to position text areas anywhere in the page, to automatically include
menus or navigation bars. Even though the new trends in Web page design allow authors to enrich the
information contents of published documents and to improve the presentation and usability for
human users, the automatic processing of Web documents is becoming increasingly difficult.
In fact, most of the features which are attractive for humans are just noise for machine learning and
information retrieval techniques when they rely only on a flat text-based representation of the page.
Since the page layout is used by Web authors to carry important information, visual analysis can
introduce many benefits in the design of systems for the automatic processing of Web pages. The
visual information associated with an HTML page consists essentially of the spatial position of each
HTML object in a browser window.
Visual information can be useful in many tasks. Consider the problem of feature selection for web
page classification. There are several methods, based on word-frequency analysis, to perform
feature selection, for example using Information Gain [1] or TF-IDF (term frequency – inverse
document frequency) [2]. In both cases we try to estimate which are the most relevant words that
describe document D, i.e. the best vector representation of D that will be used in the classification
process. It is evident that some words may represent noise with respect to the real page contents if
they belong to a menu, a navigation bar, a banner link or a page footer. Also, we can suppose that
the words in the central part of the page (screen) carry more information than the words in the
bottom right corner. Hence, there should be a way to weight words from different layout contexts
differently. Moreover, visual analysis allows us to identify areas related to different page contents.
The analysis of the page layout can be the basis for designing efficient crawling strategies in
focused search engines. Given a specific topic T and a set of seed pages S, a focused crawler visits
the Web graph starting from the pages in S, choosing the links to follow in order to retrieve only
on-topic pages. The policies for focused crawlers require estimating whether an outgoing link is
promising or not [3,4,5]. Usually, link selection is based on the analysis of the anchor text. However,
the position of the link can improve the selection accuracy: links that belong to menus or navigation
bars can be less important than links in the center of the page; links that are surrounded by “more”
text are probably more important to the topic than links positioned in groups; in any case, groups of
links can identify “hub” areas on the page.
Finally, many tricks have been used recently for cheating search engines in order to gain visibility
on the Web. A widely used technique consists in inserting irrelevant keywords into the HTML
source. While it is relatively easy to detect and reject false keywords whose foreground color is
the same as the background color, there is no way to detect keywords of regular color that are covered
with images. Moreover, topologically based rankings, such as the PageRank used in the Google search
engine [6], and indexing techniques based on the keywords contained in the anchor text of the links
to a given page can suffer from link spamming. The target page can gain positions in the ranking list
by constructing a “promoting” Web substructure which links the target page with fake on-topic
anchors and an ad hoc topology.
Visual page analysis is also motivated by the common design patterns of Web pages which result
from usability studies. Users expect to find certain objects in predefined areas on the browser screen
[7]. Figure 1 shows the areas where users expect to find internal and external links. These simple
observations allowed us to define a set of heuristics for the recognition of specific page areas
(menus, header, footer and “center” of a page).
Figure 1: User expectations (in percent) concerning the positions of internal (left) and external (right) links in the
browser window [7]. It is clear that menus are expected to be either inside the left or the right margin of a page.
We used the visual layout analysis and the classification of the page areas into functional
components to build a modular classifier for page categorization. The visual partition of the page
into components produces subsets of features which are processed by different Naïve Bayes
classifiers. The outputs of the classifiers are combined to obtain the class score for the page. We
used a feed-forward two-layer neural net to estimate the optimal weights to be used in the mixture.
The resulting classification system shows a significantly better accuracy than a single Naïve Bayes
classifier processing the page using a “bag-of-words” representation.
The paper is organized as follows. Section 2 defines the M-Tree representation of an HTML
document used to render the page on the virtual screen, i.e. to obtain the screen coordinates of
every HTML object. Section 3 describes the heuristics for the recognition of some functional
components (header, footer, left and right menu, and “center” of the page). In Section 4, the
experimental results for component recognition on a predefined dataset are reported. In Section 5,
we present the page classification system based on visual analysis. In Section 6, the experimental
results for the classification of HTML pages in a predefined dataset are shown. Finally, the
conclusions are drawn in Section 7.
2. Extracting visual information from an HTML source
We introduce a virtual screen (VS) that defines a coordinate system for specifying the positions
of HTML objects inside pages. The VS is a rectangle with a predefined width measured in pixels.
The VS width is set to 1000 pixels, which corresponds to the page display area of a maximized browser
window on a standard monitor with a resolution of 1024x768 pixels. Of course, one can assume any other
resolution, but this one is compatible with most of the common web page designs. The actual VS
height depends on the page at hand. The top left corner of the VS represents the origin of the VS
coordinate system.
Figure 2: Constructing the M-Tree (main steps). The parser produces a (TE,DE),(TE,DE),... sequence from the page and the tree builder constructs the m-Tree (first pass); the rendering module then produces the M-Tree (second and third pass).
Figure 2 shows the steps of the visual analysis algorithm. In the first step the page is parsed using
an HTML parser that extracts two different types of elements – tags and data. Tag elements (TE)
are delimited by the characters “<” and “>”, while data elements (DE) are contained between two
consecutive tags. Each TE element is labeled by the name of the corresponding tag and a list of
attribute-value pairs. A DE element is represented as a list of text tokens (words), which are
extracted using the space characters as separators. A DE contains the empty list if no token is
placed between the two consecutive tags. The parser skips the <SCRIPT> and <!--…> tags.
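To make the first step concrete, here is a minimal sketch of a parser that produces an alternating (TE,DE) sequence using Python's standard html.parser module. It is only an illustration of the description above (the class name and structure are ours, not the authors' implementation), and the error recovery performed by the actual tree builder is not shown.

```python
from html.parser import HTMLParser

class TeDeParser(HTMLParser):
    """Collect an alternating sequence of tag elements (TE) and data
    elements (DE) from an HTML source, skipping scripts and comments."""

    def __init__(self):
        super().__init__()
        self.sequence = []      # items: ("TE", NAME, attrs) or ("DE", tokens)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True      # <SCRIPT> content is skipped
            return
        self.sequence.append(("TE", tag.upper(), attrs))
        self.sequence.append(("DE", []))    # DE stays empty until text arrives

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        tokens = data.split()               # space characters as separators
        if self._in_script or not tokens:
            return
        if self.sequence and self.sequence[-1][0] == "DE":
            self.sequence[-1][1].extend(tokens)
        else:                               # text appearing before the first tag
            self.sequence.append(("DE", tokens))

    def handle_comment(self, data):
        pass                                # <!-- ... --> is skipped

# Example: parser = TeDeParser(); parser.feed(html_source); parser.sequence
```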
In the second step, the (TE,DE) sequence which is extracted from the HTML code is processed
by a tree builder. The output of this module is a structure we named m-Tree. Many different
algorithms to construct the parsing tree of an HTML page are described in the literature [8,9]. We
adopted a single pass technique based on a set of rules which allow us to properly nest TEs into the
hierarchy according to the HTML 4.01 specification [10]. Additional efforts were made to make
the tree builder robust to malformed HTML sources.
Definition 1: an m-Tree (mT) is a directed n-ary tree defined by a set of nodes N and a set of edges
E with the following characteristics:
1. N = Ndesc ∪ Ncont ∪ Ndata
where:
- Ndesc (description nodes) is the set of nodes which correspond to the TEs of the HTML tags
{<TITLE>, <META>};
- Ncont (container nodes) is the set of nodes which correspond to the TEs of the HTML tags
{<TABLE>, <CAPTION>, <TH>, <TD>, <TR>, <P>, <CENTER>, <DIV>,
<BLOCKQUOTE>, <ADDRESS>, <PRE>, <H1>, <H2>, <H3>, <H4>, <H5>, <H6>,
<OL>, <UL>, <LI>, <MENU>, <DIR>, <DL>, <DT>, <DD>, <A>, <IMG>, <BR>,
<HR>} ;
- Ndata (data nodes) is the set of nodes which correspond to the DEs.
Each node n∈ N has the following attributes: name is the name of the corresponding tag except
for the nodes in Ndata where name = “TEXT”; attval is a list of attribute-value pairs extracted
from the corresponding tag and it can be empty (e.g. for the nodes in Ndata). Additionally, each
node in Ndata has four more attributes: value which contains tokens from the corresponding DE;
fsize which describes the font size of the tokens; emph which defines the text appearance as
specified by the tags {<B>, <I>, <U>, <STRONG>, <EM>, <SMALL>, <BIG>}; align
which describes the alignment of the text (left, right or centered). In the following, we denote a
node n which corresponds to tag X as n<X>.
2. The root of mT, nROOT ∈ Ncont, represents the whole page; its name is set to “ROOT” while
attval contains only the pair (“URL”, url of the page).
3. The set of the edges E = {(nx , ny) | nx , ny ∈ N } contains:
a) (nROOT, ndesc), ∀ ndesc ∈ Ndesc
b) (ncont1, ncont2), ∀ ncont1 ∈ Ncont \ {n<IMG>}, ∀ ncont2 ∈ Ncont \ {nROOT} iff ncont2 belongs to the
context of ncont1 according to the nesting rules of the HTML 4.01 specification;
c) (ncont, ndata), ∀ ncont ∈ Ncont \ {n<IMG>},∀ ndata ∈ Ndata iff the node ndata belongs to the
context of the node ncont.
From the definition it follows that image and text nodes can only be leaves in an mT. Figure 3
shows an example of a simple page and its corresponding mT.
Figure 3: An HTML source (right) and its mT (left).
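As an illustration of Definition 1, an mT node could be represented by a structure like the following sketch; the attribute names follow the definition, while the class itself and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MTreeNode:
    """A node of the m-Tree (Definition 1). Only data nodes use the
    value/fsize/emph/align attributes; the others leave them unset."""
    name: str                                    # tag name, or "TEXT" for data nodes
    attval: List[Tuple[str, str]] = field(default_factory=list)
    children: List["MTreeNode"] = field(default_factory=list)
    # attributes used only by nodes in Ndata
    value: Optional[List[str]] = None            # tokens of the corresponding DE
    fsize: Optional[int] = None                  # font size of the tokens
    emph: Optional[List[str]] = None             # e.g. ["B", "EM"]
    align: Optional[str] = None                  # "left", "right" or "center"

# The root of the tree (point 2 of the definition):
root = MTreeNode(name="ROOT", attval=[("URL", "http://example.org/")])
root.children.append(MTreeNode(name="TITLE"))
root.children.append(MTreeNode(name="TEXT", value=["Hello", "world"], fsize=3))
```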
The coordinates of every object of interest are computed by the rendering module using the mT.
We followed some recommendations from W3C [10] and the visual behavior of one of the most
popular browsers (Microsoft Internet Explorer) to design the renderer. In order to simplify and
speed up the rendering process, we adopted some simplifications which do not significantly influence
the page representation for our specific task. The simplifications are the following:
1) The rendering module (RM) calculates coordinates only for the nodes in the mT, i.e. some HTML
tags are not considered.
2) The RM does not support layered HTML documents.
3) The RM does not support frames.
4) The RM does not support style sheets.
The rendering module produces the final representation of a page as an M-Tree (MT), which
extends the concept of the mT by incorporating the VS coordinates of each node n ∈ N \ Ndesc.
Definition 2: An MT is the extension of an mT in which ∀n ∈ N \ Ndesc there are two additional
attributes: X and Y. These are arrays which contain the x and y coordinates of the corresponding
object polygon on the VS, with the following characteristics:
1. If n ∈ Ncont \ {n<A>} then it is assumed that the corresponding object occupies a rectangular area
on the VS. Thus X and Y have dimension 4. The margins of the rectangle are:
- the bottom margin is equal to the top margin of the left sibling node if it exists. If n does not
have a left sibling or n = n<TD> , then the bottom margin is equal to the bottom margin of its
parent node. If n = nROOT, then the bottom margin is the x-axis of the VS coordinate system.
- the top margin is equal to the top margin of the rightmost leaf node of the sub-tree having n as
the root node.
- the left margin is equal to the left margin of the parent node of n, shifted to the right by a
correction factor. This factor depends on the name of the node (e.g. if name = “LI” this factor
is set to 5 times the current font width because of the indentation of list items). If n = n<TD> and
n has a left sibling, then the left margin is equal to the right margin of this left sibling. If n =
nROOT, then the left margin is the y-axis of the VS coordinate system.
- the right margin is equal to the right margin of the parent of node n. If n = n<TABLE> or n =
n<TD> then the right margin is set to correspond to table/cell width.
2. If n ∈ Ndata or n = n<A>, then X and Y can have dimensions from 4 to 8 depending on the area on
the VS occupied by the corresponding text/link (see Figure 4). Coordinates are calculated
considering the number of characters contained in the value attribute and the current font width.
Text flow is restricted by the right margin of the parent node, after which a new line is started. The
height of a line is determined by the current font height (a simplified sketch of this computation is
given after Figure 4).
Figure 4: Some examples of TEXT polygons: a single-line text occupies one rectangle, while longer texts wrap over several lines and their polygons are described by up to 8 vertices.
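As anticipated above, the following is a simplified sketch of how the area occupied by a text flow can be approximated, one rectangle per line, from the starting position, the parent's margins, and the font metrics; the actual renderer builds a single polygon with 4 to 8 vertices, so this is only an approximation, and all names are ours.

```python
def text_line_boxes(x_start, y_start, n_chars, font_w, font_h,
                    left_margin, right_margin):
    """Approximate the area occupied by a run of text as one rectangle per
    line: the text flows from (x_start, y_start) to the parent's right
    margin, then wraps to the parent's left margin (y grows downwards,
    as in the VS coordinate system)."""
    boxes = []
    x, y, remaining = x_start, y_start, n_chars
    while remaining > 0:
        fit = max(1, int((right_margin - x) // font_w))      # chars fitting on this line
        used = min(fit, remaining)
        boxes.append((x, y, x + used * font_w, y + font_h))  # (x0, y0, x1, y1)
        remaining -= used
        x, y = left_margin, y + font_h                       # start a new line
    return boxes

# Example: a 60-character string starting at x=400 inside a 100..600 column
# with an 8x16 px font occupies two line boxes.
print(text_line_boxes(400, 0, 60, 8, 16, 100, 600))
```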
The definition of the M-Tree covers most of the aspects of the rendering process, but not all of
them because of the complexity of the process. For example, if the page contains tables then the RM
implements a modified auto-layout algorithm [11] for calculating table/column/cell widths. When a
n<TABLE> node is encountered, the RM goes down the mT to calculate the cell/column/table widths
to determine the table width. If there are other n<TABLE> nodes down on the path (nesting of tables)
the width computation is performed recursively. Before resolving a table, artificial cells (nodes) are
inserted in order to simplify the cases where cell spanning is present (colspan and rowspan attributes
in the tag <TD>).
3. Defining heuristics for the recognition of the page components
Given the MT of a page and assuming the common web page design patterns, it is possible to
define a set of heuristics for the recognition of standard components of a page, such as menus or
footers. We considered five different types of components: header (H), footer (F), left menu (LM),
right menu (RM), and center of the page (C). We focused our attention on these particular
components because they are frequently found in Web pages regardless of the page topic. We
adopted intuitive definitions for each class, which rely exclusively on the VS coordinates of logical
groups of objects in the page.
After a careful examination of many different Web pages, we restricted the areas in which
components of type H, F, LM, RM, and C can be found. We introduced a specific partition of a
page into locations, as shown in Figure 5.
Figure 5: Partition of the page into locations: H spans the top of the page down to H1, F spans the bottom of the page up to H2, LM and RM are vertical bands of widths W1 and W2 along the left and right margins, and C is the remaining central area.
We set W1 = W2 to be 30% of the page width in pixels determined by the rightmost margin
among the nodes in MT. W1 (W2) defines the location LM (RM) where the LM (RM) components
can be exclusively found. We set H1 = 200 pixels and H2 = 150 pixels. H1 and H2 define H and F
respectively which are the locations where the components H and F can be exclusively found. These
values were found by a statistical analysis on a sample of Web pages. Components are recognized
using the following heuristics:
Heuristic 1: H consists of the nodes in each sub-tree S of MT whose root rS lies in H and satisfies
one or more of the following conditions:
1. rS is of type n<TABLE> and the corresponding table lies completely in H (i.e. the upper margin of the
table is less than or equal to H1).
2. the upper margin of the object associated to rS is less than or equal to the maximum upper bound
of all the n<TABLE> nodes which satisfy condition 1 and rS is not in a sub-tree satisfying condition
1.
Heuristic 2: LM consists of the nodes in each sub-tree S of MT whose root rS lies in LM, is not
contained in H, and satisfies one or more of the following conditions:
1. rS is of type n<TABLE> and the corresponding table lies completely in LM (i.e. the right bound of the table is less
than or equal to W1).
2. rS is of type n<TD>, lies completely in LM, and the node n<TABLE> to which it belongs has a lower
bound less than or equal to H1 and an upper bound greater than or equal to H2.
Heuristic 3: RM consists of the nodes in each sub-tree S of MT whose root rS lies in RM, is not
contained in H or LM, and satisfies one or more of the two conditions obtained from the
conditions of Heuristic 2 by substituting LM with RM and W1 with W2.
Heuristic 4: F consists of the nodes in each sub-tree S of MT whose root rS lies in F, is not
contained in H, LM, or RM, and satisfies one or more of the following conditions:
1. rS is of type n<TABLE> and the node lies completely in F (i.e. the bottom margin of the table is
greater than or equal to H2).
2. the lower bound of rS is greater than or equal to the maximum lower bound of all the n<TABLE>
nodes which satisfy condition 1 and rS does not belong to any sub-tree satisfying condition 1.
3. the lower bound of rS is greater than or equal to the upper bound of the lowest of all nodes n
completely contained in F, where n ∈ {n<BR>, n<HR>} or n is in the scope of the central text
alignment, and rS does not belong to any sub-tree satisfying one of the two previous conditions.
Heuristic 5: C consists of all nodes in MT that are not in H, LM, RM, and F.
These heuristics are strongly dependent on table objects. In fact, tables are commonly used (≈
88% of the cases) to organize the layout of the page and the alignment of other objects. Thus pages
usually include a lot of tables and every table cell often represents a small amount of logically
grouped information. Often tables are used to group menu objects, footers, search and input forms,
and text areas.
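In a much simplified form, the heuristics can be sketched as a single function that labels the root of a sub-tree from its bounding box and the thresholds H1, H2, W1, and W2; the table- and cell-specific conditions of Heuristics 1-4 are collapsed into plain containment tests here, so this is an approximation of the rules, not their exact implementation.

```python
# Simplified location tests; y grows downwards and the page ends at page_height.
H1, H2 = 200, 150                        # header / footer bands (pixels)

def region_of(box, page_width, page_height):
    """box = (x0, y0, x1, y1): bounding box of a sub-tree root on the VS."""
    W1 = W2 = 0.30 * page_width          # left / right menu bands (30% of the width)
    x0, y0, x1, y1 = box
    if y1 <= H1:                         # completely inside the top band -> header
        return "H"
    if y0 >= page_height - H2:           # completely inside the bottom band -> footer
        return "F"
    if x1 <= W1:                         # completely inside the left band -> left menu
        return "LM"
    if x0 >= page_width - W2:            # completely inside the right band -> right menu
        return "RM"
    return "C"                           # everything else is the center (Heuristic 5)
```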
4. Experimental results for component recognition
We constructed a complete dataset by downloading nearly 1000 files from the first level of each
root category of the Open Directory (www.dmoz.org). The total number of pages in the
dataset was about 16000. In order to test the performance of the component recognition algorithm,
we extracted a subset D of the complete dataset by randomly choosing 515 files, uniformly
distributed among the categories and the file sizes. Two experts collaboratively labeled the areas in
the pages that could be considered of type H, F, LM, RM, and C. The areas of the pages in D were
automatically labeled by using the Siena Tree 1 tool that includes an MT builder and the logic for
applying the component recognition heuristics. In these experiments, we selected values for the
margins H1, H2, W1, and W2 according to the statistics from [6]. The performance of the region
classifier was evaluated by comparing the labels assigned by the experts and the labels produced
automatically by the recognition algorithm. The results are shown in Table 1.
                  Header   Footer   Left M   Right M   Overall
Not recognized      25       13        6        5         3
Bad                 16       17       15       14        24
Good                10       15        3        2        50
Excellent           49       55       76       79        23

Table 1: Recognition rate (in %) for different levels of accuracy. The “Good” and “Excellent” rows correspond to successful recognition.
1 Siena Tree is written in Java 1.3 and can be used to visualize objects of interest from a web page. One can enter any sequence of HTML tags to obtain the picture (visualization) of their positions. To obtain a demo version contact [email protected]
The possible outcomes of the recognition process were summarized in 4 different categories. A
component X is “not recognized” if the component X exists but is not detected, or if X does not exist
but some part of the page is labeled as X. If less than 50% of the objects in X are labeled, or if
some objects outside X are labeled too, then the recognition is considered “bad”. A “good”
recognition is obtained if more than 50% but less than 90% of the objects in X are labeled and no
objects outside X are labeled. Finally, the recognition is considered “excellent” if more than 90% of
the objects from X and no objects outside X are labeled.
The recognition of components of type C is the complement of the recognition of the other areas
according to Heuristic 5, so we did not include it in the performance measurements. The column
“overall” is obtained by introducing a total score S for a given page as the sum of the scores
assigned for the recognition of all areas of interest. If X ∈ {H, F, LM, RM} is “not recognized”, then
the corresponding score is 0. The categories “bad”, “good”, and “excellent” are mapped to the score
values 1, 2, and 3, respectively. Hence, if S = 12 we considered the overall recognition for a particular
page as “excellent”. Similarly, “good” refers to the case 8 ≤ S < 12, “bad” stands for the case 4 ≤
S < 8, and “not recognized” represents the cases in which S < 4.
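The mapping from the four per-component outcomes to the overall label follows directly from these rules; a minimal sketch (the helper names are ours):

```python
SCORE = {"not recognized": 0, "bad": 1, "good": 2, "excellent": 3}

def overall_label(outcomes):
    """outcomes: dict with keys 'H', 'F', 'LM', 'RM' mapped to one of the
    four per-component categories; returns the overall category."""
    s = sum(SCORE[outcomes[x]] for x in ("H", "F", "LM", "RM"))
    if s == 12:
        return "excellent"
    if 8 <= s < 12:
        return "good"
    if 4 <= s < 8:
        return "bad"
    return "not recognized"

# Example: three excellent areas and one bad one give S = 10 -> "good".
print(overall_label({"H": "excellent", "F": "excellent",
                     "LM": "excellent", "RM": "bad"}))
```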
By analyzing pages for which the overall recognition was “bad” or “not recognized”, we found
that in nearly 20% of the cases, the MT was not completely correct because of the approximations
used in the rendering process. A typical error is that the single subparts of the page are internally
well rendered, but they are scrambled as a whole. For the remaining 80% of the cases, the bad
performance was due to the heuristic rules used in the classifier. We are currently investigating the
possibility of writing more accurate rules.
5. Page classification using visual information
The rendering module provides an enhanced document representation, which can be used in all
the tasks (e.g. page ranking, crawling, document clustering and classification) where it is important
to preserve the complex structure of a Web page, which is not captured by the traditional bag-of-
words representation. In this paper, we have performed some document classification experiments
using an MT representation for each page. In this section, we first describe the employed feature
selection algorithm. Then, we describe the architecture of a novel classification system dealing with
visual information. In particular, the proposed system is composed of a pool of Naïve Bayes
classifiers whose outputs are combined by a neural network (NN).
5.1. Feature selection process
Let D be a set of pages and d ∈ D be a generic document in the collection. We assume that each
document in D belongs to one of n mutually exclusive categories. We indicate with ci the set
containing all documents belonging to the ith category. In order to use any classification technique, d
must be represented as a feature vector relative to a D-specific vocabulary V. A common approach
is to construct a feature vector v representing d that has |V| coordinates: the ith entry of v is equal to
1 iff the ith feature of V appears in d, and equal to 0 otherwise. In order to reduce the dimensionality of
the feature space and thus to improve classification accuracy, numerous techniques were developed
[13]. Our feature selection methodology starts by dividing each page into 6 parts: header, footer, left
and right menus, central part, and a set of features from meta tags such as the <TITLE> tag. When this
procedure is applied to a dataset of documents, 6 datasets are obtained, each one containing
only a portion of the initial documents. For each dataset, we construct a vocabulary of features and
compute the Information Gain of each feature. In the reduced document representation we
want to maintain the same percentage of features belonging to each specific region. For example,
suppose that a page contains 100 different features and that we want to reduce the page to contain
only 50 features. If the header, footer, left and right menus, central part and title respectively contain
20, 6, 20, 4 and 40 features, then we pick the 20·50/100 = 10 most informative features (the features
providing the highest Information Gain) from the header, 3 from the footer, and so on.
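A minimal sketch of this proportional selection, assuming the per-region Information Gain scores have already been computed (function and variable names are ours, not those of the original implementation):

```python
def select_features(region_features, target_size):
    """region_features: dict region -> list of (feature, information_gain),
    e.g. {'header': [...], 'footer': [...], ...}.  Keeps the same fraction
    of features in every region, picking the most informative ones."""
    total = sum(len(feats) for feats in region_features.values())
    selected = {}
    for region, feats in region_features.items():
        quota = round(len(feats) * target_size / total)     # e.g. 20*50/100 = 10
        ranked = sorted(feats, key=lambda fg: fg[1], reverse=True)
        selected[region] = [f for f, _ in ranked[:quota]]
    return selected
```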
5.2. Classification system architecture
We decided to use the Naïve Bayes (NB) classification technique for classifying each element of the
page. The Naïve Bayes classifier [11] is the simplest instance of a probabilistic classifier. The output
p(c|d) of a probabilistic classifier is the probability that a pattern d belongs to a class c after
observing the data d (posterior probability). It assumes that text data come from a set of parametric
models (each single model is associated with a class). Training data are used to estimate the unknown
model parameters. During the operational phase, the classifier computes (for each model) the
probability p(d|c) expressing the probability that the document is generated by the model. The
Bayes theorem allows the inversion of the generative model and the computation of the posterior
probabilities (the probability that the model generated the pattern). The final classification is performed
by selecting the model yielding the maximum posterior probability. In spite of its simplicity, a Naïve
Bayes classifier is almost as accurate as state-of-the-art learning algorithms for text categorization
tasks [12]. The Naïve Bayes classifier is the most widely used classifier in many different Web applications
such as focused crawling, recommender systems, etc. For all these reasons, we have selected such
a classifier to measure the accuracy improvement provided by taking visual information into account.
Figure 6: Page representation used as input for 6 Naïve Bayes classifiers: the page P is converted into its MT, the functional areas are recognized, and the page is split into header, footer, left menu, right menu, center, and title + meta tags.
A Naïve Bayes classifier computes the conditional probability P(cj|d) that a given document d belongs to cj.
According to the Bayes theorem, this probability is:
P(c_j | d) = \frac{P(c_j) P(d | c_j)}{P(d)}    (3)
The Naïve Bayes assumption states that features are independent of each other, given a category
cj:
P(c_j | d) \propto P(c_j) \prod_{i=1}^{|v|} P(w_i | c_j)    (4)
where wi ∈ V is the ith feature of v. In the learning phase we estimate P(cj) and P(wk|cj) from the
set of training examples. The estimates are given by the following equations:
P(c_j) = \frac{1 + \sum_{i=1}^{|D|} y_{ij}}{n + |D|}    (5)

P(w_k | c_j) = \frac{1 + \sum_{i=1}^{|D|} y_{ij} N(w_k, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} y_{ij} N(w_s, d_i)}    (6)
where yij =1 iff di ∈ cj, else yij =0. N(w,di) is equal to the frequency of the feature w in di. In (5) and
(6) we used Laplace smoothing to prevent zero probabilities for infrequently occurring features.
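For concreteness, the following sketch implements the estimates of equations (5) and (6), with Laplace smoothing, and the classification rule of equation (4) in log space; it is an illustrative implementation with our own variable names, not the authors' code.

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels, vocabulary, classes):
    """documents: list of token lists; labels: parallel list of class ids.
    Returns the priors P(c_j) (eq. 5) and the conditionals P(w_k|c_j) (eq. 6)."""
    D, V, n = len(documents), len(vocabulary), len(classes)
    prior = {c: (1 + sum(1 for l in labels if l == c)) / (n + D) for c in classes}
    counts = {c: Counter() for c in classes}     # sums of N(w, d_i) over d_i in c_j
    for tokens, label in zip(documents, labels):
        counts[label].update(t for t in tokens if t in vocabulary)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())          # denominator sum over all features
        cond[c] = {w: (1 + counts[c][w]) / (V + total) for w in vocabulary}
    return prior, cond

def classify(tokens, prior, cond, classes):
    """Pick argmax_j P(c_j) * prod_i P(w_i|c_j) (eq. 4), computed in log space."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][t])
                                        for t in tokens if t in cond[c])
    return max(classes, key=score)
```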
After splitting a page into 6 constituents, each constituent is classified by a NB classifier.
Therefore, 6 distinct class membership estimates are obtained for each class, yielding a vector of 6·n
probability components for each processed page (n is the number of distinct classes in the dataset).
In a first prototype, we decided to calculate a linear combination of the probability estimates for each
class and then to assign the final class label c_j^* to the class with the maximum value of that combination:
c_j^* = \arg\max_j \sum_{i=1}^{6} k_i P(c_j | d_i)    (7)
The probabilities P(c_j | d_i) are the outputs of the ith classifier, and d_i denotes the ith constituent
of page d. In our experiments, i = 1, ..., 6 corresponds respectively to the left menu, right menu,
header, footer, center, and meta-tags. The weight k_i (0 ≤ k_i ≤ 1, i = 1, ..., 6, and \sum_{i=1}^{6} k_i = 1)
denotes the influence of the ith classifier on the final decision. Such influence should be proportional
to the expected relevance of the information stored in the d_i part of the page. In particular, after
some tuning, we assigned the following weights to the classifiers: header 0.1, footer 0.01, left
menu 0.05, right menu 0.04, center 0.5, title and meta-tags 0.3. The classification results are
shown in Section 6.
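The first prototype therefore reduces to a weighted sum of the six probability vectors; a small sketch with the hand-tuned weights listed above (the ordering of the constituents follows the indexing i = 1, ..., 6 defined above, and the function name is ours):

```python
# Hand-tuned weights for: left menu, right menu, header, footer, center, meta-tags.
WEIGHTS = [0.05, 0.04, 0.1, 0.01, 0.5, 0.3]     # sums to 1

def combine_linear(region_probs):
    """region_probs[i][j] = P(c_j | d_i), the output of the i-th Naive Bayes
    classifier; returns the index of the winning class (eq. 7)."""
    n = len(region_probs[0])
    scores = [sum(WEIGHTS[i] * region_probs[i][j] for i in range(6))
              for j in range(n)]
    return max(range(n), key=scores.__getitem__)
```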
In a second prototype, we employed an artificial neural network to estimate the optimal weights
of the mixture. Such an approach allows the system to automatically adapt to different datasets
without any manual tuning of the parameters.
Figure 7: Architecture of a classification system based on a visual representation of a page: the M-Tree builder, the area recognizer, and the feature selection unit split the page into its components; six Naïve Bayes classifiers (H, F, LM, RM, C, M) each produce n class probability estimates, which are combined by a neural net into the final class probabilities.
We used a two-layer feed-forward neural net with sigmoidal units, trained with the
Backpropagation learning algorithm [14]. The basic idea of Backpropagation learning is to
implement a gradient descent search through the space of possible network weights wi, iteratively
reducing the error between the training example target values and the network outputs. The 6·n-
dimensional output of the Naïve Bayes classifiers is the input to the network. In our experimental setup,
the neural network featured 20 hidden units and 15 output units, one per class. Each network
output represents the probability of belonging to the corresponding class. The index of
the output with the maximum value corresponds to the winning class. The learning rate and the
momentum term were both set equal to 0.5. A validation set was used to learn the network
parameters.
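A minimal sketch of this second prototype, using scikit-learn's MLPClassifier as a stand-in for the authors' own network implementation (the hyperparameters mirror those reported above: 20 sigmoidal hidden units, learning rate and momentum of 0.5):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_mixture(X, y):
    """X: one row per page, the concatenated 6*n Naive Bayes outputs;
    y: the class ids.  Two-layer feed-forward net with sigmoidal units,
    trained by stochastic gradient descent (a stand-in for backpropagation
    as described in the paper)."""
    net = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic",
                        solver="sgd", learning_rate_init=0.5, momentum=0.5,
                        max_iter=500)
    net.fit(np.asarray(X), np.asarray(y))
    return net

# Winning class = output with the maximum value:
# predicted = train_mixture(X_train, y_train).predict(X_test)
```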
6. Classification results
At the time of writing, there was no dataset of Web pages commonly accepted
as a standard reference for classification tasks. Probably the best known dataset is WebKB2.
After some inspection of the pages in this dataset, we concluded that it is not appropriate
because its pages have a very simple layout without complex HTML structures. This dataset originates
from January 1997 and collected pages only from educational domains. Educational pages often
feature a much simpler structure than commercial pages, which moreover cover more than 80% of
the Web [16]. Since 1997, Web design has evolved significantly; nowadays, many popular
software packages allow a fast and easy design of complex Web pages. Real pages are often very
complex in their presentation (just look at the CNN or BBC home pages and compare them
with the pages from the WebKB dataset). Thus, we decided to create our own dataset. After
extracting all the URLs provided by the first 5 levels of the DMOZ topic taxonomy, we selected the
15 topics at the first level of the hierarchy (the first level topic “Regional” was discarded since it
mostly contains non-English documents). Each URL has been associated with the class (topic) from
which it has been discovered. Finally, all classes have been randomly pruned, keeping only 1000
URLs for each class. Using a Web crawler, we downloaded all the documents associated with the
URLs. Many links were broken (server down or pages not available anymore), thus only about
10,000 pages could effectively be retrieved (an average of 668 pages per class). These pages
have been used to create the dataset. Such a dataset3 can easily be replicated, enlarged and updated
(the continuous change of Web formats and styles does not allow the use of a frozen dataset, since
after a few months it would no longer be representative of the real documents that can be found on the
Internet).
2 The WebKB dataset can be downloaded from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
3 The dataset can be downloaded from http://nautilus.dii.unisi.it/download/webDataset.tar.gz
The dataset was split into a training set, a test set, and a validation set, the latter being used for
training the neural network. In order to compare the results provided by our method, we used a
standard NB classifier (as described in Section 5) dealing with all the words, filtered by a feature
selection method. In particular, we employed a feature selection algorithm based either on TF-IDF
or on Information Gain. The results are shown in Figure 8.
Figure 8: Classifier performance (percentage of correctly classified pages as a function of the number of
features, from 25 to 125) with 100 (left) and 200 (right) training examples per category. NN and LM
indicate the results of the methods taking into account visual information, with the final mixture performed
by a neural network and by a linear combination, respectively. IG indicates the result of a NB classifier
dealing with the entire page (no visual information is considered), filtered using the Information Gain of
words. IG stop-list additionally uses a stop-list of features. Finally, TF-IDF indicates the results of a NB
classifier whose input is filtered by a TF-IDF based algorithm. Visual-based methods clearly outperform
methods based on a non-visual representation.
Feature selection based on TF-IDF performs worse than feature selection based on Information Gain. Classification
accuracy can be improved using feature selection based on Information Gain in combination with a
stop-list containing common terms such as articles, pronouns, etc. However, it is clear that the methods
using visual information clearly outperform the methods purely based on text analysis (the improvement
was around 10% even where the classical methods reach their maximum). Increasing the number of
training examples from 100 to 200 per group helps to improve the performance of all the
methods. An interesting effect is that if we increase the number of features beyond some limit
(around 75), the precision of the classical approaches decreases. Our methods are more robust to this
phenomenon, since noisy features are likely to lie in the marginal areas of the page and are then
filtered by the final mixture.
Even if the manually tuned mixture provides slightly better performance than the mixture
computed by the neural network, the difference is always less than 2%, clearly showing that the
network estimates the intrinsic importance of a region well.
7. Conclusion
This paper describes a possible representation for a Web page, called M-Tree, in which objects
are placed into a well-defined tree hierarchy according to their position in the HTML structure
of the page. Furthermore, each object (node of the M-Tree) carries information about its position in a
browser window. This visual information enables us to define heuristics for the recognition of common
areas such as the header, footer, left and right menus, and center of a page. The crucial difficulty was
developing a sufficiently good rendering algorithm, i.e., imitating the behavior of popular user agents such as
Internet Explorer.
Unfortunately, the HTML source often does not conform to the standard, posing additional problems in
the rendering process. After applying some techniques for error recovery in the construction of the
parsing tree and introducing some rendering simplifications (we do not deal with frames, layers and
style sheets), we defined recognition heuristics based only on visual information. The overall
recognition rate for the targeted areas was around 73%. In the future, we plan to improve the rendering process
and the recognition heuristics. We also plan to recognize logical groups of objects that are not
necessarily in the same area, such as figure captions, titles and subtitles in the text, commercial add-ons,
etc.
Our experimental results provide evidence to claim that spatial information is of crucial
importance for classifying Web documents. The classification accuracy of a Naïve Bayes classifier
increased by more than 10% when taking into account the visual information. In particular, we
constructed a mixture of classifiers, each one trained to recognize the words appearing in a specific
portion of the page. In the future, we plan to use our system to improve link selection in focused
crawling sessions [4] by estimating the importance of hyperlinks using their position and neighborhood.
Moreover, we believe that our visual page representation can find application in many other areas
related to search engines, information retrieval and data mining from the Web.
Acknowledgements
We would like to thank Nicola Baldini for fruitful discussions on the Web page representations
adopted in the focuseek project (www.focuseek.com), which inspired some of the ideas proposed in
this paper.
References
[1] Quinlan, J.R., “Induction of decision trees”, Machine Learning, 1986, pp. 81-106.
[2] Salton, G., McGill, M.J., An Introduction to Modern Information Retrieval, McGraw-Hill,
1983.
[3] Chakrabarti S., van den Berg M., Dom B., “Focused crawling: A new approach to topic-specific
web resource discovery”, In Proceedings of the 8th Int. World Wide Web Conference, Toronto,
Canada, 1999.
[4] Diligenti M., Coetzee F., Lawrence S., Giles C., Gori M., “Focused crawling using context
graphs”, In Proceedings of the 26th Int. Conf. On Very Large Databases, Cairo, Egypt, 2000.
[5] Rennie J., McCallum A., “Using reinforcement learning to spider the web efficiently”, In
Proceedings of the Int. Conf. On Machine Learning, Bled, Slovenia, 1999.
[6] Brin S., Page L., “The anatomy of a large-scale hypertextual web search engine”, In
Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998,
ACM Press.
[7] Bernard L.M., “Criteria for optimal web design (designing for usability)”,
http://psychology.wichita.edu/optimalweb/position.htm, 2001
[8] Embley D.W., Jiang Y.S., Ng Y.K., “Record-Boundary Discovery in Web Documents”, In
Proceedings of SIGMOD, Philadelphia, USA, 1999.
[9] Lim S. J., Ng Y. K., “Extracting Structures of HTML Documents Using a High-Level Stack
Machine”, In Proceedings of the 12th International Conference on Information Networking
ICOIN, Tokyo, Japan, 1998
[10] World Wide Web Consortium (W3C), “HTML 4.01 Specification”,
http://www.w3c.org/TR/html401/ , December 1999.
[11] James F., “Representing Structured Information in Audio Interfaces: A Framework for
Selecting Audio Marking Techniques to Represent Document Structures”, Ph.D. thesis,
Stanford University, available online at http://www-pcd.stanford.edu/frankie/thesis/, 2001.
[12] Mitchell T., “Machine Learning”, McGraw Hill, 1997.
[13] Sebastiani F., “Machine learning in automated text categorization”, ACM Computing Surveys,
34(1), 2002, pp. 1-47.
[14] Yang, Y., Pedersen J.P. “A Comparative Study on Feature Selection in Text Categorization”,
In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97),
1997, pp. 412-420.