
Web page classification using visual layout analysis

Miloš Kovačević1, Michelangelo Diligenti2, Marco Gori2, Marco Maggini2, Veljko Milutinović3

1School of Civil Engineering, University of Belgrade, Serbia, [email protected]

2Dipartimento di Ingegneria dell’Informazione, University of Siena, Italy, {diligmic, maggini, marco}@dii.unisi.it

3School of Electrical Engineering, University of Belgrade, Serbia, [email protected]

Abstract

Automatic processing of Web documents is an important issue in the design of search engines, of

Web mining tools, and of applications for Web information extraction. Simple text-based

approaches are typically used in which most of the information provided by the page visual layout

is discarded. Only some visual features, such as the font face and size, are effectively used to weigh the

importance of the words in the page. In this paper, we propose to use a hierarchical representation,

which includes the visual screen coordinates for every HTML object in the page. The use of the

visual layout allows us to identify common page components such as the header, the navigation

bars, the left and right menus, the footer, and the informative parts of the page. The recognition of

the functional role of each object is performed by a set of heuristic rules. The experimental results

show that page areas are correctly classified in 73% of the cases. The identification of different

functional areas on the page allows the definition of a more accurate method for representing the

page text contents, which splits the text features into different subsets according to the area they

belong to. We show that this approach can improve the classification accuracy for page topic

categorization by more than 10% with respect to the use of a flat “bag-of-words” representation.


1. Introduction

The target of Web page design is to organize the information to make it attractive and usable by

humans. Web authors organize the page contents in order to facilitate the users’ navigation and to

give a nice and readable look to the page contents. The complexity of the Web page structure has

increased in the last few years with the diffusion of sophisticated Web authoring tools which make it

easy to use different styles, to position text areas anywhere in the page, and to automatically include menus or navigation bars. Even if the new trends in Web page design allow authors to enrich the

information contents of published documents and to improve the presentation and usability for

human users, the automatic processing of Web documents is becoming increasingly difficult.

In fact, most of the features which are attractive for humans are just noise for machine learning and

information retrieval techniques when they rely only on a flat text-based representation of the page.

Since the page layout is used by Web authors to carry important information, visual analysis can

introduce many benefits in the design of systems for the automatic processing of web pages. The

visual information associated to an HTML page consists essentially of the spatial position of each

HTML object in a browser window.

Visual information can be useful in many tasks. Consider the problem of feature selection for web

page classification. There are several methods, based on word-frequency analysis, to perform

feature selection, for example using Information Gain [1] or TF-IDF (term frequency – inverse

document frequency) [2]. In both cases we try to estimate which are the most relevant words that

describe document D, i.e. the best vector representation of D that will be used in the classification

process. It is evident that some words may represent noise with respect to the real page contents if

they belong to a menu, a navigation bar, a banner link or a page footer. Also, we can suppose that

the words in the central part of the page (screen) carry more information than the words from the

bottom right corner. Hence, there should be a way to weight words from different layout contexts differently. Moreover, visual analysis can allow us to identify areas related to different page contents.


The analysis of the page layout can be the basis for designing efficient crawling strategies in

focused search engines. Given a specific topic T and a set of seed pages S, a focused crawler visits

the Web graph starting from the pages in S, choosing the links to follow in order to retrieve only

on-topic pages. The policies for focused crawlers require to estimate whether an outgoing link is

promising or not [3,4,5]. Usually link selection is based on the analysis of the anchor text. However,

the position of the link can improve the selection accuracy: links that belong to menus or navigation

bars can be less important than links in the center of the page; links that are surrounded by “more”

text are probably more important to the topic than links positioned in groups; in any case, groups of

links can identify “hub” areas on the page.

Finally, many tricks have been used recently for cheating search engines in order to gain visibility

on the Web. A widely used technique consists in inserting irrelevant keywords into the HTML

source. While it is relatively easy to detect and reject false keywords whose foreground color is

the same as the background color, there is no way to detect keywords of regular color but covered

with images. Moreover, topologically based rankings, such as the PageRank used in the Google search

engine [6], and indexing techniques based on the keywords contained in the anchor text of the links

to a given page can suffer from link spamming. The target page can gain positions in the ranking list

by constructing a “promoting” Web substructure which links the target page with fake on-topic

anchors and an ad hoc topology.

Visual page analysis is also motivated by the common design patterns of Web pages which result

from usability studies. Users expect to find certain objects in predefined areas on the browser screen

[7]. Figure 1 shows the areas where users expect to find internal and external links. These simple

observations allowed us to define a set of heuristics for the recognition of specific page areas

(menus, header, footer and “center” of a page).


Figure 1: User expectations (in percent) concerning the positions of internal (left) and external (right) links in the browser window [7]. It is clear that menus are expected to lie either within the left or the right margin of a page.

We used the visual layout analysis and the classification of the page areas into functional

components to build a modular classifier for page categorization. The visual partition of the page

into components produces subsets of features which are processed by different Naïve Bayes

classifiers. The outputs of the classifiers are combined to obtain the class score for the page. We

used a feed-forward two-layer neural net to estimate the optimal weights to be used in the mixture.

The resulting classification system shows a significantly better accuracy than a single Naïve Bayes classifier

processing the page using a “bag-of-words” representation.

The paper is organized as follows. Section 2 defines the M-Tree representation of an HTML

document used to render the page on the virtual screen, i.e. to obtain the screen coordinates of

every HTML object. Section 3 describes the heuristics for the recognition of some functional

components (header, footer, left and right menu, and “center” of the page). In Section 4, the

experimental results for component recognition on a predefined dataset are reported. In Section 5,

we present the page classification system based on visual analysis. In Section 6, the experimental

results for the classification of HTML pages in a predefined dataset are shown. Finally, the

conclusions are drawn in Section 7.


2. Extracting visual information from an HTML source

We introduce a virtual screen (VS) that defines a coordinate system for specifying the positions

of HTML objects inside pages. The VS is a rectangle with a predefined width measured in pixels.

The VS width is set to 1000 pixels, which corresponds to the page display area in a maximized browser window on a standard monitor with a resolution of 1024x768 pixels. Of course, one can choose any other resolution, but this one is compatible with most common Web page design patterns. The actual VS

height depends on the page at hand. The top left corner of the VS represents the origin of the VS

coordinate system.

Figure 2: Constructing the M-Tree (main steps): the PARSER transforms the PAGE into a ...(TE,DE),(TE,DE),... sequence, the TREE BUILDER builds the m-Tree from it, and the RENDERING MODULE (second and third pass) produces the final M-Tree.

Figure 2 shows the steps of the visual analysis algorithm. In the first step the page is parsed using

an HTML parser that extracts two different types of elements – tags and data. Tag elements (TE)

are delimited by the characters “<” and “>”, while data elements (DE) are contained between two

consecutive tags. Each TE element is labeled by the name of the corresponding tag and a list of

attribute-value pairs. A DE element is represented as a list of text tokens (words), which are

extracted using the space characters as separators. A DE contains the empty list if no token is

placed between the two consecutive tags. The parser skips the <SCRIPT> and <!--…> tags.
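As an illustration of this first step, the following minimal sketch (ours, not the authors' parser) uses Python's standard html.parser module to emit a (TE, DE) sequence, skipping <SCRIPT> content; comments are ignored because handle_comment is left at its default no-op, and for brevity only start tags are recorded as TEs.

```python
from html.parser import HTMLParser

class TagDataExtractor(HTMLParser):
    """Collect (TE, DE) pairs: each start tag with its attributes, followed by
    the list of text tokens found before the next tag (a possibly empty DE)."""

    def __init__(self):
        super().__init__()
        self.sequence = []        # list of {"tag", "attrs", "tokens"} dicts
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            return
        # TE: tag name plus attribute-value pairs; its DE starts as an empty token list
        self.sequence.append({"tag": tag, "attrs": dict(attrs), "tokens": []})

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # DE: tokens between two consecutive tags, split on whitespace
        if not self._in_script and self.sequence:
            self.sequence[-1]["tokens"].extend(data.split())

parser = TagDataExtractor()
parser.feed('<p align="center">Hello <b>world</b></p><script>var x = 1;</script>')
print(parser.sequence)
```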


In the second step, the (TE,DE) sequence which is extracted from the HTML code is processed

by a tree builder. The output of this module is a structure we named m-Tree. Many different

algorithms to construct the parsing tree of an HTML page are described in the literature [8,9]. We

adopted a single pass technique based on a set of rules which allow us to properly nest TEs into the

hierarchy according to the HTML 4.01 specification [10]. Additional effort was made to design a tree builder that is robust to malformed HTML sources.

Definition 1: An m-Tree (mT) is a directed n-ary tree defined by a set of nodes N and a set of edges E with the following characteristics:

1. N = Ndesc ∪ Ncont ∪ Ndata

where:

- Ndesc (description nodes) is the set of nodes which correspond to the TEs of the HTML tags

{<TITLE>, <META>};

- Ncont (container nodes) is the set of nodes which correspond to the TEs of the HTML tags

{<TABLE>, <CAPTION>, <TH>, <TD>, <TR>, <P>, <CENTER>, <DIV>,

<BLOCKQUOTE>, <ADDRESS>, <PRE>, <H1>, <H2>, <H3>, <H4>, <H5>, <H6>,

<OL>, <UL>, <LI>, <MENU>, <DIR>, <DL>, <DT>, <DD>, <A>, <IMG>, <BR>,

<HR>} ;

- Ndata (data nodes) is the set of nodes which correspond to the DEs.

Each node n∈ N has the following attributes: name is the name of the corresponding tag except

for the nodes in Ndata where name = “TEXT”; attval is a list of attribute-value pairs extracted

from the corresponding tag and it can be empty (e.g. for the nodes in Ndata). Additionally, each

node in Ndata has four more attributes: value which contains tokens from the corresponding DE;

fsize which describes the font size of the tokens; emph which defines the text appearance as


specified by the tags {<B>, <I>, <U>, <STRONG>, <EM>, <SMALL>, <BIG>}; align

which describes the alignment of the text (left, right or centered). In the following, we denote a

node n which corresponds to tag X as n<X>.

2. The root of mT, nROOT ∈ Ncont, represents the whole page and its name is set to “ROOT” while

attval contains only the pair (“URL”, url of the page).

3. The set of the edges E = {(nx , ny) | nx , ny ∈ N } contains:

a) (nROOT, ndesc), ∀ ndesc ∈ Ndesc

b) (ncont1, ncont2), ∀ ncont1 ∈ Ncont \ {n<IMG>}, ∀ ncont2 ∈ Ncont \ {nROOT} iff ncont2 belongs to the

context of ncont1 according to the nesting rules of the HTML 4.01 specification;

c) (ncont, ndata), ∀ ncont ∈ Ncont \ {n<IMG>},∀ ndata ∈ Ndata iff the node ndata belongs to the

context of the node ncont.

From the definition it follows that image and text nodes can only be leaves in an mT. Figure 3

shows an example of a simple page and its corresponding mT.

Figure 3: An HTML source (right) and its mT (left).
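A minimal sketch of a node structure able to hold the attributes listed in Definition 1 (and, later, the coordinates introduced in Definition 2) might look as follows; the class and field names are ours, not taken from the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MTreeNode:
    name: str                                                     # tag name, or "TEXT" for data nodes
    attval: List[Tuple[str, str]] = field(default_factory=list)   # attribute-value pairs (may be empty)
    children: List["MTreeNode"] = field(default_factory=list)
    # extra attributes used only by data nodes (name == "TEXT")
    value: List[str] = field(default_factory=list)                # text tokens of the DE
    fsize: Optional[int] = None                                   # font size of the tokens
    emph: Optional[str] = None                                    # e.g. "B", "I", "STRONG", ...
    align: str = "left"                                           # "left", "right" or "center"
    # filled later by the rendering module (M-Tree extension, Definition 2)
    X: List[int] = field(default_factory=list)                    # x coordinates of the object polygon
    Y: List[int] = field(default_factory=list)                    # y coordinates of the object polygon

# the root represents the whole page
root = MTreeNode(name="ROOT", attval=[("URL", "http://example.org/")])
root.children.append(MTreeNode(name="P", children=[
    MTreeNode(name="TEXT", value=["Hello", "world"], fsize=12)]))
```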

The coordinates of every object of interest are computed by the rendering module using the mT.

We followed some recommendations from W3C [10] and the visual behavior of one of the most


popular browsers (Microsoft Internet Explorer) to design the renderer. In order to simplify and speed up the rendering process, we adopted some simplifications which do not significantly affect the page representation for our specific task. The simplifications are the following:

1) The rendering module (RM) calculates only coordinates for the nodes in mT, i.e. some HTML

tags are not considered.

2) The RM does not support layered HTML documents.

3) The RM does not support frames.

4) The RM does not support style sheets.

The rendering module produces the final representation of a page as an M-Tree (MT), which extends the concept of mT by incorporating the VS coordinates for each node n ∈ N \ Ndesc.

Definition 2: An MT is the extension of an mT in which ∀n ∈ N \ Ndesc there are two additional attributes, X and Y. These are arrays which contain the x and y coordinates of the corresponding object polygon on the VS, with the following characteristics:

1. If n ∈ Ncont \ {n<A>} then it is assumed that the corresponding object occupies a rectangular area

on the VS. Thus X and Y have dimension 4. The margins of the rectangle are:

- the bottom margin is equal to the top margin of the left sibling node if it exists. If n does not

have a left sibling or n = n<TD> , then the bottom margin is equal to the bottom margin of its

parent node. If n = nROOT then the bottom margin is the x-axis of the VS coordinate system.

- the top margin is equal to the top margin of the rightmost leaf node of the sub-tree having n as

the root node.


- the left margin is equal to the left margin of the parent node of n, shifted to the right by a

correction factor. This factor depends on the name of the node (e.g. if name = “LI” this factor

is set to 5 times the current font width because of the indentation of list items). If n = n<TD> and

n has a left sibling then the left margin is equal to the right margin of this left sibling. If n = nROOT then the left margin is the y-axis of the VS coordinate system.

- the right margin is equal to the right margin of the parent of node n. If n = n<TABLE> or n =

n<TD> then the right margin is set to correspond to the table/cell width.

2. If n ∈ Ndata or n = n<A> then X and Y can have dimensions from 4 to 8 depending on the area on

the VS occupied by the corresponding text/link (see Figure 4). Coordinates are calculated

considering the number of characters contained in the value attribute and the current font width.

Text flow is restricted to the right margin of the parent node and then a new line is started. The

height of the line is determined by the current font height.

Figure 4: Some examples of TEXT polygons: text occupying one, two, or three lines on the VS is described by 4, 6, or 8 corner coordinates, respectively.
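As a rough illustration of how the character count, the assumed font metrics, and the parent's margins determine the occupied area, the sketch below approximates a TEXT node by one rectangle per rendered line instead of the single 4-8 corner polygon; all names and the default font metrics are assumptions of ours, not values from the paper.

```python
def text_line_boxes(tokens, x_start, y_start, left, right, font_w=8, font_h=16):
    """Approximate the area occupied by a TEXT node as one rectangle per line.

    tokens      : list of words from the node's `value` attribute
    x_start     : x coordinate where the text begins (may be greater than `left`)
    left, right : left and right margins of the parent container (pixels)
    font_w, font_h : assumed average character width and line height (pixels)
    """
    boxes = []
    x, y = x_start, y_start
    line_start_x = x
    for word in tokens:
        w = (len(word) + 1) * font_w          # word width plus a trailing space
        if x + w > right and x > left:        # word does not fit: close the line and wrap
            boxes.append((line_start_x, y, x, y + font_h))
            x, y = left, y + font_h
            line_start_x = x
        x += w
    boxes.append((line_start_x, y, x, y + font_h))  # last (possibly partial) line
    return boxes

# e.g. a short text starting at x=300 inside a container spanning [100, 500]
print(text_line_boxes("this is the second example".split(), 300, 0, 100, 500))
```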

The definition of the M-Tree covers most of the aspects of the rendering process, but not all of

them because of the complexity of the process. For example, if the page contains tables then the RM

implements a modified auto-layout algorithm [11] for calculating table/column/cell widths. When a

n<TABLE> node is encountered, the RM goes down the mT to calculate the cell/column/table widths

to determine the table width. If there are other n<TABLE> nodes down on the path (nesting of tables)

the width computation is performed recursively. Before resolving a table, artificial cells (nodes) are


inserted in order to simplify the cases where cell spanning is present (colspan and rowspan attributes

in the tag <TD>).
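The insertion of artificial cells can be illustrated by the following sketch, which expands colspan/rowspan into an explicit grid; the cell dictionary format is an assumption of ours, since the RM's internal representation is not given in the paper.

```python
def expand_spans(rows):
    """Expand colspan/rowspan into an explicit grid by inserting artificial
    cells (extra references to the spanning cell) at every covered position.

    rows: list of table rows, each row a list of dicts such as
          {"node": td_node, "colspan": 2, "rowspan": 1}.
    """
    grid = [[] for _ in rows]
    pending = {}                                # (row, col) -> cell spanning from a row above
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in pending:            # columns already covered by a rowspan
                grid[r].append(pending.pop((r, c)))
                c += 1
            for dc in range(cell.get("colspan", 1)):
                grid[r].append(cell)            # artificial copy for each spanned column
                for dr in range(1, cell.get("rowspan", 1)):
                    pending[(r + dr, c)] = cell # reserve the positions below
                c += 1
        while (r, c) in pending:                # trailing columns covered from above
            grid[r].append(pending.pop((r, c)))
            c += 1
    return grid

# e.g. a table whose first cell spans two rows:
rows = [[{"node": "A", "rowspan": 2}, {"node": "B"}, {"node": "C"}],
        [{"node": "D"}, {"node": "E"}]]
print([[c["node"] for c in row] for row in expand_spans(rows)])  # [['A','B','C'], ['A','D','E']]
```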

3. Defining heuristics for the recognition of the page components

Given the MT of a page and assuming the common web page design patterns, it is possible to

define a set of heuristics for the recognition of standard components of a page, such as menus or

footers. We considered five different types of components: header (H), footer (F), left menu (LM),

right menu (RM), and center of the page (C). We focused our attention on these particular

components because they are frequently found in Web pages regardless of the page topic. We

adopted intuitive definitions for each class, which rely exclusively on the VS coordinates of logical

groups of objects in the page.

After a careful examination of many different Web pages, we restricted the areas in which

components of type H, F, LM, RM, and C can be found. We introduced a specific partition of a

page into locations, as shown in Figure 5.

Figure 5: Partition of the page into locations: vertical strips of width W1 (left) and W2 (right) define the LM and RM locations, horizontal strips of height H1 (from the start of the page) and H2 (from the end of the page) define the H and F locations, and the remaining area is C.


We set W1 = W2 to be 30% of the page width in pixels determined by the rightmost margin

among the nodes in MT. W1 (W2) defines the location LM (RM) where the LM (RM) components

can be exclusively found. We set H1 = 200 pixels and H2 = 150 pixels. H1 and H2 define H and F

respectively which are the locations where the components H and F can be exclusively found. These

values were found by a statistical analysis on a sample of Web pages. Components are recognized

using the following heuristics:

Heuristic 1: H consists of the nodes in each sub-tree S of MT whose root rS lies in H and satisfies

one or more of the following conditions:

1. rS is of type n<TABLE> and the corresponding table lies completely in H (i.e. the upper margin of the

table is less than or equal to H1).

2. the upper margin of the object associated to rS is less than or equal to the maximum upper bound

of all the n<TABLE> nodes which satisfy condition 1 and rS is not in a sub-tree satisfying condition

1.

Heuristic 2: LM consists of the nodes in each sub-tree S of MT whose root rS lies in LM, is not contained in H, and satisfies one or more of the following conditions:

1. rS is of type n<TABLE> and n<TABLE> lies completely in LM (i.e. the right bound of the table is less

than or equal to W1).

2. rS is of type n<TD>, it completely lies in LM, and the node n<TABLE> to which it belongs has a lower

bound less than or equal to H1 and the upper bound greater than or equal to H2.


Heuristic 3: RM consists of the nodes in each sub-tree S of MT whose root rS lies in RM, is not contained in H or LM, and satisfies one or more of the two conditions obtained from the conditions of Heuristic 2 by substituting LM with RM and W1 with W2.

Heuristic 4: F consists of the nodes in each sub-tree S of MT whose root rS lies in F, is not contained in H, LM, or RM, and satisfies one or more of the following conditions:

1. rS is of type n<TABLE> and the node lies completely in F (i.e. the bottom margin of the table is

greater than or equal to H2).

2. the lower bound of rS is greater than or equal to the maximum lower bound of all the n<TABLE>

nodes which satisfy condition 1 and rS does not belong to any sub-tree satisfying condition 1.

3. the lower bound of rS is greater than or equal to the upper bound of the lowest of all nodes n

completely contained in F, where n ∈ {n<BR>, n<HR>} or n is in the scope of the central text

alignment, and rS does not belong to any sub-tree satisfying one of the two previous conditions.

Heuristic 5: C consists of all nodes in MT that are not in H, LM, RM, and F.

These heuristics are strongly dependent on table objects. In fact, tables are commonly used (≈

88% of the cases) to organize the layout of the page and the alignment of other objects. Thus pages

usually include a lot of tables and every table cell often represents a small amount of logically

grouped information. Often tables are used to group menu objects, footers, search and input forms,

and text areas.
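To make the style of these rules concrete, here is a rough sketch of Heuristic 1; it assumes that each MT node exposes a tag name, the y coordinate of its upper margin on the VS, and its children (the attribute and function names are ours, and the other heuristics follow the same pattern):

```python
H1 = 200  # vertical extent of the header location H, in pixels (Section 3)

def header_subtrees(root):
    """Sketch of Heuristic 1: return the sub-tree roots labelled as header (H).

    Each node is assumed to expose .name, .upper (the y coordinate of its
    upper margin on the VS) and .children; these attribute names are ours."""
    selected, table_bounds = [], []

    def pass_one(node):                       # condition 1: tables lying in H
        if node.name == "TABLE" and node.upper <= H1:
            selected.append(node)
            table_bounds.append(node.upper)
            return                            # the whole sub-tree is covered by this table
        for child in node.children:
            pass_one(child)

    def pass_two(node):                       # condition 2: other objects above the bound
        if node in selected:                  # skip sub-trees already selected by condition 1
            return
        if node.name != "ROOT" and node.upper <= bound:
            selected.append(node)
            return
        for child in node.children:
            pass_two(child)

    pass_one(root)
    bound = max(table_bounds, default=0)      # maximum upper bound of condition-1 tables
    pass_two(root)
    return selected
```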


4. Experimental results for component recognition

We constructed a complete dataset by downloading nearly 1000 files from the first level of each

root category of the open directory www.dmoz.org. The total number of pages in the

dataset was about 16000. In order to test the performance of the component recognition algorithm,

we extracted a subset D of the complete dataset by randomly choosing 515 files, uniformly

distributed among the categories and the file sizes. Two experts collaboratively labeled the areas in

the pages that could be considered of type H, F, LM, RM, and C. The areas of the pages in D were

automatically labeled by using the Siena Tree 1 tool that includes an MT builder and the logic for

applying the component recognition heuristics. In these experiments, we selected values for the

margins H1, H2, W1, and W2 according to the statistics from [7]. The performance of the region

classifier was evaluated by comparing the labels assigned by the experts and the labels produced

automatically by the recognition algorithm. The results are shown in Table 1.

                  Header   Footer   Left M   Right M   Overall
Not recognized      25       13        6        5         3
Bad                 16       17       15       14        24
Good                10       15        3        2        50
Excellent           49       55       76       79        23

Table 1: Recognition rate (in %) for different levels of accuracy. The "Good" and "Excellent" rows represent successful recognition.

1 Siena Tree is written in Java 1.3 and can be used to visualize objects of interest from a web page. One can enter any sequence of HTML tags to obtain the picture (visualization) of their positions. To obtain a demo version contact [email protected]


The possible outcomes of the recognition process were summarized in 4 different categories. A

component X is “not recognized” if the component X exists but is not detected, or if X does not exist but some part of the page is labeled as X. If less than 50% of the objects in X are labeled, or if some objects out of X are labeled too, then the recognition is considered “bad”. A “good”

recognition is obtained if more than 50% but less than 90% of the objects in X are labeled and no

objects out of X are labeled. Finally, the recognition is considered “excellent” if more than 90% of

the objects from X and no objects out of X are labeled.

The recognition of components of type C is the complement of the recognition of the other areas

according to heuristic 5. So we did not include it in performance measurements. The column

“overall” is obtained by introducing a total score S for a given page as the sum of the scores

assigned for the recognition of all areas of interest. If X∈ {H, F, LM, RM} is “not recognized” then

the corresponding score is 0. Categories “bad”, “good”, and “excellent” are mapped to the score

values 1, 2, and 3, respectively. Hence, if S = 12 we considered the overall recognition for a particular page as “excellent”. Similarly, “good” refers to the case 8 ≤ S < 12, “bad” stands for the case 4 ≤ S < 8, and “not recognized” represents the cases in which S < 4.
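The mapping just described can be summarised by a small helper (a sketch, with our own names):

```python
SCORE = {"not recognized": 0, "bad": 1, "good": 2, "excellent": 3}

def overall_category(labels):
    """labels: dict mapping each area in {H, F, LM, RM} to its per-area outcome."""
    s = sum(SCORE[labels[area]] for area in ("H", "F", "LM", "RM"))
    if s == 12:
        return "excellent"
    if 8 <= s < 12:
        return "good"
    if 4 <= s < 8:
        return "bad"
    return "not recognized"

# e.g. three well-recognized areas and one bad one give a total score of 9, i.e. "good"
print(overall_category({"H": "excellent", "F": "good", "LM": "excellent", "RM": "bad"}))
```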

By analyzing pages for which the overall recognition was “bad” or “not recognized”, we found

that in nearly 20% of the cases, the MT was not completely correct because of the approximations

used in the rendering process. A typical error is that the single subparts of the page are internally

well rendered, but they are scrambled as a whole. For the remaining 80% of the cases, the bad

performance was due to the heuristic rules used in the classifier. We are currently investigating the

possibility of writing more accurate rules.


5. Page classification using visual information

The rendering module provides an enhanced document representation, which can be used in all

the tasks (e.g. page ranking, crawling, document clustering and classification) where it is important

to preserve the complex structure of a Web page which is not captured by the traditional bag-of-

words representation. In this paper, we have performed some document classification experiments

using a MT representation for each page. In this section, we first describe the employed feature

selection algorithm. Then, we describe the architecture of a novel classification system dealing with

visual information. In particular, the proposed system is composed of a pool of Naïve Bayes classifiers whose outputs are combined by a neural network (NN).

5.1. Feature selection process

Let D be a set of pages and d ∈ D be a generic document in the collection. We assume that each

document in D belongs to one of n mutually exclusive categories. We indicate with ci the set

containing all documents belonging to the ith category. In order to use any classification technique, d

must be represented as a feature vector relative to a D-specific vocabulary V. A common approach

is to construct a feature vector v representing d that has |V| coordinates, the ith entry of v is equal to

1 iff the ith feature from V is in d, and equal to 0 otherwise. In order to reduce the dimensionality of

the feature space and thus to improve classification accuracy, numerous techniques were developed

[13]. Our feature selection methodology starts by dividing each page into 6 parts: header, footer, left menu, right menu, central part, and a set of features from meta tags such as the <TITLE> tag. When this

procedure is applied on a dataset of documents, 6 datasets are obtained where each one contains

only a portion of the initial documents. For each dataset, we construct a vocabulary of features and

we compute the Information Gain for each feature. In the reduced document representation we

want to maintain the same percentage of features belonging to each specific region. For example,

suppose that a page contains 100 different features and that we want to reduce the page to contain


only 50 features. If the header, footer, left menu, right menu, central part and title contain respectively 20, 6, 20, 4, 40 and 10 features, then we pick the 20·50/100 = 10 most informative features (the features providing the highest information gain) from the header, 3 from the footer, and so on.
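A possible sketch of this proportional selection, assuming per-region feature lists and precomputed Information Gain scores (the data structures and names are illustrative, not the authors' code):

```python
def select_features(region_features, ig_scores, target_size):
    """Keep `target_size` features overall, preserving each region's share.

    region_features : dict region -> list of features found in that region
    ig_scores       : dict region -> dict feature -> information gain
    """
    total = sum(len(feats) for feats in region_features.values())
    selected = {}
    for region, feats in region_features.items():
        quota = round(len(feats) * target_size / total)           # proportional share
        ranked = sorted(feats, key=lambda f: ig_scores[region][f], reverse=True)
        selected[region] = ranked[:quota]                         # most informative features
    return selected
```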

5.2. Classification system architecture

We decided to use the Naïve Bayes (NB) classification technique for classifying each element of the

page. The Naïve Bayes classifier [11] is the simplest instance of a probabilistic classifier. The output

p(c|d) of a probabilistic classifier is the probability that a pattern d belongs to a class c after

observing the data d (posterior probability). It assumes that text data comes from a set of parametric

models (each model is associated with a class). Training data are used to estimate the unknown

model parameters. During the operative phase, the classifier computes (for each model) the

probability p(d|c) expressing the probability that the document is generated using the model. The

Bayes theorem allows the inversion of the generative model and the computation of the posterior

probabilities (probability that the model generated the pattern). The final classification is performed

selecting the model yielding the maximum posterior probability. In spite of its simplicity, a Naïve

Bayes classifier is almost as accurate as state-of-the-art learning algorithms for text categorization

tasks [12]. The Naïve Bayes classifier is the most widely used classifier in many different Web applications such as focused crawling, recommender systems, etc. For all these reasons, we selected this classifier to measure the accuracy improvement provided by taking visual information into account.


Figure 6: Page representation used as input for the 6 Naïve Bayes classifiers: the MT of page P is split, after visualisation and recognition, into the header, footer, left menu, right menu, center, and title + meta-tags parts.

A Naïve Bayes classifier computes the conditional probability P(cj|d) that a given document d belongs to class cj. According to the Bayes theorem, such probability is:

P(cj | d) = P(cj) P(d | cj) / P(d)    (3)

The Naïve Bayes assumption states that features are independent of each other, given a category

cj:

P(cj | d) ∝ P(cj) ∏_{i=1}^{|v|} P(wi | cj)    (4)

where wi ∈ V is the ith feature from v. In the learning phase we estimate P(cj) and P(wk|cj) from the

set of training examples. Estimates are given through the following equations:

P(cj) = (1 + Σ_{i=1}^{|D|} yij) / (n + |D|)    (5)

P(wk | cj) = (1 + Σ_{i=1}^{|D|} yij N(wk, di)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} yij N(ws, di))    (6)


where yij =1 iff di ∈ cj, else yij =0. N(w,di) is equal to the frequency of the feature w in di. In (5) and

(6) we used Laplace smoothing to prevent zero probabilities for infrequently occurring features.
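For concreteness, a compact sketch of the training estimates (5) and (6) with Laplace smoothing could be written as follows (our own helper, not the authors' code):

```python
from collections import Counter

def train_naive_bayes(docs, labels, vocabulary, classes):
    """docs: list of token lists; labels: class of each document.
    Returns the priors P(cj) and the conditional probabilities P(wk|cj),
    both Laplace-smoothed as in equations (5) and (6)."""
    D, n, V = len(docs), len(classes), len(vocabulary)
    # equation (5): smoothed class priors
    prior = {c: (1 + sum(1 for l in labels if l == c)) / (n + D) for c in classes}

    cond = {}
    for c in classes:
        counts = Counter()
        for doc, l in zip(docs, labels):
            if l == c:
                counts.update(w for w in doc if w in vocabulary)   # sum of N(w, di) over class c
        total = sum(counts.values())
        # equation (6): smoothed word probabilities for class c
        cond[c] = {w: (1 + counts[w]) / (V + total) for w in vocabulary}
    return prior, cond
```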

After splitting a page into 6 constituents, each constituent is classified by a NB classifier.

Therefore, 6 distinct class membership estimates are obtained for each class, yielding a vector of 6·n

probability components for each processed page (n is the number of distinct classes in the dataset).

In a first prototype, we decided to calculate a linear combination of probability estimates for each

class and then to assign the final class label c* to the class maximising that combination:

c* = argmax_cj Σ_{i=1}^{6} ki P(cj | di)    (7)

The probabilities P(cj | di) are the outputs of the ith classifier, and di denotes the ith constituent of page d. In our experiments, i = 1, 2, 3, 4, 5, 6 corresponds to the left menu, right menu, header, footer, center, and meta-tags, respectively. The weight ki (with 0 ≤ ki ≤ 1 for i = 1, ..., 6 and Σ_{i=1}^{6} ki = 1)

denotes the influence of the i-th classifier in the final decision. Such influence should be proportional

to the expected relevance of the information stored in the di part of the page. In particular, after

some tuning, we have assigned the following weights to each classifier: header 0.1, footer 0.01, left

menu 0.05, right menu 0.04, center 0.5, title and meta-tags 0.3. The results of classification will be

shown in Section 6.
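Decision rule (7) with the weights above can be sketched as follows (the region names and data structures are ours):

```python
K = {"header": 0.1, "footer": 0.01, "left_menu": 0.05,
     "right_menu": 0.04, "center": 0.5, "meta": 0.3}       # manually tuned weights, sum to 1

def classify(region_posteriors, classes):
    """region_posteriors: dict region -> dict class -> P(cj | di).
    Returns the class maximising the weighted sum of the 6 posteriors (equation 7)."""
    score = {c: sum(K[r] * region_posteriors[r][c] for r in K) for c in classes}
    return max(score, key=score.get)
```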

In a second prototype we employed an artificial neural network to estimate the optimal weights

of the mixture. Such an approach allows the system to automatically adapt to different datasets

without any manual tuning of the parameters.


Figure 7: Architecture of a classification system based on a visual representation of a page: the page goes through the M-Tree builder, the area recognizer, and the feature selection unit; six Naive Bayes classifiers (H, F, LM, RM, C, M) each produce n outputs, which the neural net combines into the final class probabilities.

We used a two layer feed-forward neural net with sigmoidal units, trained with the

Backpropagation learning algorithm [15]. The basic idea of Backpropagation learning is to

implement a gradient descent search through the space of possible network weights wi, iteratively

reducing the error between the training example target values and the network outputs. The 6·n dimensional output of the Naive Bayes classifiers is the input of the network. In our experimental setup,

the neural network featured 20 hidden units and 15 output units, one per class. Each network

output represents the probability of belonging to the corresponding class. The ordinal number of

the output with maximum value corresponds to the winning class. The learning rate and the

momentum term were both set equal to 0.5. A validation set was used to learn the network

parameters.
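A minimal sketch of the mixture network described above, i.e. a two-layer feed-forward net with sigmoidal units taking the 6·n Naïve Bayes outputs as input and producing one output per class (the weight initialisation is arbitrary and the Backpropagation training loop is omitted):

```python
import numpy as np

n_classes, n_regions, n_hidden = 15, 6, 20
rng = np.random.default_rng(0)

# weight matrices of the two-layer feed-forward net (to be learned with Backpropagation)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_regions * n_classes))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_classes, n_hidden))
b2 = np.zeros(n_classes)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mixture(nb_outputs):
    """nb_outputs: array of shape (6 * n_classes,) holding the posteriors of the
    six Naive Bayes classifiers; returns the per-class scores of the network."""
    hidden = sigmoid(W1 @ nb_outputs + b1)
    return sigmoid(W2 @ hidden + b2)

scores = mixture(rng.uniform(size=n_regions * n_classes))
predicted_class = int(np.argmax(scores))      # ordinal number of the winning output
```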


6. Classification results

At the time of writing, there was no dataset of Web pages commonly accepted as a standard reference for classification tasks. Probably the best known dataset is WebKB2.

After some inspection of the pages in this dataset we concluded that the dataset is not appropriate

because its pages have a very simple layout without complex HTML structures. This dataset originates

from January 1997 and collected pages only from educational domains. Educational pages often feature a much simpler structure than commercial pages, which moreover cover more than 80% of the Web [16]. Since 1997, Web design has evolved significantly; nowadays, many popular

software packages allow a fast and easy design of complex Web pages. Often real pages are very

complex in their presentational concept (just look at the CNN or BBC home pages and compare them

with the pages from the WebKB dataset). Thus, we decided to create our own dataset. After

extracting all the URLs provided by the first 5 levels of the DMOZ topic taxonomy, we selected the

15 topics at the first level of the hierarchy (the first level topic “Regional” was discarded since it

mostly contains non-English documents). Each URL has been associated with the class (topic) from

which it has been discovered. Finally, all classes have been randomly pruned, keeping only 1000

URLs for each class. Using a Web crawler, we downloaded all the documents associated to the

URLs. Many links were broken (server down or pages not available anymore), thus only about

10,000 pages could be effectively retrieved (an average of 668 pages for each class). These pages

have been used to create the dataset. Such a dataset3 can be easily replicated, enlarged and updated (the continuous change of Web formats and styles does not allow employing a frozen dataset, since after a few months it would no longer be representative of the real documents that can be found on the

Internet).

2 Can be downloaded from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
3 The dataset can be downloaded from http://nautilus.dii.unisi.it/download/webDataset.tar.gz


The dataset was split into a training set, a test set and a validation set, the latter used for training the neural network. In order to compare the results provided by our method, we used a standard NB classifier (as described in Section 5) dealing with all the words of a page, filtered by a feature selection method. In particular, we employed a feature selection algorithm based either on TF-IDF or on Information Gain. The results are shown in Figure 8.

Figure 8 (two panels): "Classifier performance (100 training examples per category)" and "Classifier performance (200 training examples per category)"; x-axis: number of features (25 to 125); y-axis: percentage of correctly classified pages (0 to 70%); curves: NN, LM, IG stop-list, IG, and TF-IDF.

Figure 8: Classifier performance: NN and LM indicate the results of the methods taking into account visual information, with respectively a neural network and a linear combination performing the final mixture. IG indicates the result of a NB classifier dealing with the entire page (no visual information is considered), filtered using the Information Gain of words. IG stop-list also uses a stop-list of features. Finally, TF-IDF indicates the results of a NB classifier whose input is filtered by a TF-IDF based algorithm. Visual-based methods clearly outperform methods based on a non-visual representation.

Feature selection based on TF-IDF is worse when compared to Information Gain. Classification

accuracy can be improved using feature selection based on Information Gain in combination with a

stop-list, containing common terms such as articles, pronouns etc. However, it is clear that methods

using visual information clearly outperform methods purely based on text analysis (the improvement was around 10% even where the classical methods reach their maximum). Increasing the number of

training examples from 100 to 200 per group helps in improving the performance for all the

methods. An interesting effect is that if we increase the number of features beyond some limit (around 75), the precision of the classic approaches decreases. Our methods are more robust to that phenomenon, since noisy features are likely to lie in marginal areas of the page and are then filtered out by the final mixture.

Even if the manually tuned mixture provides slightly better performance than the mixture computed by the neural network, the difference is always less than 2%, clearly showing that the network estimates the intrinsic importance of each region well.

7. Conclusion

This paper describes a possible representation for a Web page, called M-Tree, in which objects are placed into a well-defined tree hierarchy according to where they belong in the HTML structure of the page. Further, each object (node of the M-Tree) carries information about its position in a browser window. This visual information enables us to define heuristics for the recognition of common areas such as the header, footer, left and right menus, and center of a page. The crucial difficulty was to develop a sufficiently good rendering algorithm, i.e. to imitate the behavior of popular user agents such as Internet Explorer.

Unfortunately, the HTML source often does not conform to the standard, posing additional problems in the rendering process. After applying some techniques for error recovery in the construction of the parsing tree and introducing some rendering simplifications (we do not deal with frames, layers and style sheets), we defined recognition heuristics based only on visual information. The overall accuracy in recognizing the targeted areas was around 73%. In the future we plan to improve the rendering process

and the recognition heuristics. We also plan to recognize logical groups of objects that are not necessarily in the same area, such as figure captions, titles and subtitles in text, commercial add-ins, etc.

Our experimental results provide evidence to claim that spatial information is of crucial

importance to classify Web documents. Classification accuracy of a Naive Bayes classifier was

increased by more than 10% when taking into account the visual information. In particular, we constructed a mixture of classifiers, each one trained to recognize words appearing in a specific portion of the page. In the future, we plan to use our system to improve link selection in focused crawling sessions [4] by estimating the importance of hyperlinks using their position and neighborhood.

However, we believe that our visual page representation can find its application in many other areas

related to search engines, information retrieval and data mining from the Web.

Acknowledgements

We would like to thank Nicola Baldini for fruitful discussions on the Web page representations

adopted in the focuseek project (www.focuseek.com), which inspired some of the ideas proposed in

this paper.

References

[1] Quinlan, J.R., “Induction of decision trees”, Machine Learning, 1986, pp. 81-106.

[2] Salton, G., McGill, M.J., An Introduction to Modern Information Retrieval, McGraw-Hill,

1983.

[3] Chakrabarti S., van den Berg M., Dom B., “Focused crawling: A new approach to topic-specific

web resource discovery”, In Proceedings of the 8th Int. World Wide Web Conference, Toronto,

Canada, 1999.

[4] Diligenti M., Coetzee F., Lawrence S., Giles C., Gori M., “Focused crawling using context

graphs”, In Proceedings of the 26th Int. Conf. On Very Large Databases, Cairo, Egypt, 2000.


[5] Rennie J., McCallum A., “Using reinforcement learning to spider the web efficiently”, In

Proceedings of the Int. Conf. On Machine Learning, Bled, Slovenia, 1999.

[6] Brin S., Page L., “The anatomy of a large-scale hypertextual web search engine”, In

Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998,

vol.3, ACM Press.

[7] Bernard L.M., “Criteria for optimal web design (designing for usability)”,

http://psychology.wichita.edu/optimalweb/position.htm, 2001

[8] Embley D.W., Jiang Y.S., Ng Y.K., “Record-Boundary Discovery in Web Documents”, In

Proceedings of SIGMOD, Philadelphia, USA, 1999.

[9] Lim S. J., Ng Y. K., “Extracting Structures of HTML Documents Using a High-Level Stack

Machine”, In Proceedings of the 12th International Conference on Information Networking

ICOIN, Tokyo, Japan, 1998

[10] World Wide Web Consortium (W3C), “HTML 4.01 Specification”,

http://www.w3c.org/TR/html401/ , December 1999.

[11] James F., “Representing Structured Information in Audio Interfaces: A Framework for

Selecting Audio Marking Techniques to Represent Document Structures”, Ph.D. thesis,

Stanford University, available online at http://www-pcd.stanford.edu/frankie/thesis/, 2001.

[12] Mitchell T., “Machine Learning”, McGraw Hill, 1997.

[13] Sebastiani F., “Machine learning in automated text categorization”, ACM Computing Surveys,

34(1), 2002, pp. 1-47.

[14] Yang, Y., Pedersen J.P. “A Comparative Study on Feature Selection in Text Categorization”,

In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97),

1997, pp. 412-420.


[15] Chauvin Y., Rumelhart D., “Backpropagation: Theory, architectures, and applications”,

Hillsdale, NJ, Lawrence Erlbaum Assoc.

[16] Lawrence, S., Giles, L., “Accessibility of Information on the Web”, Nature, 400, pp. 107-109,

July 1999.