Learning to Construct Knowledge Bases from the World Wide Web
by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery
Igor Yakymenko, iy@cse.buffalo.edu, Department of Computer Science and Engineering, SUNY at Buffalo


Page 1:

Learning to Construct Knowledge Bases from the World Wide Web
by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery

Igor Yakymenko, iy@cse.buffalo.edu
Department of Computer Science and Engineering, SUNY at Buffalo

Page 2:

Fig. 1. An overview of the WebKB system

Page 3:

• Two of the entities were automatically extracted from the CMU computer science department Web site after the system was trained on four other university computer science sites. These entities were added to the knowledge base as new instances of faculty and project, extracted from Web hypertext.

Page 4:

Part II

• Pages 11-29
• Appendix B
• Appendix C

Page 5:

Learning to Recognize Class Instances

Task:
• To identify new instances of ontology classes from text sources on the Web.

Discussion:
• A statistical bag-of-words approach to classifying Web pages. This method is used with three different representations of pages.
• Learning first-order rules to classify Web pages.
• Evaluation of the effectiveness of combining the predictions made by all four of these classifiers.

Page 6:

Naive Bayes

Two common approaches:
• multi-variate Bernoulli model (binary word occurrence);
• multinomial model (integer word counts).

Given a set of classes C = {c1, ..., cn} and a document consisting of n words (w1, w2, ..., wn), we classify the document as a member of the class c* that is most probable given the words in the document:
• transform Pr(c|w1, ..., wn) by applying Bayes' rule;
• rewrite the expression using the product rule and drop the denominator;
• assume that words are independent of each other (see the reconstruction below).

Page 7:

Naive Bayes Classifier Limitations

1. Naive Bayes is not well suited to estimating the level of confidence of its predictions across classes.

2. The winning class tends to get a posterior probability near 1 (an artifact of the independence assumption).

3. The losing classes tend to get posterior probabilities close to 0.

The authors' proposal is to modify the existing formula to overcome these limitations.
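A minimal sketch (not from the slides) illustrating limitations 2 and 3: once many conditionally independent word factors are multiplied together, a tiny per-word difference between classes pushes the normalized posteriors to the extremes. The class names and probabilities below are made up for illustration.

import math

# Hypothetical per-word likelihoods for two classes; the small difference
# is amplified when it is multiplied over many words.
pr_w_given_a = 0.012
pr_w_given_b = 0.010
n_words = 300                      # length of the document
log_prior = math.log(0.5)          # equal priors

log_score_a = log_prior + n_words * math.log(pr_w_given_a)
log_score_b = log_prior + n_words * math.log(pr_w_given_b)

# Normalize in log space to obtain posteriors.
m = max(log_score_a, log_score_b)
za = math.exp(log_score_a - m)
zb = math.exp(log_score_b - m)
print("Pr(a|d) =", za / (za + zb))   # ~1.0
print("Pr(b|d) =", zb / (za + zb))   # ~0.0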

Page 8:

Modifications to Naive Bayes

Goal: scores that accurately reflect the uncertainty of each prediction and make it possible to sensibly compare the scores of multiple documents (a smooth function of confidence).

Begin with naive Bayes, rewrite the expression so that it ranges over all words in the vocabulary T instead of just the words in the document (B.1), take the log (B.2), and divide by the number of words in the document (B.3).
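A reconstruction of the three numbered expressions (the equation images are not in the transcript); here N(w_i|d) denotes the number of times word w_i occurs in document d:

(B.1)  \Pr(c) \prod_{w_i \in T} \Pr(w_i \mid c)^{N(w_i \mid d)}
(B.2)  \log \Pr(c) + \sum_{w_i \in T} N(w_i \mid d) \log \Pr(w_i \mid c)
(B.3)  \frac{1}{n} \left( \log \Pr(c) + \sum_{w_i \in T} N(w_i \mid d) \log \Pr(w_i \mid c) \right)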

Page 9:

Modifications to Naive Bayes (continued)

By substituting Pr(wi|d) for N(wi|d)/n, the authors derived the following formula:
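A reconstruction of that formula (the equation image is not in the transcript):

\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{w_i \in T} \Pr(w_i \mid d) \log \Pr(w_i \mid c)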

where:
• n – the number of words in the document;
• Pr(c) – the prior probability of class c;
• Pr(wi|d) – the probability (frequency) of word wi in document d;
• T – the whole vocabulary;
• Pr(wi|c) – the probability (frequency) of word wi in class c.

\Pr(w_i \mid d) = \frac{N(w_i \mid d)}{n}

Page 10:

Modifications to Naive Bayes (continued)

Subtracting the optimal encoding of the document gives the final formula for the score. The class with the largest score is the one predicted for the document:
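A reconstruction of the final formula (the equation image is not in the transcript); the subtracted "optimal encoding" term is \sum_{w_i \in T} \Pr(w_i \mid d) \log \Pr(w_i \mid d), which is the same for every class and therefore does not change the ranking:

\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{w_i \in T} \Pr(w_i \mid d) \log \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)}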

The summation on the right-hand side is the negative relative entropy (KL divergence) between the document's word distribution and the class's word distribution:

• a measure of how different two probability distributions are;
• the average number of bits that are "wasted" by encoding events from a distribution p with a code based on a not-quite-right distribution q.

D(p \| q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Here p corresponds to \Pr(w_i \mid d) and q to \Pr(w_i \mid c).

Page 11:

Naive Bayes Classifier (conclusion)

Approach: build a probabilistic model of each class using labeled training data, and then classify new pages by selecting the most probable class.

Given a document d to classify, a score is calculated for each class c; the class predicted by the method is the class with the greatest score (a sketch follows below):

• n is the number of words in d;
• T is the size of the vocabulary;
• wi is the ith word in the vocabulary;
• Pr(wi|c) is the probability that a randomly drawn word is wi, given class c;
• Pr(wi|d) is the proportion of word wi in document d.
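A minimal sketch (not from the slides) of how this score could be computed; priors and word_probs are hypothetical inputs assumed to be estimated from labeled training data, with smoothing already applied:

import math
from collections import Counter

def score(doc_words, cls, priors, word_probs):
    """Score_c(d) = log Pr(c)/n + sum over words of Pr(wi|d) * log(Pr(wi|c)/Pr(wi|d))."""
    n = len(doc_words)
    s = math.log(priors[cls]) / n
    for w, count in Counter(doc_words).items():
        if w not in word_probs[cls]:        # out-of-vocabulary words are ignored in this sketch
            continue
        pr_w_d = count / n                  # frequency of the word in the document
        pr_w_c = word_probs[cls][w]         # smoothed Pr(wi|c) estimated from training data
        s += pr_w_d * math.log(pr_w_c / pr_w_d)
    return s

def classify(doc_words, classes, priors, word_probs):
    # The predicted class is the one with the greatest score.
    return max(classes, key=lambda c: score(doc_words, c, priors, word_probs))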

Page 12:

Experimental Evaluation

Page 13:

Experimental Evaluation

Insight into the learned classifiers can be obtained by asking which words contribute most highly to the quantity Score_c(d) for each class:

• Most of the highly weighted words are intuitively prototypical for their class.
• Many words that conventionally appear on stop lists are highly weighted by the model and were therefore kept in the vocabulary.

Page 14:
Page 15:

Coverage – the percentage of pages of a given class that are correctly classified.
Accuracy – the percentage of pages classified into a given class that are actually members of that class.
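A minimal sketch (not from the slides) of how these two measures can be computed from true and predicted labels:

def coverage_and_accuracy(true_labels, predicted_labels, cls):
    """Coverage and accuracy of predictions for a single class `cls`.

    Coverage: fraction of pages truly in `cls` that were classified as `cls`.
    Accuracy: fraction of pages classified as `cls` that truly belong to `cls`.
    """
    preds_for_class = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
    truths_for_pred = [t for t, p in zip(true_labels, predicted_labels) if p == cls]
    coverage = sum(p == cls for p in preds_for_class) / len(preds_for_class) if preds_for_class else 0.0
    accuracy = sum(t == cls for t in truths_for_pred) / len(truths_for_pred) if truths_for_pred else 0.0
    return coverage, accuracy

# Example:
# coverage_and_accuracy(["course", "faculty"], ["course", "course"], "course") -> (1.0, 0.5)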

Page 16:
Page 17:
Page 18:

First-Order Text Classification: Quinlan's FOIL algorithm, Introduction

Two families of first-order learning systems:

1. Successive refinement method:
A faulty theory is too general if it covers negative examples, and too specific if it does not cover all positive examples. The theory is revised until all examples are covered.

2. Separate-and-conquer strategy (greedy algorithms):
All examples are considered together, and at each iteration a new element (a literal) is added to the clause so that it covers some positive examples but no negative examples.

J.R. Quinlan, R.M. Cameron-Jones: "Induction of Logic Programs: FOIL and Related Systems"

Page 19:

Description of FOIL

As an example task, consider learning a definition of the membership relation on lists from a small world containing just the lists [ ], [1], [2], [3], [1,2], [2,3], and [1,2,3]. The target relation member(E,L) contains pairs whose first constant denotes an element that belongs to the list denoted by the second. In this small world there are just ten elements in member:

<1,[1]> <2,[2]> <3,[3]> <1,[1,2]> <2,[1,2]> <2,[2,3]> <3,[2,3]> <1,[1,2,3]> <2,[1,2,3]> <3,[1,2,3]>

As far as FOIL is concerned, lists like [1,2,3] are just constants, so a background relation components(L,H,T) is required to show how to find the head H and tail T of a list L. The elements making up components are:

<[1],1,[ ]> <[2],2,[ ]> <[3],3,[ ]> <[1,2],1,[2]> <[2,3],2,[3]> <[1,2,3],1,[2,3]>

where the first tuple states that list [1] has head 1 and tail [ ].

Page 20:

Description of FOIL (continued)

/* Top-down approach */

Initialization:
    theory := null program                                      /* learning concept member(E,L) */
    remaining := all positive elements of target relation R     /* <1,[1]>, <2,[2]>, ... */

Iteration:
    While remaining is not empty                /* some positive examples are not yet covered */
        clause := R(A,B,...) :-
        While clause covers negative ("-") elements
            Find appropriate literal(s) L (a.k.a. a background relation)
            Add L to the right-hand side of clause
        Remove positive ("+") elements covered by clause from remaining
        Add clause to theory

Page 21:

Description of FOIL (continued)

Initialization:
We illustrate the process using the member(E,L) relation. The initial clause consists of just the head literal:

    member(A,B) :-        (where A is the element and B is the list)

The set of examples corresponding to this initial partial clause is just the set of all possible positive and negative elements of the relation member(A,B).

All 10 positive examples:
<1,[1]>(+) <2,[2]>(+) <3,[3]>(+) <1,[1,2]>(+) <2,[1,2]>(+) <2,[2,3]>(+) <3,[2,3]>(+) <1,[1,2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)

Some negative examples:
<1,[ ]>(-) <1,[2]>(-) <1,[3]>(-) <1,[2,3]>(-) <2,[ ]>(-)

Page 22:

Description of FOIL (continued)

The literal components(B,A,C) is now added to the clause body to give the intermediate theory:

    member(A,B) :- components(B,A,C)

The new clause has three variables and is satisfied by the following tuples <A,B,C>:

<1,[1],[ ]>(+) <2,[2],[ ]>(+) <3,[3],[ ]>(+) <1,[1,2],[2]>(+) <2,[2,3],[3]>(+) <1,[1,2,3],[2,3]>(+)

For instance, <1,[1]> is removed from remaining because the clause is satisfied with A=1, B=[1], C=[ ]. In other words, if an element is the head H of a list, it is a member of that list.

Only 4 positive examples are not covered by the clause member(A,B) :- components(B,A,C):

<2,[1,2]>(+) <3,[2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)

Page 23:

Description of FOIL (continued)

Adding further literals gives the new partial clause

    member(A,B) :- components(B,C,D), member(A,D)

which covers the remaining 4 positive examples; the satisfying tuples <A,B,C,D> are:

<2,[1,2],1,[2]>(+) <3,[2,3],2,[3]>(+) <2,[1,2,3],1,[2,3]>(+) <3,[1,2,3],1,[2,3]>(+)

Every positive example is now covered, so the learned definition of member(E,L) consists of the two clauses:

    member(A,B) :- components(B,A,C).
    member(A,B) :- components(B,C,D), member(A,D).

The definition is complete and can be used for elements other than 1, 2, and 3.

Example: member(4,[1,2,3,4])
    member(4,[1,2,3,4]) :- components([1,2,3,4],1,[2,3,4]), member(4,[2,3,4])
    member(4,[2,3,4])   :- components([2,3,4],2,[3,4]),     member(4,[3,4])
    member(4,[3,4])     :- components([3,4],3,[4]),         member(4,[4])
    member(4,[4])       :- components([4],4,[ ])            (first clause, base case)
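A minimal sketch (not from the slides) of the learned two-clause definition, translated into Python to show that the recursion behaves as in the trace above:

def components(lst):
    """Background relation components(L,H,T): head and tail of a non-empty list."""
    if not lst:
        return None
    return lst[0], lst[1:]

def member(e, lst):
    # Clause 1: member(A,B) :- components(B,A,C).
    # Clause 2: member(A,B) :- components(B,C,D), member(A,D).
    parts = components(lst)
    if parts is None:
        return False
    head, tail = parts
    return e == head or member(e, tail)

print(member(4, [1, 2, 3, 4]))   # True
print(member(5, [1, 2, 3]))      # False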

Page 24:

First-Order Text Classification: Quinlan's FOIL algorithm for WebKB

• A greedy algorithm for learning function-free clauses.

• nc – the number of instances correctly classified by the rule;
• n – the total number of instances classified by the rule;
• p – a prior estimate of the rule's accuracy;
• m – a constant called the equivalent sample size, which determines how heavily p is weighted relative to the observed data (m = 2).
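These quantities parameterize the standard m-estimate of a rule's accuracy (reconstructed here; the equation image is not in the transcript):

\text{m-estimate accuracy} = \frac{n_c + m \cdot p}{n + m}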

Background relations (stemmed words with 200 occurrences):
• has_word(Page): indicates which words occur in which pages.
• link_to(Page, Page): represents the hyperlinks that interconnect the pages in the data set.

For all of the FOIL class classifiers, the m-estimate accuracy (see the formula above) is calculated to determine the winning class for each document d.

Page 25:

A few of the rules learned by FOIL

For the relation course(A), the FOIL algorithm learned a rule whose conditions are listed below (a sketch of the clause follows the list):

1. The page contains the word instructor, but not the word good.

2. The page has a hyperlink to another page that contains no outgoing links.

3. That linked page contains the word assign.
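A sketch of how such a rule might read in the paper's clause notation; this is a reconstruction from the three conditions above, not the verbatim learned rule, and the per-word predicate names (has_instructor, has_good, has_assign) are illustrative assumptions:

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B,_)), has_assign(B).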

Page 26:

Experimental Evaluation

Page 27:

Combining Learners

Method for combining the predictions of the classifiers:
• a simple voting scheme among all four classifiers (the class receiving the majority of the votes made by the individual classifiers wins);
• in case of a tie, the confidence level is used as a tie-breaker.

To ensure comparability (a sketch follows below):
• calibrate each classifier by inducing a mapping from its output scores to the probability of a prediction being correct;
• partition the scores produced by each classifier into bins, and then measure the training-set accuracy of the scores that fall into each bin.
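A minimal sketch (not the paper's implementation) of bin-based calibration and of the majority vote with confidence tie-breaking; the bin boundaries and the four-classifier setup are illustrative assumptions:

from collections import Counter
import bisect

def calibrate(scores, correct, bin_edges):
    """Map raw scores to estimated probability of being correct, per bin,
    using the training-set accuracy of the scores falling into each bin."""
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for s, ok in zip(scores, correct):
        b = bisect.bisect(bin_edges, s)
        totals[b] += 1
        hits[b] += ok
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

def calibrated_confidence(score, bin_edges, bin_accuracy):
    return bin_accuracy[bisect.bisect(bin_edges, score)]

def combine(predictions, confidences):
    """Majority vote over the classifiers; calibrated confidence breaks ties."""
    votes = Counter(predictions)
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the tied class backed by the highest calibrated confidence.
    return max(tied, key=lambda c: max(conf for p, conf in zip(predictions, confidences) if p == c))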

Page 28:

Experimental Evaluation

Page 29:

Experimental Evaluation

Page 30:

Identifying Multi-Page Segments

Goal: to develop methods for identifying sets of interlinked pages that together represent a single knowledge-base instance.

• Prior assumption: one page corresponds to one instance (a primary page plus supporting pages);
• New approach: group related pages together, using regularities in URL structures;
• Identify the most representative (primary) page of a group (for example, the "/~*/" naming pattern identifies a person entity);
• The main page can be identified by a file name such as index, home, or cs??? (see the sketch below).
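A minimal sketch (not the paper's Appendix C algorithm) of grouping pages by the user-directory portion of their URLs and picking a primary page by file name; the patterns below are illustrative assumptions:

import re
from collections import defaultdict
from urllib.parse import urlparse

# Pages under the same "/~user/" directory are grouped together.
USER_DIR = re.compile(r"^(/~[^/]+/)")
# File names that suggest a group's primary page (index, home, cs...).
PRIMARY = re.compile(r"^(index|home|cs\w*)\.", re.IGNORECASE)

def group_urls(urls):
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path or "/"
        m = USER_DIR.match(path)
        key = (urlparse(url).netloc, m.group(1) if m else path)
        groups[key].append(url)
    return groups

def primary_page(group):
    # Prefer the directory page itself or a file named like index.html, home.html, cs471.html, ...
    for url in group:
        name = urlparse(url).path.rsplit("/", 1)[-1]
        if name == "" or PRIMARY.match(name):
            return url
    return group[0]

pages = ["http://cs.cmu.edu/~tom/", "http://cs.cmu.edu/~tom/pubs.html"]
for key, grp in group_urls(pages).items():
    print(key, "->", primary_page(grp))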

Page 31:

The URL Grouping Algorithm (Appendix C)

Page 32:

Experimental Evaluation

Page 33:

Future work

Methods for document classification:

• Bayesian learning: Minimum Description Length (MDL);

• Symbolic learning: decision trees;

• k-NN (k-nearest neighbor) algorithm.