Learning to Construct Knowledge Bases from the World Wide Web
by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery
Igor Yakymenko, iy@cse.buffalo.edu, Department of Computer Science and Engineering, SUNY at Buffalo


Page 1:

Learning to Construct Knowledge Bases from the World Wide Web
by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery

Igor Yakymenko, iy@cse.buffalo.edu
Department of Computer Science and Engineering, SUNY at Buffalo

Page 2:

Fig. 1. An overview of the WebKB system

Page 3:

• Two of the entities were automatically extracted from the CMU computer science department Web site after the system was trained on four other university computer science sites. These entities were added to the knowledge base as new instances of faculty and project, extracted from Web hypertext.

Page 4:

Part II

• Pages 11-29
• Appendix B
• Appendix C

Page 5:

Learning to Recognize Class Instances

Task:
• To identify new instances of ontology classes from text sources on the Web.

Discussion:
• A statistical bag-of-words approach to classifying Web pages. This method is used with three different representations of pages.
• Learning first-order rules to classify Web pages.
• Evaluation of the effectiveness of combining the predictions made by all four of these classifiers.

Page 6:

Naive Bayes

Two common approaches:
• multi-variate Bernoulli model (binary word occurrence);
• multinomial model (integer word counts).

Given a set of classes C = {c1, ..., cn} and a document consisting of n words (w1, w2, ..., wn), we classify the document as a member of the class c* that is most probable given the words in the document:
• transform Pr(c|w1, ..., wn) by applying Bayes' rule;
• rewrite the expression using the product rule and drop the denominator;
• assume that words are independent of each other (see the reconstruction below).

Page 7:

Naive Bayes Classifier Limitations

1. Naive Bayes is not well suited to estimating the level of confidence of its predictions across classes.

2. The winning class tends to get a posterior probability near 1 (an artifact of the independence assumption).

3. The losing classes tend to get posterior probabilities close to 0.

The authors' proposal is to modify the existing formula to overcome these limitations.
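A minimal sketch (not from the slides) illustrating limitations 2 and 3: once many conditionally independent word factors are multiplied together, a tiny per-word difference between classes pushes the normalized posteriors to the extremes. The class names and probabilities below are made up for illustration.

import math

# Hypothetical per-word likelihoods for two classes; the small difference
# is amplified when it is multiplied over many words.
pr_w_given_a = 0.012
pr_w_given_b = 0.010
n_words = 300                      # length of the document
log_prior = math.log(0.5)          # equal priors

log_score_a = log_prior + n_words * math.log(pr_w_given_a)
log_score_b = log_prior + n_words * math.log(pr_w_given_b)

# Normalize in log space to obtain posteriors.
m = max(log_score_a, log_score_b)
za = math.exp(log_score_a - m)
zb = math.exp(log_score_b - m)
print("Pr(a|d) =", za / (za + zb))   # ~1.0
print("Pr(b|d) =", zb / (za + zb))   # ~0.0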

Page 8:

Modifications to Naive Bayes

Goal: scores that accurately reflect the uncertainty of each prediction and make it possible to sensibly compare the scores of multiple documents (a smooth function of confidence).

Begin with naive Bayes, rewrite the expression so that it ranges over all words in the vocabulary T instead of just the words in the document (B.1), take the log (B.2), and divide by the number of words in the document (B.3).
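A reconstruction of the three numbered expressions (the equation images are not in the transcript); here N(w_i|d) denotes the number of times word w_i occurs in document d:

(B.1)  \Pr(c) \prod_{w_i \in T} \Pr(w_i \mid c)^{N(w_i \mid d)}
(B.2)  \log \Pr(c) + \sum_{w_i \in T} N(w_i \mid d) \log \Pr(w_i \mid c)
(B.3)  \frac{1}{n} \left( \log \Pr(c) + \sum_{w_i \in T} N(w_i \mid d) \log \Pr(w_i \mid c) \right)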

Page 9:

Modifications to Naive Bayes (continued)

By substituting Pr(wi|d) for N(wi|d)/n, the authors derived the following formula:
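A reconstruction of that formula (the equation image is not in the transcript):

\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{w_i \in T} \Pr(w_i \mid d) \log \Pr(w_i \mid c)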

where:
• n – the number of words in the document;
• Pr(c) – the prior probability of class c;
• Pr(wi|d) – the probability (frequency) of word wi in document d;
• T – the whole vocabulary;
• Pr(wi|c) – the probability (frequency) of word wi in class c.

\Pr(w_i \mid d) = \frac{N(w_i \mid d)}{n}

Page 10:

Modifications to Naive Bayes (continued)

Subtracting the optimal encoding of the document gives the final formula for the score. The class with the largest score is the one predicted for the document:
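A reconstruction of the final formula (the equation image is not in the transcript); the subtracted "optimal encoding" term is \sum_{w_i \in T} \Pr(w_i \mid d) \log \Pr(w_i \mid d), which is the same for every class and therefore does not change the ranking:

\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{w_i \in T} \Pr(w_i \mid d) \log \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)}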

The summation on the right-hand side is the negative relative entropy (KL divergence) between the document's word distribution and the class's word distribution:

• a measure of how different two probability distributions are;
• the average number of bits that are "wasted" by encoding events from a distribution p with a code based on a not-quite-right distribution q.

D(p \| q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Here p corresponds to \Pr(w_i \mid d) and q to \Pr(w_i \mid c).

Page 11:

Naive Bayes Classifier (conclusion)

Approach: build a probabilistic model of each class using labeled training data, and then classify new pages by selecting the most probable class.

Given a document d to classify, a score is calculated for each class c; the class predicted by the method is the class with the greatest score (a sketch follows below):

• n is the number of words in d;
• T is the size of the vocabulary;
• wi is the ith word in the vocabulary;
• Pr(wi|c) is the probability that a randomly drawn word is wi, given class c;
• Pr(wi|d) is the proportion of word wi in document d.
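A minimal sketch (not from the slides) of how this score could be computed; priors and word_probs are hypothetical inputs assumed to be estimated from labeled training data, with smoothing already applied:

import math
from collections import Counter

def score(doc_words, cls, priors, word_probs):
    """Score_c(d) = log Pr(c)/n + sum over words of Pr(wi|d) * log(Pr(wi|c)/Pr(wi|d))."""
    n = len(doc_words)
    s = math.log(priors[cls]) / n
    for w, count in Counter(doc_words).items():
        if w not in word_probs[cls]:        # out-of-vocabulary words are ignored in this sketch
            continue
        pr_w_d = count / n                  # frequency of the word in the document
        pr_w_c = word_probs[cls][w]         # smoothed Pr(wi|c) estimated from training data
        s += pr_w_d * math.log(pr_w_c / pr_w_d)
    return s

def classify(doc_words, classes, priors, word_probs):
    # The predicted class is the one with the greatest score.
    return max(classes, key=lambda c: score(doc_words, c, priors, word_probs))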

Page 12:

Experimental Evaluation

Page 13:

Experimental Evaluation

Insight into the learned classifiers can be obtained by asking which words contribute most highly to the quantity Score_c(d) for each class:

• Most of the highly weighted words are intuitively prototypical for their class.
• Many words that conventionally appear on stop lists are highly weighted by the model and were therefore kept in the vocabulary.

Page 14:
Page 15:

Coverage – the percentage of pages of a given class that are correctly classified.
Accuracy – the percentage of pages classified into a given class that are actually members of that class.
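A minimal sketch (not from the slides) of how these two measures can be computed from true and predicted labels:

def coverage_and_accuracy(true_labels, predicted_labels, cls):
    """Coverage and accuracy of predictions for a single class `cls`.

    Coverage: fraction of pages truly in `cls` that were classified as `cls`.
    Accuracy: fraction of pages classified as `cls` that truly belong to `cls`.
    """
    preds_for_class = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
    truths_for_pred = [t for t, p in zip(true_labels, predicted_labels) if p == cls]
    coverage = sum(p == cls for p in preds_for_class) / len(preds_for_class) if preds_for_class else 0.0
    accuracy = sum(t == cls for t in truths_for_pred) / len(truths_for_pred) if truths_for_pred else 0.0
    return coverage, accuracy

# Example:
# coverage_and_accuracy(["course", "faculty"], ["course", "course"], "course") -> (1.0, 0.5)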

Page 16:
Page 17:
Page 18:

First-Order Text Classification: Quinlan's FOIL algorithm, Introduction

Two families of first-order learning systems:

1. Successive refinement method:
A faulty theory is too general if it covers negative examples, and too specific if it does not cover all positive examples. The theory is revised until all examples are covered.

2. Separate-and-conquer strategy (greedy algorithms):
All examples are considered together, and at each iteration a new element (a literal) is added to the clause so that it covers some positive examples but no negative examples.

J.R. Quinlan, R.M. Cameron-Jones: "Induction of Logic Programs: FOIL and Related Systems"

Page 19:

Description of FOIL

As an example task, consider learning a definition of the membership relation on lists from a small world containing just the lists [ ], [1], [2], [3], [1,2], [2,3], and [1,2,3]. The target relation member(E,L) contains pairs whose first constant denotes an element that belongs to the list denoted by the second. In this small world there are just ten elements in member:

<1,[1]> <2,[2]> <3,[3]> <1,[1,2]> <2,[1,2]> <2,[2,3]> <3,[2,3]> <1,[1,2,3]> <2,[1,2,3]> <3,[1,2,3]>

As far as FOIL is concerned, lists like [1,2,3] are just constants, so a background relation components(L,H,T) is required to show how to find the head H and tail T of a list L. The elements making up components are:

<[1],1,[ ]> <[2],2,[ ]> <[3],3,[ ]> <[1,2],1,[2]> <[2,3],2,[3]> <[1,2,3],1,[2,3]>

where the first tuple states that list [1] has head 1 and tail [ ].

Page 20:

Description of FOIL (continued)

/* Top-down approach */

Initialization:
    theory := null program                                      /* learning concept member(E,L) */
    remaining := all positive elements of target relation R     /* <1,[1]>, <2,[2]>, ... */

Iteration:
    While remaining is not empty                /* some positive examples are not yet covered */
        clause := R(A,B,...) :-
        While clause covers negative ("-") elements
            Find appropriate literal(s) L (a.k.a. a background relation)
            Add L to the right-hand side of clause
        Remove positive ("+") elements covered by clause from remaining
        Add clause to theory

Page 21:

Description of FOIL (continued)

Initialization:
We illustrate the process using the member(E,L) relation. The initial clause consists of just the head literal:

    member(A,B) :-        (where A is the element and B is the list)

The set of examples corresponding to this initial partial clause is just the set of all possible positive and negative elements of the relation member(A,B).

All 10 positive examples:
<1,[1]>(+) <2,[2]>(+) <3,[3]>(+) <1,[1,2]>(+) <2,[1,2]>(+) <2,[2,3]>(+) <3,[2,3]>(+) <1,[1,2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)

Some negative examples:
<1,[ ]>(-) <1,[2]>(-) <1,[3]>(-) <1,[2,3]>(-) <2,[ ]>(-)

Page 22:

Description of FOIL (continued)

The literal components(B,A,C) is now added to the clause body to give the intermediate theory:

    member(A,B) :- components(B,A,C)

The new clause has three variables and is satisfied by the following tuples <A,B,C>:

<1,[1],[ ]>(+) <2,[2],[ ]>(+) <3,[3],[ ]>(+) <1,[1,2],[2]>(+) <2,[2,3],[3]>(+) <1,[1,2,3],[2,3]>(+)

For instance, <1,[1]> is removed from remaining because the clause is satisfied with A=1, B=[1], C=[ ]. In other words, if an element is the head H of a list, it is a member of that list.

Only 4 positive examples are not covered by the clause member(A,B) :- components(B,A,C):

<2,[1,2]>(+) <3,[2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)

Page 23:

Description of FOIL (continued)

Adding further literals gives the new partial clause

    member(A,B) :- components(B,C,D), member(A,D)

which covers the remaining 4 positive examples; the satisfying tuples <A,B,C,D> are:

<2,[1,2],1,[2]>(+) <3,[2,3],2,[3]>(+) <2,[1,2,3],1,[2,3]>(+) <3,[1,2,3],1,[2,3]>(+)

Every positive example is now covered, so the learned definition of member(E,L) consists of the two clauses:

    member(A,B) :- components(B,A,C).
    member(A,B) :- components(B,C,D), member(A,D).

The definition is complete and can be used for elements other than 1, 2, and 3.

Example: member(4,[1,2,3,4])
    member(4,[1,2,3,4]) :- components([1,2,3,4],1,[2,3,4]), member(4,[2,3,4])
    member(4,[2,3,4])   :- components([2,3,4],2,[3,4]),     member(4,[3,4])
    member(4,[3,4])     :- components([3,4],3,[4]),         member(4,[4])
    member(4,[4])       :- components([4],4,[ ])            (first clause, base case)
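A minimal sketch (not from the slides) of the learned two-clause definition, translated into Python to show that the recursion behaves as in the trace above:

def components(lst):
    """Background relation components(L,H,T): head and tail of a non-empty list."""
    if not lst:
        return None
    return lst[0], lst[1:]

def member(e, lst):
    # Clause 1: member(A,B) :- components(B,A,C).
    # Clause 2: member(A,B) :- components(B,C,D), member(A,D).
    parts = components(lst)
    if parts is None:
        return False
    head, tail = parts
    return e == head or member(e, tail)

print(member(4, [1, 2, 3, 4]))   # True
print(member(5, [1, 2, 3]))      # False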

Page 24:

First-Order Text Classification: Quinlan's FOIL algorithm for WebKB

• A greedy algorithm for learning function-free clauses.

• nc – the number of instances correctly classified by the rule;
• n – the total number of instances classified by the rule;
• p – a prior estimate of the rule's accuracy;
• m – a constant called the equivalent sample size, which determines how heavily p is weighted relative to the observed data (m = 2).
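These quantities parameterize the standard m-estimate of a rule's accuracy (reconstructed here; the equation image is not in the transcript):

\text{m-estimate accuracy} = \frac{n_c + m \cdot p}{n + m}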

Background relations (stemmed words with 200 occurrences):
• has_word(Page): indicates which words occur in which pages.
• link_to(Page, Page): represents the hyperlinks that interconnect the pages in the data set.

For all of the FOIL class classifiers, the m-estimate accuracy (see the formula above) is calculated to determine the winning class for each document d.

Page 25:

A few of the rules learned by FOIL

For the relation course(A), the FOIL algorithm learned a rule whose conditions are listed below (a sketch of the clause follows the list):

1. The page contains the word instructor, but not the word good.

2. The page has a hyperlink to another page that contains no outgoing links.

3. That linked page contains the word assign.
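A sketch of how such a rule might read in the paper's clause notation; this is a reconstruction from the three conditions above, not the verbatim learned rule, and the per-word predicate names (has_instructor, has_good, has_assign) are illustrative assumptions:

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B,_)), has_assign(B).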

Page 26:

Experimental Evaluation

Page 27:

Combining Learners

Method for combining the predictions of the classifiers:
• a simple voting scheme among all four classifiers (the class receiving the majority of the votes made by the individual classifiers wins);
• in case of a tie, the confidence level is used as a tie-breaker.

To ensure comparability (a sketch follows below):
• calibrate each classifier by inducing a mapping from its output scores to the probability of a prediction being correct;
• partition the scores produced by each classifier into bins, and then measure the training-set accuracy of the scores that fall into each bin.
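A minimal sketch (not the paper's implementation) of bin-based calibration and of the majority vote with confidence tie-breaking; the bin boundaries and the four-classifier setup are illustrative assumptions:

from collections import Counter
import bisect

def calibrate(scores, correct, bin_edges):
    """Map raw scores to estimated probability of being correct, per bin,
    using the training-set accuracy of the scores falling into each bin."""
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for s, ok in zip(scores, correct):
        b = bisect.bisect(bin_edges, s)
        totals[b] += 1
        hits[b] += ok
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

def calibrated_confidence(score, bin_edges, bin_accuracy):
    return bin_accuracy[bisect.bisect(bin_edges, score)]

def combine(predictions, confidences):
    """Majority vote over the classifiers; calibrated confidence breaks ties."""
    votes = Counter(predictions)
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the tied class backed by the highest calibrated confidence.
    return max(tied, key=lambda c: max(conf for p, conf in zip(predictions, confidences) if p == c))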

Page 28:

Experimental Evaluation

Page 29:

Experimental Evaluation

Page 30:

Identifying Multi-Page Segments

Goal: to develop methods for identifying sets of interlinked pages that together represent a single knowledge-base instance.

• Prior assumption: one page corresponds to one instance (a primary page plus supporting pages);
• New approach: group related pages together, using regularities in URL structures;
• Identify the most representative (primary) page of a group (for example, the "/~*/" naming pattern identifies a person entity);
• The main page can be identified by a file name such as index, home, or cs??? (see the sketch below).
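A minimal sketch (not the paper's Appendix C algorithm) of grouping pages by the user-directory portion of their URLs and picking a primary page by file name; the patterns below are illustrative assumptions:

import re
from collections import defaultdict
from urllib.parse import urlparse

# Pages under the same "/~user/" directory are grouped together.
USER_DIR = re.compile(r"^(/~[^/]+/)")
# File names that suggest a group's primary page (index, home, cs...).
PRIMARY = re.compile(r"^(index|home|cs\w*)\.", re.IGNORECASE)

def group_urls(urls):
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path or "/"
        m = USER_DIR.match(path)
        key = (urlparse(url).netloc, m.group(1) if m else path)
        groups[key].append(url)
    return groups

def primary_page(group):
    # Prefer the directory page itself or a file named like index.html, home.html, cs471.html, ...
    for url in group:
        name = urlparse(url).path.rsplit("/", 1)[-1]
        if name == "" or PRIMARY.match(name):
            return url
    return group[0]

pages = ["http://cs.cmu.edu/~tom/", "http://cs.cmu.edu/~tom/pubs.html"]
for key, grp in group_urls(pages).items():
    print(key, "->", primary_page(grp))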

Page 31:

The URL Grouping Algorithm (Appendix C)

Page 32:

Experimental Evaluation

Page 33:

Future work

Methods for document classification:

• Bayesian learning: Minimum Description Length (MDL);

• Symbolic learning: decision trees;

• k-NN (k-nearest neighbor) algorithm.