KnowItAll

April 5 2007

William Cohen

Announcements

• Reminder: project presentations (or progress report)

– Sign up for a 30min presentation (or else)
– First pair of slots is April 17
– Last pair of slots is May 10

• William is out of town April 6-April 9
– So, no office hours Friday.

• Next week: no critiques assigned
– But I will lecture

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

Etzioni et al 2005

Rosenfeld and Feldman 2006

Stevenson & Greenwood 2005

Clever idea for learning relation patterns & strong experimental results

De-emphasize duality, focus on distance between patterns.

Know It All

Architecture

Set of (disjoint?) predicates to consider + two names for each

~= [H92]

• Context – keywords from user to filter out non-domain pages
• … ?

Architecture

Bootstrapping - 1

“city”

query

template rule

Bootstrapping - 2

Each discriminator U is a function: f_U(x) = hits(“city x”) / hits(“x”)

i.e., f_U(“Pittsburgh”) = hits(“city Pittsburgh”) / hits(“Pittsburgh”)

These are then used to create features: f_U(x) > θ and f_U(x) < θ
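
A minimal sketch of one discriminator and its thresholded features. The HITS table stands in for search-engine hit counts and its numbers are invented for illustration; the names hits, discriminator_score and threshold_features are hypothetical, not from the KnowItAll system.

```python
# Hypothetical hit counts; a real system would query a search engine.
HITS = {
    "Pittsburgh": 2000, "city Pittsburgh": 100,
    "Banana": 5000, "city Banana": 5,
}

def hits(phrase):
    """Return the (made-up) hit count for a phrase, 0 if unknown."""
    return HITS.get(phrase, 0)

def discriminator_score(x, discriminator="city"):
    """f_U(x) = hits("<discriminator> x") / hits("x")."""
    denom = hits(x)
    if denom == 0:
        return 0.0  # assumption: an unseen string gets a score of 0
    return hits(f"{discriminator} {x}") / denom

def threshold_features(score, theta):
    """Turn a score into the two boolean features f_U(x) > theta and f_U(x) < theta."""
    return {"above": score > theta, "below": score < theta}

pittsburgh = threshold_features(discriminator_score("Pittsburgh"), theta=0.01)
banana = threshold_features(discriminator_score("Banana"), theta=0.01)
```

With these invented counts, "Pittsburgh" lands above the threshold and "Banana" below it; the threshold itself would have to be tuned per discriminator.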

Bootstrapping - 3

1. Submit the queries & apply the rules to produce initial seeds.

2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|

3. Take the top seeds from each class and call them POSITIVE, then use disjointness of classes to find NEGATIVE seeds.

4. Train a NaiveBayes classifier using thresholded U’s as features.
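
Step 4 can be sketched as a small Bernoulli Naive Bayes over boolean (thresholded) discriminator features. This is an illustrative stand-in, not the system's actual classifier: the add-one smoothing, the function names train_nb and prob_positive, and the toy training set are all assumptions.

```python
import math

def train_nb(examples):
    """examples: list of (tuple_of_bool_features, bool_label)."""
    n_feats = len(examples[0][0])
    counts = {True: [0] * n_feats, False: [0] * n_feats}
    totals = {True: 0, False: 0}
    for feats, label in examples:
        totals[label] += 1
        for i, f in enumerate(feats):
            counts[label][i] += int(f)
    model = {}
    for label in (True, False):
        # Add-one smoothing on both the class prior and each feature.
        prior = (totals[label] + 1) / (len(examples) + 2)
        probs = [(counts[label][i] + 1) / (totals[label] + 2)
                 for i in range(n_feats)]
        model[label] = (prior, probs)
    return model

def prob_positive(model, feats):
    """P(label = True | feats) under the Naive Bayes independence assumption."""
    log_scores = {}
    for label, (prior, probs) in model.items():
        s = math.log(prior)
        for f, p in zip(feats, probs):
            s += math.log(p if f else 1.0 - p)
        log_scores[label] = s
    m = max(log_scores.values())
    exp = {k: math.exp(v - m) for k, v in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])

# Toy seeds: each tuple is (f_U1(x) > theta1, f_U2(x) > theta2).
training = [
    ((True, True), True), ((True, False), True),      # POSITIVE seeds
    ((False, False), False), ((False, True), False),  # NEGATIVE seeds
]
model = train_nb(training)
```

Here the first feature separates the classes and the second carries no signal, so the classifier's output is driven by the first discriminator alone.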

Bootstrapping - 4

Estimate using the classifier based on the previously-trained discriminators

Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2

Extensions to KnowItAll

• Problem: Unsupervised learning finds clusters; what if the text doesn’t support the clustering we want?
– E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”

• Solution: subclass extraction
– Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
– Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
– Extract from each subclass recursively

Extensions to KnowItAll

• Problem: Set of rules is limited:
– Derived from fixed set of “templates” (general patterns ~ from H92)

• Solution 1: Pattern learning: augment the initial set of rules derivable from templates
1. Search for instances I on the web
2. Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
3. Assume classes are disjoint and estimate recall/precision of each pattern P
4. Exclude patterns that cover only one seed (very low recall)
5. Take the top 200 remaining patterns and
• Evaluate them as extractors “using PMI” (?)
• Evaluate them as discriminators (in usual way?)

Examples: “headquartered in <city>”, “<city> hotels”, …
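
Steps 2 and 4 of pattern learning can be sketched as follows, under simplifying assumptions: instances are single whitespace-delimited tokens, the "web" is a short list of sentences, and the names patterns_for and learn_patterns are hypothetical. The recall/precision estimate of step 3 and the top-200 cut of step 5 are left out.

```python
from collections import defaultdict

def patterns_for(tokens, seed, max_width=4):
    """All context windows 'b_k ... b_1 <X> a_1 ... a_j' around a seed, k, j <= max_width."""
    out = set()
    for i, tok in enumerate(tokens):
        if tok != seed:
            continue
        for wb in range(max_width + 1):
            for wa in range(max_width + 1):
                if wb == 0 and wa == 0:
                    continue  # a bare <X> is not a useful pattern
                if wb > i or wa > len(tokens) - i - 1:
                    continue  # window would run off the sentence
                before = tokens[i - wb:i]
                after = tokens[i + 1:i + 1 + wa]
                out.add(" ".join(before + ["<X>"] + after))
    return out

def learn_patterns(sentences, seeds, min_seeds=2):
    """Map each pattern to the seeds it covers; drop patterns covering < min_seeds."""
    coverage = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for seed in seeds:
            for pattern in patterns_for(tokens, seed):
                coverage[pattern].add(seed)
    return {p: s for p, s in coverage.items() if len(s) >= min_seeds}

sentences = [
    "hotels in Paris are cheap",
    "hotels in Rome are nice",
    "visit Boston today",
]
kept = learn_patterns(sentences, ["Paris", "Rome", "Boston"])
```

On this toy input, patterns such as "in <X>" survive because two seeds share them, while "visit <X>" is dropped because only one seed supports it.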

Extensions to KnowItAll

• Solution 2:
– List extraction: augment the initial set of rules with rules that are local to a specific web page
1. Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
2. For each page P:
• Find subtrees T of the DOM tree that contain >k seeds
• Find longest common prefix/suffix of the seeds in T
– [Some heuristics added to generalize this further]
• Find all other strings inside T with the same prefix/suffix
• Heuristically select the “best” wrapper for a page
– Wrapper = P, T, prefix, suffix
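
The prefix/suffix step can be sketched on a flattened subtree. As a simplifying assumption, the subtree T is represented as a list of raw HTML strings rather than a real DOM, and the generalization heuristics and wrapper selection are omitted; induce_wrapper and apply_wrapper are hypothetical names.

```python
def common_prefix(strings):
    """Longest string that every element starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def common_suffix(strings):
    """Longest string that every element ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]

def induce_wrapper(nodes, seeds):
    """Return (prefix, suffix) shared by the text around the seeds, or None."""
    lefts, rights = [], []
    for node in nodes:
        for seed in seeds:
            idx = node.find(seed)
            if idx >= 0:
                lefts.append(node[:idx])
                rights.append(node[idx + len(seed):])
    if len(lefts) < 2:
        return None  # too few seeds in this subtree
    prefix, suffix = common_suffix(lefts), common_prefix(rights)
    if not prefix or not suffix:
        return None  # this sketch needs both delimiters to be non-empty
    return prefix, suffix

def apply_wrapper(nodes, prefix, suffix):
    """Extract every string delimited by prefix ... suffix."""
    found = []
    for node in nodes:
        start = node.find(prefix)
        if start < 0:
            continue
        start += len(prefix)
        end = node.find(suffix, start)
        if end >= 0:
            found.append(node[start:end])
    return found

subtree = [
    "<li><b>London</b></li>",
    "<li><b>Paris</b></li>",
    "<li><b>Rome</b></li>",
    "<p>About us</p>",
]
wrapper = induce_wrapper(subtree, ["London", "Paris"])
extracted = apply_wrapper(subtree, *wrapper)
```

Two seeds are enough here to recover the list markup, and applying the wrapper then pulls out the unseen item "Rome" while skipping the unrelated paragraph.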

T1

w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

T2

w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

T3

w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w3 Italy, Japan, France, Israel, Spain, Brazil

T4

w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w3 Italy, Japan, France, Israel, Spain, Brazil

w4 Italy, Japan

[…]

Results - City

Results - Film

Results - Scientist

Observations

• Corpus is accessed indirectly through the Google API
– Only use top k discriminators
– Run extractors via query keywords & extract
– Limited by network access time

• Lots of moving parts to engineer
– Rule templates
– Signal-to-noise
– LE wrapper evaluation details
– Parameters: number of discriminators, number of seeds to keep, number of names per concept, …

KnowItNow: Son of KnowItAll

• Goal: faster results, not better results
• Difference 1:
– Store documents locally
– Build local index (Bindings Engine) optimized for finding instances of KnowItAll rules and patterns
• Based on inverted index: term → (doc, position, contextInfo)
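
A rough sketch of an inverted index whose postings carry context, so that a pattern such as "cities such as <X>" can be answered from one posting list without re-reading the documents. Treating contextInfo as "the next few tokens" is an assumption made for illustration; the slide does not say what the Bindings Engine actually stores, and build_index and bindings are hypothetical names.

```python
from collections import defaultdict

def build_index(docs, window=3):
    """term -> list of (doc_id, position, context), context = next `window` tokens."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.split()
        for pos, term in enumerate(tokens):
            context = tuple(tokens[pos + 1:pos + 1 + window])
            index[term.lower()].append((doc_id, pos, context))
    return index

def bindings(index, pattern_terms):
    """Find what follows the phrase `pattern_terms`, using only the first term's postings."""
    n = len(pattern_terms)
    results = []
    for doc_id, _pos, context in index.get(pattern_terms[0], []):
        if len(context) < n:
            continue  # stored context too short to hold the rest of the phrase plus a binding
        rest = tuple(t.lower() for t in context[:n - 1])
        if rest == tuple(pattern_terms[1:]):
            results.append((doc_id, context[n - 1]))
    return results

docs = {
    "d1": "cities such as Boston are nice",
    "d2": "such cities",
}
index = build_index(docs)
found = bindings(index, ["cities", "such", "as"])
```

The trade-off in this sketch is index size for speed: every posting stores extra tokens, but a pattern lookup touches a single posting list.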

KnowItNow: Son of KnowItAll

• Difference 2:
– New model (URNS model) to merge information from multiple extraction rules
– Intuition: instances generated from each extractor are assumed to be a mixture of two distributions
1. Random noise from large instance pool
2. Stuff with known structure (e.g., uniform, Zipf’s law, …)
– Using EM you can estimate mixture probabilities and parameters of non-noisy data
– Prob(x ∈ noise | x extracted)
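
The EM idea can be illustrated with a much simpler stand-in than the URNS model itself: a two-component Poisson mixture over how many times each instance was extracted, with a low-rate "noise" component and a high-rate "signal" component. The Poisson choice, the starting values, and the toy counts are all assumptions made for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def em_two_poisson(counts, iters=50, p_noise=0.5, lam_noise=0.5, lam_signal=5.0):
    """Fit a noise/signal Poisson mixture; return parameters and P(noise | count)."""
    resp = []
    for _ in range(iters):
        # E-step: posterior probability that each instance came from the noise component.
        resp = []
        for k in counts:
            a = p_noise * poisson_pmf(k, lam_noise)
            b = (1.0 - p_noise) * poisson_pmf(k, lam_signal)
            resp.append(a / (a + b))
        # M-step: re-estimate the mixing weight and both rates from the soft assignments.
        total_noise = sum(resp)
        total_signal = len(counts) - total_noise
        p_noise = total_noise / len(counts)
        lam_noise = sum(r * k for r, k in zip(resp, counts)) / total_noise
        lam_signal = sum((1.0 - r) * k for r, k in zip(resp, counts)) / total_signal
    return p_noise, lam_noise, lam_signal, resp

# Toy data: number of times each candidate instance was extracted.
counts = [1, 1, 1, 2, 1, 1, 40, 38, 45, 1]
p_noise, lam_noise, lam_signal, resp = em_two_poisson(counts)
```

Each entry of resp plays the role of Prob(x ∈ noise | x extracted): rarely extracted instances end up with values near 1 and frequently extracted ones near 0.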

KnowItNow: Son of KnowItAll

Uniform: 137 colors = 41% of mass; 15,346 colors = 59% of mass
• Prob(noise) = 0.59
• Non-noisy data: uniform over 137 instances

Zipf: 41% of mass fits power law; 59% of mass doesn’t
• Prob(noise) = 0.59
• Non-noisy data: Zipf’s over >N instances