
Page 1:

On WordNet, Text Mining, and Knowledge Bases of the Future

Peter Clark

Knowledge Systems, Boeing Phantom Works

Page 2:

On Machine Understanding

• Creating models from data…

“China launched a meteorological satellite into orbit Wednesday, the first of five weather guardians to be sent into the skies before 2008.”

• Suggests:
  – there was a rocket launch
  – China owns the satellite
  – the satellite is for monitoring weather
  – the orbit is around the Earth
  – etc.

• None of these facts are explicitly stated in the text

Page 3:

On Machine Understanding

• Understanding = creating a situation-specific model (SSM), coherent with data & background knowledge
  – Data suggests background knowledge which may be appropriate
  – Background knowledge suggests ways of interpreting data

[Diagram: fragmentary, ambiguous inputs → ? → coherent model (situation-specific)]

Page 4:

On Machine Understanding

[Diagram: fragmentary, ambiguous inputs (only a tiny part of the target model; contain errors and ambiguity; not even a subset of the target model) are turned into a coherent, situation-specific model by assembly of pieces, assessment of coherence, and inference, drawing on World Knowledge (core theories of the world + a ton of common-sense/episodic/experiential knowledge, “the way the world is”)]

Page 5:

On Machine Understanding

• Conjectures about the nature of the beast:
  – “Small” number of core theories
    • space, time, movement, …
    • can encode directly
  – Large amount of “mundane” facts
    • a dictionary contains many of these facts
    • also: episodic/script-like knowledge needed


Page 6:

On Machine Understanding

• How to acquire this background knowledge?
  – Manual encoding (e.g., Cyc, WordNet)
  – NLP on a dictionary (e.g., MindNet)
  – Community-wide acquisition (e.g., OpenMind)
  – Knowledge mining from text (e.g., Schubert)
  – Knowledge acquisition technology:
    • graphical (e.g., Shaken)
    • entry using “controlled” (simple) English


Page 7:

What We’re Trying To Do…

[Diagram: an English-based description of a scene (partial, ambiguous) + a knowledge base → a coherent representation of the scene (elaborated, disambiguated) → Question-Answering, Search, etc.]

Page 8:

Illustration: Caption-Based Video Retrieval

[Diagram: Video → Captions (manual authoring), e.g., “A man pulls and closes an airplane door”, “A lever is rotated to the unarmed position”, … → Caption text interpretation, giving a graph such as Pull (agent: Man, object: Door; Door is-part-of Airplane) → Elaboration (inference, scene-building) using World Knowledge → elaborated scene graph. A query such as Touch (Person, Door) is then matched against these graphs by Search.]

Page 10:

Some Example Inferences

“Someone broke the side mirrors of a truck” → the truck is damaged

…if only the system knew that…

IF X breaks Y THEN (result) Y is damaged

IF X is-part-of Y AND X is damaged THEN Y is damaged

A mirror is part of a truck
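
A minimal sketch (an assumed triple representation, not the deck's actual KM encoding) of how these two rules and the part-of fact chain together by forward chaining to conclude that the truck is damaged:

# Hypothetical forward-chaining sketch: derive "truck1 is damaged" from
# "someone broke mirror1", "mirror1 is-part-of truck1", and the two rules above.
facts = {("broke", "someone1", "mirror1"),
         ("is-part-of", "mirror1", "truck1")}
damaged = set()          # unary facts: things known to be damaged

def step(facts, damaged):
    """One pass of the two rules; returns newly damaged objects."""
    new = set()
    for rel, x, y in facts:
        if rel == "broke":                        # IF X breaks Y THEN (result) Y is damaged
            new.add(y)
        if rel == "is-part-of" and x in damaged:  # IF X is-part-of Y AND X is damaged THEN Y is damaged
            new.add(y)
    return new - damaged

while True:                # forward-chain to a fixed point
    new = step(facts, damaged)
    if not new:
        break
    damaged |= new

print("truck1" in damaged)   # True: the truck is damaged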

Page 11:

Some Example Inferences

“A man cut his hand on a piece of metal” → the man is hurt

…if only the system knew that…

IF organism is cut THEN (result) organism is hurt

IF X is-part-of Y AND X is cut THEN Y is cut

A hand is part of a person

(also: Metal can cut)

Page 12:

Some Example Inferences

“A man carries a box across the floor” → a person is walking

…if only the system knew that…

IF X is carrying something THEN X is walking

IF X is a man THEN X is a person

Page 13:

Some Example Inferences

“The car engine” → the engine part of a car
“The car fuel” → the fuel which is consumed by the car
“The car driver” → the person who is driving the car

…if only the system knew that…

Cars have engines
Cars consume fuel
People drive cars
A driver is a person who is driving
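
A toy sketch, with an invented relation table and paraphrase templates, of how such background facts could drive compound interpretation: look up a known relation linking the two nouns and paraphrase it:

# Hypothetical background facts a KB might hold: (head, relation, modifier) triples.
BACKGROUND = {
    ("engine", "is-part-of", "car"),
    ("fuel", "is-consumed-by", "car"),
    ("driver", "drives", "car"),
}

PARAPHRASE = {
    "is-part-of": "the {head} that is part of a {mod}",
    "is-consumed-by": "the {head} that is consumed by the {mod}",
    "drives": "the person ({head}) who is driving the {mod}",
}

def interpret_compound(modifier, head):
    """Interpret 'the <modifier> <head>' via the first matching background relation."""
    for h, rel, m in BACKGROUND:
        if h == head and m == modifier:
            return PARAPHRASE[rel].format(head=head, mod=modifier)
    return None

for modifier, head in [("car", "engine"), ("car", "fuel"), ("car", "driver")]:
    print(f"the {modifier} {head} ->", interpret_compound(modifier, head))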

Page 14:

Some Example Inferences

“The man writes with a pen” → pen = pen_n1 (writing implement)
“The pig is in the pen” → pen = pen_n2 (container, typically for confining animals)

…if only the system knew that…

people write
writing is done with writing implements
a pen (n1) is a writing implement

a pig is an animal
animals are sometimes confined
a pen (n2) confines animals
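
A small sketch (invented sense inventory and cue words, essentially a Lesk-style overlap score rather than the deck's system) of how such background facts could select between the two senses of “pen”:

# Hypothetical sense inventory: each sense lists words its background facts connect it to.
SENSE_CUES = {
    "pen_n1 (writing implement)": {"write", "writes", "writing", "ink", "paper", "man", "people"},
    "pen_n2 (animal enclosure)":  {"pig", "animal", "animals", "confine", "confined", "farm"},
}

def disambiguate(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & words))

print(disambiguate("The man writes with a pen"))   # pen_n1 (writing implement)
print(disambiguate("The pig is in the pen"))       # pen_n2 (animal enclosure)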

Page 15:

Some Example Inferences

“The blue car” → the color of the car is blue

…if only the system knew that…

physical objects can have colors
blue is a color
a car is a physical object

Page 16:

WordNet…

• psycholinguistically motivated lexical reference system
• core:
  – synsets (“concepts”)
  – hypernym links
• later added additional relationships:
  – part-of
  – substance-of (“pavement is substance of road”)
  – causes (“revive causes come-to”)
  – entails (“sneeze entails exhale”)
  – antonyms (“appearance”/“disappearance”)
  – possible-values (“disposition” = {“willing”, “unwilling”})
• Currently at version 2.0
  – What will version 10.0 (say) look like?
  – What should it look like?
  – Is it / could it migrate towards more of a knowledge base?
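
For readers who want to inspect these relation types directly, a short sketch using NLTK's WordNet interface (NLTK ships a later WordNet than the version 2.0 discussed here, so exact links may differ):

# Browse the WordNet relation types listed above via NLTK's WordNet interface.
from nltk.corpus import wordnet as wn   # may first require: nltk.download('wordnet')

road = wn.synset("road.n.01")
print(road.hypernyms())             # hypernym links
print(road.part_meronyms())         # part-of links
print(road.substance_meronyms())    # substance-of links (e.g., pavement, if present)

sneeze = wn.synset("sneeze.v.01")
print(sneeze.entailments())         # entails (slide example: sneeze entails exhale)

for s in wn.synsets("revive", pos=wn.VERB):
    if s.causes():                  # causes links (slide example: revive causes come-to)
        print(s.name(), "->", s.causes())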

Page 17:

Why Use WordNet

• It’s a comprehensive ontology (approx. 120,000 concepts)
• Links between concepts (synsets) and lexical items (words)
• Simple structure, easy to use
• Rich population of hypernym links

Page 18:

Problems with WordNet

• Too fine-grained word senses
  – e.g., “cut” has 41 senses, including:
    • cut (separate)
    • cut grain
    • cut timber
    • cut my hair
    • cut (knife cuts)
  – linguistically, not representationally, motivated
    • e.g., “cut grain” is a sense just because a word happens to exist for it (“harvest”)
  – representationally, many senses share a core meaning
    • but the commonality is not captured (see the sketch after this slide)
• Missing concepts/senses that have no English word (+ some simply forgotten)
  – e.g., goal-oriented entity (person, corporation, country)
  – difference between physical and legal ownership
• Single inheritance (mainly)
  – very different to Cyc, which uses multiple inheritance a lot
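
A quick way to see the sense-granularity problem for yourself, again via NLTK's WordNet (the 41-sense count was for WordNet 2.0; later versions may differ slightly):

# Count the verb senses of "cut" and group them by their direct hypernyms,
# showing how many fine-grained senses sit under the same core meaning.
from collections import defaultdict
from nltk.corpus import wordnet as wn

senses = wn.synsets("cut", pos=wn.VERB)
print(len(senses), "verb senses of 'cut'")

by_hypernym = defaultdict(list)
for s in senses:
    for h in s.hypernyms() or [None]:
        by_hypernym[h].append(s.name())

for hyper, members in sorted(by_hypernym.items(), key=lambda kv: -len(kv[1])):
    print(hyper.name() if hyper else "(no hypernym)", "<-", members)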

Page 19:

Problems with WordNet

• “isa” (hypernym) hierarchy is broken in many places
  – sometimes means “part of”
    • e.g., Connecticut -> America
  – mixes instance-of and subclass-of
    • e.g., Paris -> Capital_City -> City
  – many links seem strange/questionable
    • e.g., “launch” is a type of “propel”?
    • again, psychologically rather than representationally motivated
  – has major implications for reasoning
• Semantics of relationships can be fuzzy/asymmetric
  – “motor vehicle” has-part “engine” means…?
  – “car” has-part “running-board”

Page 20:

Problems with WordNet

• Many relationships missing
  – Simple:
    • verbs/nominalizations (e.g., “plan” (v) vs. “plan” (n))
    • adverbs/adjectives (“rapidly” vs. “rapid”)
  – Conceptual; many, in particular:
    • causes
    • material
    • instrument
    • content
    • beneficiary
    • recipient
    • result
    • destination
    • shape
    • location

Page 21:

What we’d like instead…

• Want a knowledge resource that can provide rich expectations about the world
  – to help interpret ambiguous input
  – to infer additional facts beyond those in the input
  – to create a coherent model from fragmented input
• It would have:
  – a small set of “core” theories about the world
    • containers, transportation, movement, space
    • probably hand-built
  – many “mundane” facts which instantiate those theories in various ways

Page 22:

What we’d like instead…

Core theories, e.g., Transportation:
  OBJECTS can be at PLACES
  VEHICLES can TRANSPORT OBJECTS from PLACE to PLACE
  TRANSPORT requires the OBJECT to be IN the VEHICLE
  BEFORE an OBJECT is TRANSPORTED by a VEHICLE from PLACE1 to PLACE2, the OBJECT and VEHICLE are at PLACE1
  etc.

Basic facts/instantiations:
  Cars are vehicles
  Cars can transport people
  Cars travel along roads
  Ships can transport people or goods
  Ships can transport over water between ports
  Rockets can transport satellites into space
  …
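
An illustrative sketch, under an assumed state representation (not the deck's actual formalism), of how the TRANSPORT precondition above could be encoded and checked against one of the instantiations:

# Illustrative encoding of one Transportation axiom:
# BEFORE an OBJECT is TRANSPORTED by a VEHICLE from PLACE1 to PLACE2,
# the OBJECT and the VEHICLE are at PLACE1 (and the OBJECT is IN the VEHICLE).
from dataclasses import dataclass, field

@dataclass
class World:
    at: dict = field(default_factory=dict)        # thing -> place
    inside: dict = field(default_factory=dict)    # thing -> container/vehicle

def transport(world, obj, vehicle, place1, place2):
    """Apply TRANSPORT if its preconditions hold; return whether it succeeded."""
    preconditions = (world.at.get(obj) == place1 and
                     world.at.get(vehicle) == place1 and
                     world.inside.get(obj) == vehicle)
    if not preconditions:
        return False
    world.at[obj] = place2          # effect: object and vehicle end up at PLACE2
    world.at[vehicle] = place2
    return True

# Instantiation: "Rockets can transport satellites into space"
w = World(at={"satellite1": "launchpad", "rocket1": "launchpad"},
          inside={"satellite1": "rocket1"})
print(transport(w, "satellite1", "rocket1", "launchpad", "orbit"))  # True
print(w.at["satellite1"])                                           # 'orbit'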

Page 23:

Some Questions

• To what extent can a rulebase be simplified into a set of database-like tables?

• How much of the table-like knowledge can we learn automatically?

• How can we reason with “messy knowledge” that such a database inevitably contains?

• How can we represent different views/perspectives?
• Why not use Cyc?
• How can we address WordNet’s deficiencies efficiently?
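
One reading of the first question, sketched with an invented can_transport table: many of the “mundane” generic facts become rows in a database-like table that can be queried directly, with no theorem prover involved:

# Hypothetical sketch: generic facts as a database-like table, queried with plain lookups.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE can_transport (vehicle TEXT, cargo TEXT, medium TEXT)")
db.executemany("INSERT INTO can_transport VALUES (?, ?, ?)", [
    ("car", "people", "road"),
    ("ship", "people", "water"),
    ("ship", "goods", "water"),
    ("rocket", "satellites", "space"),
])

# "What can transport people?"
print(db.execute(
    "SELECT vehicle FROM can_transport WHERE cargo = 'people'").fetchall())
# [('car',), ('ship',)]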

Page 24:

The “Common Sense” KB

• An attempt to rapidly accumulate some core knowledge and “routine” facts, to support:
  – specific applications
  – research in how to work with all this knowledge
• Features:
  – knowledge (mainly) entered in simple English
  – interactively interpreted to KM (logic) structures
  – using WordNet’s ontology + UT’s “slot” library
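
A very rough, purely illustrative sketch of the flavor of simple-English entry (the real system interprets sentences interactively into KM structures using WordNet senses and UT's slot library; the patterns and slot names below are invented):

# Toy illustration only: turn a "simple English" fact into a slot/filler triple.
import re

def singularize(word):
    """Crude plural stripping, good enough for this toy example."""
    return word[:-1] if word.endswith("s") else word

SLOT_PATTERNS = [
    (re.compile(r"^(\w+) have (\w+)$", re.I), "has-part"),
    (re.compile(r"^(\w+) consume (\w+)$", re.I), "consumes"),
    (re.compile(r"^(\w+) can transport (\w+)$", re.I), "can-transport"),
]

def interpret(sentence):
    for pattern, slot in SLOT_PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            return (singularize(m.group(1).lower()), slot,
                    singularize(m.group(2).lower()))
    return None

for s in ["Cars have engines", "Cars consume fuel", "Rockets can transport satellites"]:
    print(s, "->", interpret(s))
# ('car', 'has-part', 'engine'), ('car', 'consumes', 'fuel'), ('rocket', 'can-transport', 'satellite')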

Page 25:

Why Simple Language-based Entry?

• Seems to be easier and faster than formal encoding
  – but more restricted
• More comprehensible & accessible
• Viable (if a dictionary is a good model of scope…)
• Ontologically less committal (can reinterpret)
• Forces us to face some key issues
  – ambiguity, conflict, “messy” knowledge
• Step towards more extensive language processing
• Costs: more infrastructure needed, limited expressivity, still need to understand some KR

Page 26:

Demo

Page 27:

Or…

• Can (at least some of) this basic world knowledge be acquired automatically? e.g.,
  – Girju
  – Etzioni
  – Schubert

Page 28:

Knowledge Mining

Schubert’s Conjecture: there is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content, which can be harnessed.

“The camouflaged helicopter landed near the embassy.”
  → helicopters can land
  → helicopters can be camouflaged

Our attempt: “lightweight” LFs generated from Reuters. LF forms:
  (S subject verb object (prep noun) (prep noun) …)
  (NN noun … noun)
  (AN adj noun)
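
A rough approximation of such tuple extraction, sketched here with spaCy dependency parses rather than the original pipeline's parser (the model name and the exact output are illustrative):

# Rough sketch of "lightweight LF" extraction: for each verb, collect its
# subject, direct object, and (preposition, noun) modifiers.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this small English model is installed

def lightweight_lfs(text):
    tuples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subj = [c.lemma_ for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj  = [c.lemma_ for c in tok.children if c.dep_ == "dobj"]
            preps = [(c.lemma_, p.lemma_)
                     for c in tok.children if c.dep_ == "prep"
                     for p in c.children if p.dep_ == "pobj"]
            if subj:
                tuples.append(("S", subj[0], tok.lemma_, obj[0] if obj else None, preps))
    return tuples

print(lightweight_lfs("The camouflaged helicopter landed near the embassy."))
# e.g. [('S', 'helicopter', 'land', None, [('near', 'embassy')])]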

Page 29:

Knowledge Mining

Newswire Article:

  HUTCHINSON SEES HIGHER PAYOUT. HONG KONG. Mar 2. Li said Hong Kong’s property market remains strong while its economy is performing better than forecast. Hong Kong Electric reorganized and will spin off its non-electricity related activities. Hongkong Electric shareholders will receive one share in the new subsidiary for every owned share in the sold company. Li said the decision to spin off …

Implicit, tacit knowledge:

  Shareholders may receive shares.
  Companies may be sold.
  Shares may be owned.
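
A toy sketch of the generalization step implied by these examples: drop the specific entities, keep the head nouns, and emit hedged generic propositions (the pluralization here is deliberately crude and purely illustrative):

# Toy generalization: abstract mined verb tuples into hedged generic propositions
# like those on the slide ("Shareholders may receive shares.").
def plural(noun):
    if noun.endswith("y"):
        return noun[:-1] + "ies"     # company -> companies
    return noun + "s"                # shareholder -> shareholders

def generalize(subj, verb, obj=None):
    prop = f"{plural(subj).capitalize()} may {verb}"
    if obj:
        prop += f" {plural(obj)}"
    return prop + "."

print(generalize("shareholder", "receive", "share"))   # Shareholders may receive shares.
print(generalize("company", "be sold"))                # Companies may be sold.
print(generalize("share", "be owned"))                 # Shares may be owned.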

Page 30:

Knowledge Mining – our attempt

(S "history" "fall" ("out of" "view"))(S "rate" "fall on" (NIL "tuesday") ("to" "percent"))(S "index" "rally" "point" ("with" "volume") ("at" "share"))(S "you" "have" "decline" ("in" "inflation") ("in" "rate"))(S "you" "have" "decline" ("in" "figure") ("in" "rate"))(S "you" "have" "decline" ("in" "lack") ("in" "rate"))(S "Boni" "be wary")(S "recovery" "be" "led")(S "evidence" "patchy")(S "expansion" "be worth")(S "we" "be content")(S "investment" "boost" "sale" ("in" "zone"))(S "Eaton" "say" (S "investment" "boost" "sale" ("in" "zone")))(S "it" "grab" "portion" ("away from" "rival"))

Fragment of the raw data (Reuters)

Page 34:

Knowledge Mining… what next?

• What could we do with all this data?
  – Use it to bias the parser
  – Extra source of knowledge for the KB
    • source of input sentences for our system?
    • possibilities (“this deduction looks coherent”)
• But:
  – Ambiguity makes it hard to use
    • word senses, relationships
  – No notion of “relevance”
  – Many types of knowledge not mined
    • e.g., rule-like, script-like

Page 35:

Summary

• Machine understanding = building a coherent model
• Requires lots of world knowledge
  – core theories + lots of “mundane” facts
• WordNet
  – a potentially useful resource, but with many problems
  – slowly and manually becoming more KB-like
• There’s a lot of potential to jump ahead with text-mining methods
  – e.g., Schubert’s approach
  – KnowItAll
• We would like to use the results for reasoning!!