Using statistically improbable phrases to automate the creation of an ontology for a technical document collection

Thesis Defense
Joseph Colannino
Tulsa, April 29, 2009
© 2009, all rights reserved


TRANSCRIPT

Page 1: Slides Si Ps For Automated Ontologies

Using statistically improbable phrases to automate the creation of an ontology

for a technical document collection

Thesis Defense

Joseph Colannino, Tulsa, April 29, 2009

© 2009, all rights reserved

Page 2

Overview• Literature survey

Contrast of keyword and semantic searchThe TF x IDF strategyStatistically Improbably Phrases (SIPs)

• Key conceptsSemantic filtering and mappingUse of SIPs to create an ontology

• Method and key results• Limitations• Recommendations for future work

Page 3

Shortcomings of Keyword Search

[Diagram: DOCs → Keyword Extraction → keywords → Cache]

We would hope that keywords can be extracted and cached

Page 4

Shortcomings of Keyword Search

[Diagram: DOCs → Keyword Extraction → keywords → Cache]

We would hope that keywords can be extracted and cached, leading to accurate document retrieval

Page 5

Shortcomings of Keyword Search

[Diagram: DOCs → Keyword Extraction → keywords → Cache, now returning many extraneous DOCs]

Unfortunately, we retrieve a lot of extraneous stuff as well.

Page 6

Shortcomings of Keyword Search

[Diagram repeated, with still more extraneous DOCs returned]

Page 7

Advantages of Semantic Search

[Diagram: DOCs → semantic extraction → semantic tokens → semantic container]

If some black box existed that could extract meaning or thoughts and cache them, we could link our thoughts with thoughts in documents

Page 8

Advantages of Semantic Search

[Diagram repeated from the previous slide]

Page 9

Why are keywords not good semantic tokens?

[Diagram: words {red, green, blue} map one-to-one to thoughts {color: red, color: green, color: blue} via t = φ(w)]

If words mapped to thoughts bijectively, we could use keywords to search for semantics

Page 10

Why are keywords not good semantic tokens?

[Diagram: words {red, green, blue, livid, sullen, neophytic} map many-to-many to thoughts {color: red, color: green, color: blue, disposition: depressed, disposition: inexperienced, disposition: angry} via t = γ(w)]

But they don’t. Words have many more than one meaning, and meanings may be expressed by many more than one word

Page 11

Why are keywords not good semantic tokens?

[Diagram repeated, with a source query failing to find its target through the many-to-many mapping]

This makes it tough to match a source query with its target

Page 12

Why are keywords not good semantic tokens?

• Words have meaning only in context*

WORD   # Meanings   Some alternate meanings
john   33           toilet, anonymous…
car    41           railcar, wagon, coach…
is     8            information systems…
blue   87           sad, vulgar, restrictive…

(33 x 41 x 8 x 87) x 4! = 22,600,512

If there are 22 million possible meanings for the sentence “John’s car is blue,” why isn’t anyone confused about what it means?

*Frege, G. 1892. Über Sinn und Bedeutung, in Zeitschrift für Philosophie und philosophische Kritik, 100: 25-50. Translated as ‘On Sense and Reference’ by M. Black in Translations from the Philosophical Writings of Gottlob Frege, P. Geach and M. Black (eds. and trans.), Oxford: Blackwell, third edition, 1980.
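The count on the slide can be checked directly. A quick sketch in Python, mirroring the slide's own assumption that the 4! word orderings multiply in:

```python
from math import factorial

# Meaning counts from the table on the slide
meanings = {"john": 33, "car": 41, "is": 8, "blue": 87}

total = 1
for m in meanings.values():
    total *= m          # 33 x 41 x 8 x 87 = 941,688 sense combinations
total *= factorial(4)   # times 4!, as on the slide

print(total)  # 22600512
```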

Page 13

Why are keywords not good semantic tokens?

• Words have meaning only in context*

[Table of meaning counts repeated from the previous slide]

(33 x 41 x 8 x 87) x 4! = 22,600,512

Why doesn’t “John’s car is blue” mean that the “toilet’s railcar information systems are sad”?

*Frege, 1892, op. cit.

Page 14

Semantic Filtering

[Diagram: context filters the meanings of john (33), car (41), is (8), and blue (87) down to a single restricted semantics]

John’s car is blue ≡ The car belonging to the person named John is colored blue

Because context acts as a filter to restrict meaning. With so many restrictions, only one meaning is left. Therefore, phrases have narrower semantics than keywords.

Page 15

Semantic Filtering

• Owing to semantic filtering, word phrases contain more precise (more restricted) semantics than individual words or Boolean constructions thereof

• If we were able to automatically generate meaningful phrases from a document, we could begin to search semantically

Page 16

Statistically Improbable Phrases

• A statistically improbable phrase (SIP) is a contiguous cluster of words that occurs

1. with high frequency within a document (high term frequency, TF)

*AND*

2. with low frequency in the document collection (high inverse document frequency, IDF)

Page 17

Statistically Improbable Phrases

• Highly repeated phrases within a document are statistically correlated with meaningful phrases, because humans emphasize important points via repetition

• Therefore, phrases occurring with high term frequency (TF) in a document are likely to be meaningful

• To filter out pedestrian words and turns of phrase, the phrase must also occur outside the document with low frequency (i.e., high inverse document frequency, IDF)

• The resulting strategy is known as TF x IDF
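A minimal sketch of the TF x IDF score in Python. Amazon.com's exact SIP algorithm is not published, so the weighting below (normalized count with a smoothed log IDF) is only one common variant, and the phrase counts and document frequencies are hypothetical:

```python
import math

def tf_idf(count_in_doc, doc_len, docs_containing, n_docs):
    """TF x IDF for one phrase: frequent in this document (high TF),
    rare across the collection (high IDF)."""
    tf = count_in_doc / doc_len
    idf = math.log(n_docs / (1 + docs_containing))  # +1 avoids division by zero
    return tf * idf

# Hypothetical numbers: a technical phrase vs. a pedestrian one
sip_score = tf_idf(count_in_doc=4, doc_len=1000, docs_containing=2, n_docs=10000)
common_score = tf_idf(count_in_doc=19, doc_len=1000, docs_containing=9999, n_docs=10000)
# The rare phrase outscores the common one despite its lower raw count
```

The IDF factor is what does the filtering: a phrase that appears in nearly every document gets a log term near zero, no matter how often it repeats locally.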

Page 18

Jabberwocky

Here is a nonsense poem: “…

Beware the Jabberwock, my son! The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun the frumious Bandersnatch

…”

From “Jabberwocky,” Lewis Carroll; Through the Looking-Glass, and What Alice Found There, 1872

Page 19

Jabberwocky – Term Frequency

Let’s post the TF for each word in the poem:

“…
Beware(2) the(19) Jabberwock(4), my(3) son! The(19) jaws that(2) bite, the(19) claws that(2) catch!
Beware(2) the(19) Jubjub bird, and(15) shun the(19) frumious Bandersnatch
…”

Page 20

Jabberwocky – Term Frequency

Now let’s filter out the low-TF words:

“…
Beware(2) the(19) Jabberwock(4), my(3) son! The(19) jaws that(2) bite, the(19) claws that(2) catch!
Beware(2) the(19) Jubjub bird, and(15) shun the(19) frumious Bandersnatch
…”

Page 21

Jabberwocky – Term Frequencies

• Considering TF only, “the” is the most important word!

the(19), and(15), Jabberwock(4), my(3), beware(2), that(2)

Page 22

Jabberwocky – TF x IDF

• But considering TF x IDF, Jabberwock emerges as the seminal construct!

[Diagram: the document’s TF list (the, and, my, beware, that, Jabberwock) is crossed with the document collection’s IDF; ubiquitous words (the, of, to, in, and, a, for, was, is, that, …) are filtered out, leaving Jabberwock]
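The Jabberwocky example can be run numerically. The TF counts below come from the slides; the document frequencies are invented for illustration (function words appear in nearly every document of a hypothetical 1000-document collection, “Jabberwock” in almost none):

```python
import math
from collections import Counter

# Term frequencies from the excerpt on the slides
tf = Counter({"the": 19, "and": 15, "Jabberwock": 4, "my": 3,
              "beware": 2, "that": 2})

n_docs = 1000  # hypothetical collection size
df = {"the": 1000, "and": 1000, "my": 990, "that": 1000,
      "beware": 500, "Jabberwock": 1}  # assumed document frequencies

# TF x IDF: common words are driven toward zero by log(n_docs / df)
tfidf = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}

top_by_tf = max(tf, key=tf.get)           # "the"
top_by_tfidf = max(tfidf, key=tfidf.get)  # "Jabberwock"
```

With these assumptions “the” scores exactly zero (it appears in every document), while “Jabberwock” dominates despite its modest raw count.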

Page 23

What is an Ontology?

• Thesis title: Using statistically improbable phrases to automate the creation of an ontology for a technical document collection

Page 24

What is an Ontology?

• “Ontologies have been defined in many ways, some of which contradict each other.”

• “The broadest usage of ontology is for formal representations of what, to a human, is common sense. …”

A. Taylor. 2004. The Organization of Information, Libraries Unlimited, Westport CT, p 282, emphasis mine

Page 25

What is an Ontology?

• “The Semantic Web is a vision for the future of the Web in which information is given explicit meaning.”

• “An ontology defines the terms used to describe and represent an area of knowledge”

OWL Web Ontology Language, Use Cases and Requirements, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/webont-req/#onto-def, emphasis mine

Page 26

What is an Ontology?

• An ontology is “a collection of tokenized meanings.”

J. Colannino. 2009. Using statistically improbable phrases to automate the creation of an ontology for a technical document collection, p 2, Masters Thesis, University of Oklahoma, Tulsa.

Page 27

Automating Ontology Creation

[Diagram: DOCs → TF x IDF (semantic extraction) → SIPs (semantic tokens) → ontology (semantic container) → semantic search]

Page 28

Automating Ontology Creation

[Diagram repeated from the previous slide]

Now we see that TF x IDF can be used as an extraction algorithm, the ontology is the container, and the SIPs are the semantic tokens
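The pipeline on this slide can be sketched in a few lines of Python. Phrase segmentation is reduced to two-word windows and the retention rule is simply "positive TF x IDF"; both are illustrative assumptions, not the thesis method:

```python
import math
from collections import Counter

def build_ontology(docs):
    """DOCs -> TF x IDF extraction -> SIPs -> ontology (token container).
    The returned ontology maps each retained phrase to the documents it indexes."""
    n = len(docs)
    per_doc = []
    df = Counter()  # document frequency of each two-word phrase
    for text in docs:
        words = text.lower().split()
        phrases = Counter(" ".join(p) for p in zip(words, words[1:]))
        per_doc.append(phrases)
        df.update(phrases.keys())
    ontology = {}
    for i, phrases in enumerate(per_doc):
        total = sum(phrases.values())
        for phrase, count in phrases.items():
            score = (count / total) * math.log(n / df[phrase])
            if score > 0:  # drop phrases shared by every document
                ontology.setdefault(phrase, []).append(i)
    return ontology

# Two toy documents; "burner throat" should survive as a token for doc 0
docs = ["the burner throat glows near the burner throat",
        "the weather report says the weather is mild"]
onto = build_ontology(docs)
```

The ontology here is just a dictionary, which is enough to show the container role the slide describes: semantic search becomes a lookup from SIP token to documents.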

Page 29

Selecting the Database

To perform the analysis, the database had to be

• Large enough
• Inexpensive
• Indexed
• Available to the public
• Accessible to the researcher

Page 30

Method

• So we used Amazon.com

• Selected a seminal text

• Then used its SIPs to interrogate other collections such as the World Wide Web

• We compared the results with keyword searches

The next page shows the logic

Page 31

Method

[Flowchart: Begin → Specify document collection → Select seminal document → Find related docs using SIPs → Related docs (348) → Use SMEs to score docs → MCSPA → Use SIPs to find MCSPA and Use keywords to find MCSPA → Analyze → Report Results]

Page 32

Method, 25 SIPs in Seminal Text (MCSPA)

air preheat temperature
blocking generators
bridgewall temperature
burner throat
diffusion burner
dummy factors
ethylene cracking units
flame dimensions
flame quality
floor burners
function shape analysis
genuine replicates
heat flux curve
heat flux profile
hidden extrapolation
hydrogen reformers
insulation pattern
lenient sense
lurking factors
normal matrix equation
normalized heat flux
refinery fuel gas
substoichiometric combustion
thermoacoustic coupling
tramp air

Page 33

Method, 348 Documents

Page 34

Method

• 25 SIPs in seminal text, shared by
• 348 books, collectively comprising
• 5829 unique SIPs

Page 35

Results: SIP Search

Google Search Results for SIPs on the Worldwide Web, Unrestricted as to Document Type
Placement of MCSPA in Google Search

SIP                            Placement   1   ≤5  ≤10  ≤20  ≤50
air preheat temperature             48     N    N    N    N    Y
blocking generators                  7     N    N    Y    Y    Y
bridgewall temperature              13     N    N    N    Y    Y
burner throat                       16     N    N    N    Y    Y
diffusion burner                    17     N    N    N    Y    Y
dummy factors                       14     N    N    N    Y    Y
ethylene cracking units             10     N    N    Y    Y    Y
flame dimensions                    17     N    N    N    Y    Y
flame quality                       15     N    N    N    Y    Y
floor burners                       50     N    N    N    N    Y
function shape analysis              1     Y    Y    Y    Y    Y
genuine replicates                   1     Y    Y    Y    Y    Y
heat flux curve                      1     Y    Y    Y    Y    Y
heat flux profile                    7     N    N    Y    Y    Y
hidden extrapolation                 4     N    Y    Y    Y    Y
hydrogen reformers                   7     N    N    Y    Y    Y
insulation pattern                  20     N    N    N    Y    Y
lenient sense                        4     N    Y    Y    Y    Y
lurking factors                      1     Y    Y    Y    Y    Y
normal matrix equation               4     N    Y    Y    Y    Y
normalized heat flux                 1     Y    Y    Y    Y    Y
refinery fuel gas                  380     N    N    N    N    N
substoichiometric combustion         8     N    N    Y    Y    Y
thermoacoustic coupling              3     N    Y    Y    Y    Y
tramp air                            8     N    N    Y    Y    Y

Number by Category                         5    9   15   22   24
Percentage by Category                   20%  36%  60%  88%  96%

88% of SIPs found the seminal document on the World Wide Web in the first 20 returns

Page 36

Results: Boolean Search

Results of Boolean Search on Google.com
Placement of MCSPA in Google Search

Search Term                        Placement   1   ≤5  ≤10  ≤20  ≤50
air AND preheat AND temperature       >100    N    N    N    N    N
blocking AND generators                  4    N    Y    Y    Y    Y
bridgewall AND temperature              20    N    N    N    Y    Y
burner AND throat                       48    N    N    N    N    Y
diffusion AND burner                    54    N    N    N    N    N
dummy AND factors                     >100    N    N    N    N    N
ethylene AND cracking AND units       >100    N    N    N    N    N
flame AND dimensions                  >100    N    N    N    N    N
flame AND quality                     >100    N    N    N    N    N
floor AND burners                        2    N    Y    Y    Y    Y
function AND shape AND analysis       >100    N    N    N    N    N
genuine AND replicates                   2    N    Y    Y    Y    Y
heat AND flux AND curve               >100    N    N    N    N    N
heat AND flux AND profile             >100    N    N    N    N    N
hidden AND extrapolation                 7    N    N    Y    Y    Y
hydrogen AND reformers                >100    N    N    N    N    N
insulation AND pattern                >100    N    N    N    N    N
lenient AND sense                       29    N    N    N    N    Y
lurking AND factors                      3    N    Y    Y    Y    Y
normal AND matrix AND equation        >100    N    N    N    N    N
normalized AND heat AND flux            43    N    N    N    N    Y
refinery AND fuel AND gas             >100    N    N    N    N    N
substoichiometric AND combustion        12    N    N    N    Y    Y
thermoacoustic AND coupling             21    N    N    N    N    Y
tramp AND air                           12    N    N    N    Y    Y

Number by Category                            0    4    5    8   12
Percentage by Category                       0%  16%  20%  32%  48%

Only 32% of keywords did the same
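The category counts in the two tables above follow mechanically from the placement columns. A sketch (None stands for ">100"):

```python
def placement_categories(placements, thresholds=(1, 5, 10, 20, 50)):
    """Count how many search terms returned the target (MCSPA) at or below
    each placement threshold, as in the 'Number by Category' rows above."""
    return [sum(1 for p in placements if p is not None and p <= t)
            for t in thresholds]

# Placement columns transcribed from the two tables
sip = [48, 7, 13, 16, 17, 14, 10, 17, 15, 50, 1, 1, 1, 7, 4, 7, 20, 4, 1, 4,
       1, 380, 8, 3, 8]
boolean = [None, 4, 20, 48, 54, None, None, None, None, 2, None, 2, None,
           None, 7, None, None, 29, 3, None, 43, None, 12, 21, 12]

print(placement_categories(sip))      # [5, 9, 15, 22, 24]
print(placement_categories(boolean))  # [0, 4, 5, 8, 12]
```

Dividing by the 25 terms reproduces the percentage rows (20/36/60/88/96% and 0/16/20/32/48%).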

Page 37

Results: Boolean Vs. SIP

[Chart: Cumulative percent of terms (confidence) vs. return placement of seminal text (information overload). SIP: y = 0.2288 Ln(x) + 0.0741, R² = 0.94. Boolean: y = 0.1294 Ln(x) - 0.0368, R² = 0.98]

Here are the results graphically. We can think of the cumulative % of terms as a kind of confidence that we will get a return within a given placement. For example, we have about 75% confidence that SIPs will return the target text in the first 20 hits

Page 38

Results: Boolean Vs. SIP

[Chart repeated from the previous slide]

We can think of the return placement as a kind of measure of information overload. We want our target to be returned as the first hit. If it’s returned as hit number 50, then we are overloaded with lots of extraneous information

Page 39

Results: Boolean Vs. SIP

[Chart repeated, annotated with a 2.2 confidence ratio at placement 20]

We can define a confidence ratio as shown. For example, if we desire our target to be returned in the first 20 hits, then SIPs give us more than double the confidence that this will be so compared with keywords.

Page 40

Results: Boolean Vs. SIP

[Chart repeated, annotated with a 9.8 overload ratio at 50% confidence]

We can also define an overload ratio as shown. For example, for 50% confidence, we need to examine about 10 times more returns to find our target if we use Boolean keyword constructions than if we use SIPs
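Both ratios follow directly from the fitted curves on the chart; a sketch using the regression coefficients from the slides:

```python
import math

# Fitted confidence curves from the chart (cumulative % of terms vs. placement x)
def conf_sip(x):
    return 0.2288 * math.log(x) + 0.0741   # R^2 = 0.94

def conf_bool(x):
    return 0.1294 * math.log(x) - 0.0368   # R^2 = 0.98

def placement(conf, a, b):
    """Invert y = a ln(x) + b: placement needed to reach a given confidence."""
    return math.exp((conf - b) / a)

# Confidence ratio at 20 returns: about 2.2
confidence_ratio = conf_sip(20) / conf_bool(20)

# Overload ratio at 50% confidence: about 9.8
overload_ratio = placement(0.5, 0.1294, -0.0368) / placement(0.5, 0.2288, 0.0741)
```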

Page 41

Results: Boolean Vs. SIP

[Chart: confidence ratio (left axis, stabilizing near 2.2) and overload ratio (right axis, growing past 9.8) vs. placement (information overload)]

The confidence ratio tends to stabilize after the first 10 hits, but the overload ratio continues to grow: for example, if you want 90% confidence that your search returns the target, you need to examine almost 40x more returns with keywords compared to SIPs.

Page 42

Limitations of the Work

• Restricted to comparisons between other documents and a single seminal text

• Investigated voluminous documents, i.e., books

• Presumed that Amazon.com’s SIP strategy was indicative of TF x IDF strategies in general

Page 43

Limitations of the Work

• Neither TF x IDF nor SIPs are worry free:
  - non-standard terminology
  - author-coined terms
  - evolving nomenclature
  - synonymic constructions

• Although SIPs have narrower semantics than keywords, they may still be too broad

• SIPs cannot capture all thoughts

Page 44

Recommendations

• Expand the work to include more than a single seminal text

• Examine shorter works such as technical reports and memoranda

• Modify search engines to cache SIP tokens as well as keywords and begin to enable semantic search