2005-3-28 Beini Ouyang

Phrase Matching: Assessing Document Similarity for

NASA Scientists and Engineers

Beini Ouyang

Department of Computer Science

The University of [email protected]

Advisor: Dr. Randy K. Smith


Outline

Problem & Motivation
Background & Related Work
Approach & Uniqueness
Results and Contributions


Problem & Motivation

Problem
  NASA scientists and engineers deal with hundreds of thousands of technical standards from hundreds of organizations.
  They need to know:
    The most current and relevant information
    What related knowledge is available

A mechanism is needed to assist in answering the following questions:
  Are there similar technical standards available?
  Is there training material related to this standard?
  Are there documented lessons learned related to this standard?


Problem & Motivation

Figure 1. Relationship between Lessons Learned, Training Material and Technical Standards


Motivation

A lot of work has been done on document search
  Exploiting matching strategies to address the issue of locating similar documents
  Generally based on the frequency of single words
    Single word: supplied keywords or generated by indexing the document of interest

Result:
  The efficiency and precision of the search degrade as the document size and the number of documents grow


Motivation

We propose an approach that emphasizes word-phrase indexes over single-word indexes.

Goal: find fewer but more precisely related documents.

Phrase-based search will be used to refine the results.


BACKGROUND & RELATED WORK

Background

NASA’s Technical Standards Program (NTSP) provides access to over 1600 NASA agency-wide preferred technical standards, over 45,000 standards from other government groups, and more than 95,000 standards from over 145 national and international SDOs (Standards Development Organizations), committees and working groups.

The Lessons Learned and Best Practices (LLBP) include NASA published lessons and links to over 30 lessons-learned databases from government and non-government organizations.



The SA_MetaMatch tool was developed to aid the discovery and linking of related standards and lessons learned documents.

The SA_MetaMatch tool is a component of the larger Standard Advisors Project.

SA_MetaMatch was designed for finding similar documents in NASA experience databases using single word scoring across document meta-data.



Related Work: SA_MetaMatch

First, the scheme builds metadata elements adopted from the Dublin Core (DC).

Then, the main focus is on integrating Dublin Core with the metadata for each document.

After the indices are extracted and generated from document content, they serve as a benchmark for finding possibly related documents.

In addition, SA_MetaMatch adopts a word-scoring mechanism for ranking the result documents.
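The slides do not give the scoring formula used by SA_MetaMatch, so the following is only a minimal sketch of single-word scoring in general. The stopword list, the function names, and the choice to sum the smaller of the two counts for each shared word are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used by the real tool is not given in the slides.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "to", "is", "are"}

def word_index(text):
    """Count non-stopword tokens in a document (a simple single-word index)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def word_score(target_index, candidate_index):
    """Assumed score: for each shared word, add the smaller of the two counts."""
    shared = target_index.keys() & candidate_index.keys()
    return sum(min(target_index[w], candidate_index[w]) for w in shared)

def rank_by_words(target_text, candidates):
    """Return (name, score) pairs sorted from highest to lowest score."""
    target = word_index(target_text)
    scored = [(name, word_score(target, word_index(text)))
              for name, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because every shared word adds to the score regardless of context, a candidate that merely reuses common technical vocabulary can still rank well, which is the weakness the phrase-based extension targets.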



Fig. 2 Generate / Edit Metadata Screen

SA_MetaGen_Intf
  Methods: SA_MetaGen_Intf(), initComponents(), CloseButtonMouseClicked(), ResetButtonMouseClicked(), GenMetaButtonMouseClicked(), viewFilterButtonMouseClicked(), viewStopwordButtonMouseClicked(), viewIndexButtonMouseClicked(), viewThesaurusButtonMouseClicked(), viewWordnetButtonMouseClicked(), viewDocButtonMouseClicked(), exitForm(), getFileExt(), genXMLMeta(), writeTmpFile(), preprocess(), genXMLMatchMeta(), genXMLHeader(), genXMLFooter(), genXMLEmptyElement(), genXMLStartElement(), genXMLStartElement(), genXMLEndElement(), genXMLElement(), genXMLElement(), writeXMLFile(), resetAll(), set_meta_home(), main(), getContentPane()

GenIndex
  Methods: GenIndex(), getFileExt(), CallPerl(), CallCommand()

Diagram notes:
  Generate XML metadata
  Call Perl script to call the Swish-e indexer to generate the index

Fig 3. Class Diagram for SA_MetaGen
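The genXML* methods in the class diagram indicate that metadata is written out as XML, and the slides state that the elements are adopted from Dublin Core. The actual file layout is not shown, so the following is only a minimal sketch of what a Dublin Core record could look like. The `metadata` wrapper element, the function name, and all field values are hypothetical; the namespace URI and the element names `title`, `creator`, and `subject` come from the Dublin Core Metadata Element Set, Version 1.1.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, Version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def build_dc_record(fields):
    """Build an XML record with one Dublin Core child element per (name, value) pair."""
    root = ET.Element("metadata")  # hypothetical wrapper element, not from the slides
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root

# Placeholder values for illustration only.
record = build_dc_record([
    ("title", "Example technical standard"),
    ("creator", "Example organization"),
    ("subject", "fracture control"),
    ("subject", "pressure vessels"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Repeating the `subject` element is one plausible way to carry generated index terms in the record, since Dublin Core elements are repeatable.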





SA_MetaMatch
  An effective tool for locating similar documents
  However, it does return a large set of unrelated documents.
  The use of single-word index files in matching to find the related documents
    finds a large number of documents
    slows down the search pace for large documents


APPROACH & UNIQUENESS

Word phrase indexing can play a more significant role in matching documents than single word indexes.

This research explores a phrase-based indexing extension to SA_MetaMatch.

This extension is expected to improve results for NASA NTSP.


APPROACH & UNIQUENESS

The approach taken includes:
  Generating the phrase and word index metadata.
    Naturally, phrase length plays an important role in the indexing and matching process.
    Heuristically, this work begins with a four-word phrase limit.
    The approach taken is:
      Beginning based on the position of the word in the document.
      Recursively generating phrases in terms of word position.
      Limiting the phrase length.
      Matching only the top 20 phrases whose frequency of occurrence is greater than 1.
  Adding a phrase weight score mechanism.
    The phrase carries more weight than the raw index.
    In the end, it can give more specific results than the previous single-word weight score mechanism.
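The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SA_MetaMatch extension itself: the four-word limit, the top-20 cutoff, and the frequency-greater-than-1 filter come from the slides, while the tokenizer, the function names, the `PHRASE_WEIGHT` value, and the way phrase and word matches are combined are hypothetical.

```python
import re
from collections import Counter

MAX_PHRASE_LEN = 4   # four-word phrase limit stated in the slides
TOP_PHRASES = 20     # only the top 20 phrases are matched
PHRASE_WEIGHT = 3.0  # hypothetical value; the slides say only that phrases outweigh words

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extend_phrases(tokens, start, length, counts):
    """Recursively count the phrase at word position `start`, growing it one word at a time."""
    if length > MAX_PHRASE_LEN or start + length > len(tokens):
        return
    if length >= 2:  # a phrase needs at least two words
        counts[" ".join(tokens[start:start + length])] += 1
    extend_phrases(tokens, start, length + 1, counts)

def phrase_index(text):
    """Return up to TOP_PHRASES phrases that occur more than once, with their counts."""
    tokens = tokenize(text)
    counts = Counter()
    for start in range(len(tokens)):
        extend_phrases(tokens, start, 1, counts)
    repeated = [(p, c) for p, c in counts.most_common() if c > 1]
    return dict(repeated[:TOP_PHRASES])

def combined_score(target_text, candidate_text):
    """Assumed scoring: each shared phrase counts PHRASE_WEIGHT, each shared word counts 1."""
    shared_phrases = phrase_index(target_text).keys() & phrase_index(candidate_text).keys()
    shared_words = set(tokenize(target_text)) & set(tokenize(candidate_text))
    return PHRASE_WEIGHT * len(shared_phrases) + len(shared_words)
```

In this sketch both documents are indexed the same way, so a phrase contributes only if it is repeated in both the target and the candidate. Phrases are also allowed to cross sentence boundaries, which a fuller implementation might want to prevent.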


RESULTS AND CONTRIBUTIONS

Fig 4: Single Word Index Frequency

Fig 5: Phrase Word Index Frequency


RESULTS & CONTRIBUTIONS

Preliminary results indicate that phrase-based indexing achieves better results than single-word indexing for certain types of documents.

Our results indicate that phrase-based indexing and matching is most beneficial when examining large documents.

The amortized cost of generating the phrase index is justified by the improved matching precision when the target document and search documents are large.

Future work:
  Examining the 4-word phrase heuristic
  Assessing our weighting scheme


REFERENCES

P. Gill, W. Vaughan, and D. Garcia, “Lessons Learned and Technical Standards: A Logical Marriage,” ASTM Standardization News, November 2001. http://www.astm.org

J. W. Cooper and J. M. Prager, “Anti-Serendipity: Finding Useless Documents and Similar Documents,” Proceedings of the 33rd Hawaii International Conference on System Sciences, Maui, HI, January 2000.

C. Yau and S. Hawker, “SA_MetaMatch: Document Discovery Through Document Metadata and Indexing,” Proceedings of the 42nd Annual ACM Southeast Regional Conference, Huntsville, AL, April 2-3, 2004.

DCMI, “Dublin Core Metadata Element Set, Version 1.1: Reference Description,” 2 June 2003.


Thanks!