p. 1 2005-3-28 Beini Ouyang
Phrase Matching: Assessing Document Similarity for
NASA Scientists and Engineers
Beini Ouyang
Department of Computer Science
The University of [email protected]
Advisor: Dr. Randy K. Smith
Outline
  Problem & Motivation
  Background & Related Work
  Approach & Uniqueness
  Results and Contributions
Problem & Motivation
Problem
  NASA scientists and engineers deal with hundreds of thousands of technical standards from hundreds of organizations.
  They need to know:
    The most current and relevant information
    What related knowledge is available
A mechanism is needed to assist in answering the following questions:
  Are there similar technical standards available?
  Is there training material related to this standard?
  Are there documented lessons learned related to this standard?
Problem & Motivation
Figure 1. Relationship between Lessons Learned, Training Material, and Technical Standards
[Diagram nodes: Training Material, Technical Standards, Lessons Learned]
Motivation
A lot of work has been done on document search:
  Matching strategies are exploited to address the issue of locating similar documents.
  They are generally based on the frequency of single words.
    Single word: supplied as a keyword or generated by indexing the document of interest
Result:
  Search efficiency and precision degrade as the document size and the number of documents grow.
Motivation
We propose an approach that emphasizes word-phrase indexes over single-word indexes.
Goal: find fewer, but more precisely related, documents.
Phrase-based search will be used to refine the results.
BACKGROUND & RELATED WORK
Background
  NASA's Technical Standards Program (NTSP) provides access to over 1600 NASA agency-wide preferred technical standards, over 45,000 standards from other government groups, and more than 95,000 standards from over 145 national and international SDOs (Standards Development Organizations), committees, and working groups.
  The Lessons Learned and Best Practices (LLBP) include NASA-published lessons and links to over 30 lessons-learned databases from government and non-government organizations.
BACKGROUND & RELATED WORK
The SA_MetaMatch tool was developed to aid the discovery and linking of related standards and lessons learned documents.
The SA_MetaMatch tool is a component of the larger Standard Advisors Project.
SA_MetaMatch was designed for finding similar documents in NASA experience databases using single word scoring across document meta-data.
BACKGROUND & RELATED WORK
Related Work: SA_MetaMatch
  First, a set of metadata elements adopted from the Dublin Core (DC) is built.
  Then, the Dublin Core elements are integrated with the metadata for each document.
  After the indices are extracted and generated from document content, they serve as a benchmark for finding possibly related documents.
  In addition, SA_MetaMatch adopts a word-scoring mechanism for ranking the resulting documents.
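To make the single-word scoring idea concrete, here is a minimal Python sketch that ranks candidate documents by the index words they share with a target document. The slides do not give SA_MetaMatch's actual formula, so the tokenizer, the stopword list, and the score (for each shared word, the smaller of its two frequencies, summed) are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; the tool's real list is not shown in the slides.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def word_index(text):
    """Return a Counter of lowercase tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def word_score(target_index, candidate_index):
    """Assumed score: for each shared word, add the smaller of its two counts."""
    # Counter & Counter keeps the minimum count of every common key.
    return sum((target_index & candidate_index).values())


def rank_by_words(target_text, candidates):
    """Rank candidates (name -> text) by descending score, ties broken by name."""
    target_index = word_index(target_text)
    scores = {
        name: word_score(target_index, word_index(text))
        for name, text in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ranking = rank_by_words(
        "pressure vessel design standard",
        {
            "lesson": "pressure vessel inspection lesson",
            "training": "software design review",
        },
    )
    for name, score in ranking:
        print(name, score)
```

Because every shared word counts equally, documents that merely reuse common vocabulary can score as well as documents that are truly related.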
BACKGROUND & RELATED WORK
Fig. 2 Generate / Edit Metadata Screen
Fig. 3 Class Diagram for SA_MetaGen
  Class SA_MetaGen_Intf: SA_MetaGen_Intf(), initComponents(), CloseButtonMouseClicked(), ResetButtonMouseClicked(), GenMetaButtonMouseClicked(), viewFilterButtonMouseClicked(), viewStopwordButtonMouseClicked(), viewIndexButtonMouseClicked(), viewThesaurusButtonMouseClicked(), viewWordnetButtonMouseClicked(), viewDocButtonMouseClicked(), exitForm(), getFileExt(), genXMLMeta(), writeTmpFile(), preprocess(), genXMLMatchMeta(), genXMLHeader(), genXMLFooter(), genXMLEmptyElement(), genXMLStartElement(), genXMLStartElement(), genXMLEndElement(), genXMLElement(), genXMLElement(), writeXMLFile(), resetAll(), set_meta_home(), main(), getContentPane()
  Class GenIndex: GenIndex(), getFileExt(), CallPerl(), CallCommand()
  Diagram notes: Generate XML Metadata; Call Perl script to call Swish-e indexer to generate index
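The slides do not show the XML that SA_MetaGen writes, so the following Python sketch only illustrates what a Dublin Core based metadata record with attached index terms might look like. The Dublin Core namespace URI and element names such as `title` and `creator` come from the DCMI element set; the `metadata`, `index`, and `term` element names, the `freq` attribute, and the `build_dc_record` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, Version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields, index_terms):
    """Build an XML metadata record for one document.

    fields: dict mapping Dublin Core element names (e.g. "title") to values.
    index_terms: list of (term, frequency) pairs taken from the document index.
    The <metadata>/<index>/<term> layout is an assumption, not the tool's schema.
    """
    root = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    index = ET.SubElement(root, "index")
    for term, freq in index_terms:
        ET.SubElement(index, "term", freq=str(freq)).text = term
    return root


if __name__ == "__main__":
    record = build_dc_record(
        {"title": "Example Standard", "creator": "Example Organization"},
        [("pressure", 12), ("vessel", 9)],
    )
    print(ET.tostring(record, encoding="unicode"))
```

Keeping the descriptive elements and the index terms in one record means a matcher can read everything it needs about a document from a single file.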
BACKGROUND & RELATED WORK
SA_MetaMatch
  An effective tool for locating similar documents
  However, it does return a large set of unrelated documents.
  The use of single-word index files in matching to find the related documents:
    finds a large number of documents
    slows down the search pace for large documents
APPROACH & UNIQUENESS
Word phrase indexing can play a more significant role in matching documents than single word indexes.
This research explores a phrase-based indexing extension to SA_MetaMatch.
This extension is expected to improve results for NASA NTSP.
APPROACH & UNIQUENESS
The approach taken includes:
  Generating the phrase and word index metadata.
    Naturally, phrase length plays an important role in the indexing and matching process. Heuristically, this work begins with a four-word phrase limit. The approach taken is:
      Beginning based on the position of the word in the document.
      Recursively generating phrases in terms of word position.
      Limiting the phrase length.
      Only matching the top 20 phrases among those whose phrase frequency is greater than 1.
  Adding a phrase weight score mechanism.
    The phrase carries more weight than the raw index. In the end, it can give more specific results than the previous single-word weight score mechanism.
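The steps above can be sketched in Python as follows. The four-word limit, the frequency-greater-than-1 filter, and the top-20 cutoff come from the slides. Everything else is an assumption made for illustration: the tokenizer, the minimum phrase length of two words, the use of a loop in place of recursion, the neglect of sentence boundaries, and the weight values (the slides say only that a phrase carries more weight than a single index word).

```python
import re
from collections import Counter

MAX_PHRASE_LEN = 4   # four-word phrase limit (from the slides)
TOP_N = 20           # only the top 20 phrases are matched (from the slides)
PHRASE_WEIGHT = 3.0  # hypothetical value; phrases outweigh single words
WORD_WEIGHT = 1.0    # hypothetical value


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_index(words, max_len=MAX_PHRASE_LEN, top_n=TOP_N):
    """Count phrases of 2..max_len words starting at each word position,
    keep those occurring more than once, and return the top_n most frequent."""
    counts = Counter()
    for start in range(len(words)):
        for length in range(2, max_len + 1):
            if start + length > len(words):
                break
            counts[" ".join(words[start:start + length])] += 1
    repeated = Counter({p: c for p, c in counts.items() if c > 1})
    return dict(repeated.most_common(top_n))


def match_score(target_text, candidate_text):
    """Assumed score: weighted count of shared words plus shared phrases."""
    t_words, c_words = tokenize(target_text), tokenize(candidate_text)
    shared_words = set(t_words) & set(c_words)
    shared_phrases = set(phrase_index(t_words)) & set(phrase_index(c_words))
    return WORD_WEIGHT * len(shared_words) + PHRASE_WEIGHT * len(shared_phrases)


if __name__ == "__main__":
    target = "pressure vessel design and pressure vessel test"
    same_topic = "pressure vessel rules: pressure vessel limits"
    same_words = "vessel pressure, test pressure, vessel design"
    print(match_score(target, same_topic))
    print(match_score(target, same_words))
```

In this toy example the second candidate shares more single words with the target, yet the first ranks higher because it shares the repeated phrase "pressure vessel". That is the kind of refinement the phrase weight is meant to provide.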
RESULTS AND CONTRIBUTIONS
Fig. 4 Single Word Index Frequency
Fig. 5 Phrase Word Index Frequency
RESULTS & CONTRIBUTIONS
Preliminary results indicate that phrase-based indexing achieves better results than single-word indexing for certain types of documents.
Our results indicate that phrase-based indexing and matching are most beneficial when examining large documents.
The amortized cost of generating the phrase index is justified by the improved matching precision when the target document and search documents are large.
Future work:
  Examining the 4-word phrase heuristic
  Assessing our weighting scheme