1
Metadata Extraction Experiments with DTIC Collections
Department of Computer Science, Old Dominion University
2/25/2005
Work in Progress
Metadata Extraction Project Sponsored by DTIC
2
Outline
2004 Project: problem statement, motivations, objectives, deliverables, tasks
Background: digital libraries and OAI; metadata extraction
System architecture
Experiments: image PDF, normal (text) PDF, color PDF
Status
2005: potential application to the DTIC production ingestion process
3
2004 Project: Problem Statement
Problems of legacy documents:
“…most paper documents, even when represented as images…, [are] in a dangerously inconvenient state: compared with encoded data, they are relatively illegible, unsearchable, and unbrowseable.”
Any information that is difficult to find, access and reuse risks being trapped in a second-class status
-- Henry S. Baird, “Digital Libraries and Document Image Analysis”, ICDAR 2003
4
2004 Project: Problem Statement
Even after OCR, documents are still in a “dangerously inconvenient state”:
The lack of metadata for these resources hampers their discovery and dissemination over the Web. It also hampers interoperability between them and resources from other organizations.
They lack logical structure, which is very useful for better preservation, discovery, and presentation.
5
2004 Project: Motivations for metadata extraction
Using metadata helps resource discovery.
Using metadata in a company intranet may save about $8,200 per employee by reducing the time employees spend searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
Using metadata helps make collections interoperable with OAI-PMH
6
2004 Project: Motivations for metadata extraction
However, creating metadata manually for a large collection is expensive: it would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
Automatic metadata extraction tools are essential to reduce this cost.
7
2004 Project: Motivations - logical structure
Converting a document into XML format with logical structure helps information preservation:
The information in a document remains accessible, and the document can still be presented appropriately, even when the software that opens the original format is no longer available.
Converting a document into XML format with logical structure helps information presentation:
With different XSL stylesheets, an XML document can be presented in different ways and to different devices, such as web browsers and PDAs (a minimal sketch follows this list).
It also supports different access levels: for example, registered users get full access to a document, while guests see only a part of it, such as the introduction.
Converting a document into XML format with logical structure helps information discovery:
It allows retrieval based on logical components, for example searching only within introductions.
It allows special searches, such as equation search.
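To make the XSL point concrete, here is a minimal sketch, using Java's standard JAXP transformation API; the file names (report.xml, browser.xsl, pda.xsl) are hypothetical:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Render the same XML document with different XSL stylesheets.
public class XslRenderSketch {
    static void render(String xmlFile, String xslFile, String outFile) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(xslFile)); // compile the stylesheet
        t.transform(new StreamSource(xmlFile), new StreamResult(outFile));
    }

    public static void main(String[] args) throws Exception {
        render("report.xml", "browser.xsl", "report.html");  // web browser view
        render("report.xml", "pda.xsl", "report-pda.html");  // PDA view
    }
}
```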
8
2004 Project: Approach
Organizations need to move towards:
Effective conversion of existing corpora into a DTD-compliant XML format
Integration of modern authoring tools that publish in a DTD-compliant XML format
9
2004 Project : Objectives
Our objective is to automate the extraction of metadata and basic structure from DTIC PDF documents:
Replace the current manual process for entering incoming documents into the DTIC digital library
Batch process existing large collections for structural metadata
10
2004 Project : Deliverables
A software package, written mostly in Java, that accesses PDF documents on a file system, extracts their metadata, and stores it on a local file system; results are validated against metadata already extracted manually for the selected PDF files.
Viewer software, written in Java, to view and edit the extracted metadata.
11
2004 Project : Deliverables
A technical report presenting the results of the research, plus documentation to help in using the software listed above.
A feasibility report on extracting complex objects (such as figures, equations, references, and tables) from documents and representing them in a DTD-compliant XML format.
Ingestion software, written in Java, to insert extracted metadata into the existing system used by DTIC.
13
Digital Library and OAI
Digital Library (DL)
A DL is a network-accessible and searchable collection of digital information. A DL provides a way to store, organize, preserve, and share information.
Interoperability problem
DLs are usually created separately, using different technologies and different metadata schemas.
14
Open Archives Initiative (OAI)
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework to provide interoperability among heterogeneous DLs.
It is based on metadata harvesting: a service provider can harvest metadata from a data provider (a minimal harvesting sketch follows).
A data provider accepts OAI-PMH requests and provides metadata over the network.
A service provider issues OAI-PMH requests to get metadata and builds services on top of it.
Each data provider can support its own metadata formats, but it has to support at least the Dublin Core (DC) metadata set.
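As an illustration of the harvesting model, a minimal service-provider sketch in Java; the endpoint URL is hypothetical, and any OAI-PMH data provider exposing the mandatory oai_dc format would answer the same request:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Issue a single OAI-PMH ListRecords request and print the XML response.
public class OaiHarvestSketch {
    public static void main(String[] args) throws Exception {
        String baseUrl = "http://example.org/oai"; // hypothetical data provider
        URI request = URI.create(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(request).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // XML-encoded Dublin Core records
    }
}
```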
16
Dublin Core Metadata Set
It defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Relation, Coverage, Rights.
All fields are optional.
http://dublincore.org/documents/dces/
17
Metadata Extraction: Rule-based
Basic idea: use a set of rules, derived from human observation, to define how to extract metadata. For example, a rule may be “the first line is the title” (a minimal sketch follows).
Advantages: straightforward to implement; no training needed.
Disadvantages: lacks adaptability (works only for similar documents); difficult to use with a large number of features; difficult to tune when errors occur, because the rules are usually fixed.
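A minimal sketch of such a rule, for illustration only (not the project's actual implementation): the hard-coded rule “the first line is the title” expressed in Java:

```java
import java.util.List;

// Hard-coded rule: the first non-blank line of the document is the title.
public class FirstLineTitleRule {
    static String extractTitle(List<String> lines) {
        for (String line : lines) {
            if (!line.isBlank()) {
                return line.trim(); // the rule fires here
            }
        }
        return null; // document had no non-blank lines
    }
}
```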
18
Metadata Extraction: Machine-Learning Approach
Learn the relationship between input and output from samples and make predictions for new data
This approach has good adaptability but it has to be trained from samples.
HMM (hidden Markov Model) & SVM (Support Vector Machine)
19
Hidden Markov Model - general
Overview
HMM was introduced by Baum in the late 60s. HMM is a dominant technology for speech recognition, and it is widely used in other areas such as DNA segmentation and gene recognition. HMM has recently been used in information extraction:
Address parsing (Borkar 2001, etc.), name recognition (Klein 2003, etc.), reference parsing (Borkar 2001), metadata extraction (Seymore 1999, Freitag 2000, etc.)
20
Support Vector Machine - general
Overview
SVM was introduced by Vapnik in the late 70s and is now receiving increasing attention. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html
It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).
21
SVM - Metadata Extraction
Basic idea: treat metadata elements as classes; extracting metadata from a document then means classifying each line (or block) into the appropriate class. For example, to extract the document title, classify each line as to whether or not it is part of the title.
Related work: Automatic Document Metadata Extraction Using Support Vector Machines (H. Han, 2003); an overall accuracy of 92.9% was reported.
22
System Architecture
[Architecture diagram: Documents → OCR/Converter → Metadata Extractor → Metadata store (via JDBC) → OAI Layer and Search Engine (with cache) → User Interface, exchanging query/results and request/response with users and harvesters.]
23
System Architecture (cont.)
Main components:
OCR/Converter: commercial OCR software is used to OCR image PDF files; normal PDF files are converted directly to XML files.
Metadata Extractor: extracts metadata using rules and machine-learning techniques; the extracted metadata is stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format.
OAI layer: makes the digital collection interoperable; it accepts OAI requests, gets the information from the database, and encodes the metadata into XML as responses.
Search Engine
24
Metadata Extraction - Rule-based
Expert system approach: build a large rule base using a standard language such as Prolog, and use an existing expert system engine (for example, SWI-Prolog).
Advantages: can use an existing engine.
Disadvantages: building the rule base is time-consuming.
[Diagram: Doc → Parser → Facts → Expert System Engine + Knowledge Base → metadata]
25
Metadata Extraction – Rule-based
Template-based approach: classify documents into classes based on similarity; for each document class, create a template, i.e. a set of rules.
Decoupling rules from code: each template is kept in a separate file.
Advantages: easy to extend (for a new document class, just create a template); rules are simpler; rules can be refined easily (a hypothetical template sketch follows the diagram).
[Diagram: each document (Doc1, Doc2, Doc3) is matched to its class template (template1, template2, …), and the Metadata Extraction module applies that template to produce metadata.]
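For illustration only, a hypothetical in-code rendering of a per-class template; in the real system, templates live in separate files, and the element names and patterns here (including the accession-number pattern) are assumptions:

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical template for one document class: each metadata element
// is mapped to a simple pattern applied to cover-page lines.
public class CoverPageTemplate {
    static final Map<String, Pattern> RULES = Map.of(
        "date",       Pattern.compile("\\b(19|20)\\d{2}\\b"),  // a 4-digit year
        "identifier", Pattern.compile("\\bAD[A-Z]?\\d{6}\\b")  // assumed accession format
    );
}
```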
26
Metadata Extraction – Machine Learning Approach
SVM Feature Extraction (a feature-extraction sketch follows)
Line-level features: line length, number of words in the line, percentage of capitalized words in the line, percentage of possible names in the line, etc.
Word-level features: is an English word, is a person name, is a city name, is a state name, etc.
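A minimal sketch of line-level feature extraction; the feature names follow the slide, while the representation as a plain double[] vector is an assumption:

```java
// Compute line-level features for one line of OCR output.
public class LineFeatures {
    static double[] extract(String line) {
        String[] words = line.trim().isEmpty()
                ? new String[0] : line.trim().split("\\s+");
        int capitalized = 0;
        for (String w : words) {
            if (Character.isUpperCase(w.charAt(0))) capitalized++;
        }
        return new double[] {
            line.length(),                                          // line length
            words.length,                                           // words in the line
            words.length == 0 ? 0.0
                              : (double) capitalized / words.length // % capitalized words
        };
    }
}
```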
[Diagram: learning path: Tagged Doc → Feature Extraction (with knowledge base) → SVM Learner → Models; classifying path: Doc → Feature Extraction → SVM Classifiers (using the Models) → Metadata.]
28
Performance Measure
For an individual metadata element: Precision = a/(a+c), Recall = a/(a+b), Accuracy = (a+d)/(a+b+c+d), with a, b, c, d as in the table below (computed in the sketch after the table).
Overall accuracy is the ratio of the number of data items classified correctly to the total number of data items.

                        Classified
                        In class    Not in class
Original  In class         a             b
          Not in class     c             d
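The three measures, computed directly from the table's four cells (a straightforward sketch; real code would guard against empty denominators):

```java
// Precision, recall, and accuracy from the 2x2 table above.
public class Measures {
    static double precision(int a, int c)              { return (double) a / (a + c); }
    static double recall(int a, int b)                 { return (double) a / (a + b); }
    static double accuracy(int a, int b, int c, int d) { return (double) (a + d) / (a + b + c + d); }
}
```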
29
Experiments with image PDF
OCR: OmniPage, PrimeOCR
Metadata extraction: SVM, template-based
Structure Extraction & Markup
30
Image PDF: OCR
ScanSoft OmniPage Pro 14.0. Features: supports PDF input and standard XML output, and automatically processes newly added documents in its watched folder. We developed a clean-up module to produce standard XML output.
PrimeOCR SDK. Features: supports PDF input but does not support XML output. We wrote code against its API to process documents and output the results in our XML format.
31
Image/metadata extraction: SVM
Objective: evaluate the performance of SVM on different data sets to see how well SVM works for metadata extraction.
Data sets:
Data Set 1: Seymore935. Downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html; 935 manually tagged document headers; the first 500 used for training and the rest for testing.
Data Set 2: DTIC100. 100 PDF files selected from the DTIC website based on the Z39.18 standard; the first pages were OCRed and converted to text format, and the 100 document headers manually tagged; the first 75 used for training and the rest for testing.
32
Image/metadata extraction: SVM (cont.)
Data Set 3: DTIC33. A subset of DTIC100: 33 tagged document headers with identical layout; the first 24 used for training and the rest for testing.
[Diagram: DTIC33 → DTIC100 → Seymore935, in order of increasing heterogeneity.]
34
Image/metadata extraction: SVM (cont.)
[Chart: precision for Title, Creator, Affiliation, and Date on dtic33, dtic100, and seymore935.]
35
Image/metadata extraction: SVM (cont.)
[Chart: recall for Title, Creator, Affiliation, and Date on dtic33, dtic100, and seymore935.]
36
Image/metadata extraction: Template
Objective: evaluate the performance of our rule-based approach, which defines a template for each class.
Experiment:
Use data set DTIC100: 100 XML files with font-size and bold information, divided into 7 classes according to layout information.
For each class, a template was developed after inspecting the first one or two documents in the class; the template was then applied to the remaining documents to obtain performance data.
37
Image/metadata extraction: Template (cont.)
Template (documents)   Element       Precision   Recall
Afrl (5)               Identifier    100%        100%
                       Type          100%        100%
                       Date          100%        100%
                       Title         100%        100%
                       Creator       100%        100%
                       Contributor   100%        100%
Arl (5)                Publisher     100%        100%
                       Identifier    100%        100%
                       Date          100%        100%
                       Title         100%        83.33%
                       Creator       75.00%      100%
Edgewood (4)           Identifier    100%        100%
                       Date          100%        100%
                       Title         100%        83.33%
                       Creator       85.71%      66.67%
38
Image/metadata extraction: Template (cont.)
Template (documents)   Element       Precision   Recall
Nps (15)               Creator       100%        93.33%
                       Date          100%        96.67%
                       Title         100%        86.67%
Usnce (5)              Creator       100%        90.00%
                       Date          100%        100%
                       Title         100%        100%
Afit (6)               Title         100%        100%
                       Creator       100%        100%
                       Contributor   100%        100%
                       Identifier    100%        100%
                       Rights        100%        100%
Text (33)              Title         100%        100%
                       Creator       100%        100%
                       Contributor   100%        100%
                       Date          100%        100%
                       Type          100%        100%
39
Image/metadata extraction: Template (cont.)
We applied this approach to an enlarged data set, extracting metadata from the raw XML files obtained after OCR. We divided the 546 documents into 15 classes and used one template per document class.
Demo2 shows metadata extraction from the 546 documents with 15 classes.
40
Image/metadata extraction: Template (cont.)
Summary of metadata accuracy (demo2): 191 metadata XML files were manually tagged and compared with the metadata extracted by our engine. The results:

Class             Element      Exact match   Partial match
AFIT (29 files)   Title        48.28%        100.0%
                  Creator      3.45%         100.0%
                  Identifier   93.10%        100.0%
AFRL (22 files)   Title        18.18%        90.91%
                  Date         100.0%        100.0%
                  Creator      95.45%        95.45%
                  Identifier   68.18%        100.0%

Comment (AFIT): the creator element picks up extra text on the same line, hence the low exact-match accuracy.
41
Image PDF/Structure Markup
Extract the document's logical structure and represent the document in XML format using hard-coded rules:
Classify lines as subtitles or not (by line length, layout information, etc.).
Group the subtitles into classes based on their features; each class represents a level in the final hierarchical structure (a grouping sketch follows this list).
See the results for 2 PDF files at structure extraction (the metadata part was extracted by the template-based approach).
A mockup example shows how a marked-up document may look in the future.
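A minimal sketch of the grouping step, assuming subtitles are flagged by short length and a leading capital and grouped into levels by font size; both the thresholds and the font-size key are assumptions, since the slides name only “line length, layout information, etc.”:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Flag subtitle candidates and group them into levels by font size.
public class SubtitleGrouper {
    record Line(String text, int fontSize) {}

    static Map<Integer, List<Line>> groupSubtitles(List<Line> lines) {
        // Larger font size sorts first = higher level in the hierarchy.
        Map<Integer, List<Line>> levels = new TreeMap<>(Comparator.reverseOrder());
        for (Line l : lines) {
            String t = l.text().trim();
            boolean shortLine = !t.isEmpty() && t.split("\\s+").length <= 8;
            boolean startsUpper = !t.isEmpty() && Character.isUpperCase(t.charAt(0));
            if (shortLine && startsUpper) {
                levels.computeIfAbsent(l.fontSize(), k -> new ArrayList<>()).add(l);
            }
        }
        return levels;
    }
}
```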
42
Text PDF
Metadata extraction with the expert system approach
Cover page detection: classify whether or not a page is a cover page
SF298 form detection and processing: given a document, find its SF298 form page and extract information from the form
43
Text PDF/metadata extraction: Expert System
Objective: can an expert system be used to recognize documents that do not fall into known categories?
Preliminary experiment: we used Prolog and the SWI-Prolog engine; we wrote rules for title extraction, encoded documents into Prolog data structures, and extracted titles from these documents.
Results: we succeeded in extracting the title from about 85% of randomly selected DTIC documents.
44
Text PDF/metadata extraction: Cover Page detection
We wrote a program to classify whether or not a page is a “cover page” based on rules such as: a cover page contains fewer words and fewer lines; it has fewer words per line; it has some lines in the second half of the page, etc.
Currently we hard-code these rules (a minimal sketch follows).
We tested our code with about 30 DTIC documents (see classification result): all cover pages were found, and the 2 documents with no cover page were identified.
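A minimal sketch of the cover-page test with two of the named rules; the thresholds are assumptions (the slides do not give tuned values), and the “lines in the second half of the page” rule is omitted because it needs layout coordinates:

```java
import java.util.List;

// Rule-based cover-page test: few words overall and few words per line.
public class CoverPageRule {
    static final int MAX_WORDS = 120;             // assumed threshold
    static final double MAX_WORDS_PER_LINE = 8.0; // assumed threshold

    static boolean isCoverPage(List<String> lines) {
        int words = 0;
        for (String l : lines) {
            if (!l.isBlank()) words += l.trim().split("\\s+").length;
        }
        double wordsPerLine = lines.isEmpty() ? 0.0 : (double) words / lines.size();
        return words <= MAX_WORDS && wordsPerLine <= MAX_WORDS_PER_LINE;
    }
}
```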
45
Text PDF/metadata extraction: SF298 form detection and processing
We downloaded about 1000 PDF files from the DTIC collection and wrote code to locate and process SF298 forms:
Locate the SF298 form by matching some special strings (a locating sketch follows this list).
Detect feature changes in the form to determine whether a string is part of a field name or part of a field value.
Convert multi-line field names into a single line; when a field value and a field name are on the same line, separate them.
Find the field name for a string: the field name with minimum distance, positioned above the string.
Demo6 shows the results of SF298 form processing on these 1000 PDF documents:
100% of SF298 pages were located in the 1000 documents (when present); about 95% of the 4 major fields were correctly identified*; about 90% of the up to 27 (maximum) fields were correctly extracted*.
*based on random sample validation
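A minimal sketch of the locating step; the marker strings are assumptions based on the standard form's wording, not the project's actual string list:

```java
import java.util.List;

// Scan page texts for strings that appear on the SF298 form.
public class Sf298Locator {
    static final String[] MARKERS = { "REPORT DOCUMENTATION PAGE", "STANDARD FORM 298" };

    static int findFormPage(List<String> pageTexts) {
        for (int i = 0; i < pageTexts.size(); i++) {
            String page = pageTexts.get(i).toUpperCase();
            for (String marker : MARKERS) {
                if (page.contains(marker)) return i; // page index of the form
            }
        }
        return -1; // no SF298 page present
    }
}
```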
46
Experiments with color PDFs
We downloaded 18 documents and explored both:
OCR with OmniPage to produce RawXML, with our cleanup module applied to produce CleanXML
OCR with PrimeOCR, having modified it to produce CleanXML directly
We applied the template approach, found 8 classes, and extracted the metadata correctly from 17 of the 18 documents.
47
Status
We have gauged the accuracy of individual components of the architecture through experiments on the DTIC collection.
We have implementations for the metadata extraction modules: template-based, SVM, HMM, and a rule-based expert system.
48
Status
We have a feature set for expressing rules and templates for 15 classes of DTIC documents.
We have an automated process: a set of PDF files is manually OCRed into a directory; then, automatically for all files: extract metadata, create an OAI-compliant XML metadata record, and insert it into an OAI-compliant digital library.
49
Status
We have a Java editor for XML records.
We have OCR cleanup modules: input: image PDF, text PDF; OCR output: WordML, Raw XML; cleanup output: CleanXML.
We have a very preliminary prototype structure markup module.
50
Experiment summary

Image PDF:
  OCR: OmniPage XML output (automatic add files); PrimeOCR API -> XML
  SVM: DTIC100, 85%-100%
  Template: DTIC600, 90%-100%*
  Structure: basic, ok

Text PDF:
  Expert system: DTIC10, 85%
  Cover page detection: DTIC30, 100%
  SF298 location: DTIC1000, 100%
  SF298 4 fields: DTIC1000, 95%-97%*#
  SF298 27 fields: DTIC1000, 88%-99%*#

Color PDF:
  Template: DTIC20, 95%

*partial   #based on random sample of 20 to validate
51
Status
We have a knowledge database obtained from analyzing the arc and DTIC collections:
Authors (4 million strings from http://arc.cs.odu.edu)
Organizations (79 from DTIC250)
Universities (52 from DTIC250)
52
Proposal 2005-06
Goal: develop a computer-assisted production process and supporting software to extract metadata from color PDFs and insert records into the DTIC collection.
Assumptions: on the order of 10,000 documents/yr; on the order of 100 types of documents.
53
Proposal 2005-06
Needed additional software:
Analyze the existing DTIC collection and create a knowledge database of metadata (e.g., all known authors, organizations, …)
Create a software environment for humans to interact with system-generated analyses to correct metadata
Create a voting module to predict the need for human intervention
54
Proposal 2005-06: Tasks
Phase 1 (5 months): analyze the DTIC system environment; develop software modules.
Phase 2 (4 months): integrate software to insert records into the DTIC collection; train the learning modules; create an initial set of templates for the rule module.
55
Proposal 2005-06: Tasks
Phase 3 (2 months): observe humans in the actual process and gather data on real production; develop editors for templates and features, to handle new types of documents not handled by the system.
57
Proposal 2005-06: Outcome
We expect to be able to run the system (with developer support) such that only 20% of documents will need human intervention.
We expect that, in cases requiring human intervention, we will reduce the processing time by 80%.
60
Metadata Extraction: Rule-based
Related work:
Automated labeling algorithms for biomedical document images (Kim J, 2003): extracts metadata from the first pages of biomedical journals. Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles used for testing).
Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000): extracts metadata from the U-Wash document corpus of 979 journal pages. Good results for some elements (e.g., page number: 90% recall, 98% precision) but poor results for others (abstract: 35% recall, 90% precision; biography: 80% recall, 35% precision).
61
2004 Project : Tasks
Working with DTIC to identify the set of documents and the metadata of interest.
Developing software for metadata and structure extraction from the selected set of PDF documents
Feasibility study for extracting complex objects and representing the complete document in XML
63
Summary
For image PDF documents, OCR software is very important. To automate the whole process, the OCR software has to either support automatic processing or provide an API. We chose OmniPage and the PrimeOCR SDK based on our research.
Lessons learned: different OCR software may use different formats, so it is desirable to work with an internal format. OCR software may output many formats, some of which are difficult to process; choosing which format to work with is important. Developing a good post-OCR processing tool is also very important: in our experience, bad post-OCR processing can significantly degrade overall performance.
64
Summary (cont.)
For text PDF documents, to avoid OCR errors, we converted them to XML format without OCR. Lesson learned: sometimes reordering the text strings is necessary, especially for SF298 forms.
We showed that SVM is a feasible way to extract metadata by applying it to several data sets. Lesson learned: the less heterogeneous the data set, the better the performance.
65
Summary (cont.)
Our template-based approach achieved highly accurate results while keeping the template for each class very simple. It also provided a way to extract different metadata from different document classes.
Lesson learned: it is challenging to make this scale to a large collection when the number of classes is unknown.
66
Summary (cont.)
Our expert system approach aims to develop more general rules for extracting metadata from a large collection.
Lesson learned: building a rule base for a large heterogeneous collection is time-consuming. A more feasible way is to develop general rules for specific situations (for example, extracting the title from the cover page) and then combine them.
67
Summary (cont.)
We showed that it is feasible to classify whether or not a page is a “cover page” using some simple rules.
We handled a special case, the SF298 form, by detecting it within a document and extracting information from it.
68
Conclusion and Future Work
Our template-based approach can provide high accuracy with a set of simple rules. In the future, we need to:
Develop a tool to classify documents into different classes, or to assign a new document to a class.
For documents that do not belong to a known class, either set them aside for the user to define more templates, or process them with an expert system using general rules.
We believe that integrating the machine-learning approaches will improve the performance of our system.
To improve performance, a feedback loop needs to be implemented so that users can check the results and the system can learn from user actions.
69
[Future architecture diagram: Documents → OCR → Metadata Extractor and Structure Extractor → Repository of Metadata → OAI Layer and Search Engine (query/results, request/response); Complex Objects Recognition and Object Digitization feed a Repository of Digitized Objects; a Reference Processor and an XSL Transform complete the pipeline.]
70
Future Works: Metadata Extraction
[Diagram: Doc → OCR → Doc after OCR, processed in parallel by the template/rule-based module, the SVM classifier, and the HMM (each with its own models); a Merger combines their outputs, and an Interactive Tool lets users correct the merged result, producing tagged text and the final metadata.]
71
Hidden Markov Model - general
“Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series.”
-- Alan B. Poritz, “Hidden Markov Models: A Guided Tour”, ICASSP 1988
An HMM is a probabilistic finite state automaton: it transits from state to state and emits a symbol when visiting each state; the states are hidden.
[Diagram: a four-state automaton (A, B, C, D) with transitions between states.]
72
Hidden Markov Model - general
A Hidden Markov Model consists of:
A set of hidden states (e.g. coin1, coin2, coin3)
A set of observation symbols (e.g. H and T)
Transition probabilities: the probability of moving from one state to another
Emission probabilities: the probability of emitting each symbol in each state
Initial probabilities: the probability of each state being chosen as the first state
73
HMM - Metadata Extraction
A document is a sequence of words produced by some hidden states (title, author, etc.).
The parameters of the HMM are learned from samples in advance.
Metadata extraction then means finding the most probable sequence of states (title, author, etc.) for a given sequence of words (a Viterbi sketch follows the example below).
74
Example: given the word sequence “Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003”, the most probable state sequence labels “Challenges in Building Federation …” as title, “Kurt Maly …” and “Mohammad Zubair” as author, and “2003” as date.
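A minimal Viterbi sketch of that decoding step, assuming the HMM parameters (initial, transition, and emission probabilities) have already been learned and that words are mapped to integer symbol ids; production code would work in log space to avoid underflow:

```java
// Find the most probable state sequence for an observation sequence.
public class ViterbiSketch {
    // init[s]: initial probs; trans[s][t]: transition probs; emit[s][o]: emission probs.
    static int[] decode(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int S = init.length, T = obs.length;
        double[][] v = new double[T][S]; // best score ending in state s at time t
        int[][] back = new int[T][S];    // backpointers
        for (int s = 0; s < S; s++) v[0][s] = init[s] * emit[s][obs[0]];
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < S; s++) {
                for (int p = 0; p < S; p++) {
                    double score = v[t - 1][p] * trans[p][s] * emit[s][obs[t]];
                    if (score > v[t][s]) { v[t][s] = score; back[t][s] = p; }
                }
            }
        }
        int best = 0;
        for (int s = 1; s < S; s++) if (v[T - 1][s] > v[T - 1][best]) best = s;
        int[] states = new int[T];
        states[T - 1] = best;
        for (int t = T - 1; t > 0; t--) states[t - 1] = back[t][states[t]];
        return states; // e.g. title, title, ..., author, author, date
    }
}
```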
75
HMM - Metadata Extraction
Related work: K. Seymore, A. McCallum, and R. Rosenfeld, “Learning hidden Markov model structure for information extraction.” Result: an overall accuracy of 90.1% was reported.
76
Support Vector Machine - general
Binary classifier (classifies data into two classes):
It represents data with pre-defined features.
From the training samples, it finds the hyperplane with the largest margin separating the two classes.
It classifies new data into the two classes based on which side of the hyperplane they fall on.
[Figure: an SVM example classifying lines into two classes, title / not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line (red: title; blue: not title); the separating hyperplane and its margin are shown.]
77
Support Vector Machine - general
Extension to non-linear decision boundary
[Figure: data that are not linearly separable in the input space are mapped into a higher-dimensional feature space, where a linear separator exists.]
78
Support Vector Machine - general
Extension to multiple classes (a one-vs-rest sketch follows):
One-vs-rest: the two classes are “in this class” vs. “not in this class”; positive training samples are the data in the class, negative samples are the rest; requires K binary SVMs (K = the number of classes).
One-vs-one: the two classes are “in class one” vs. “in class two”; positive samples are the data in one class, negative samples are the data in the other; requires K(K-1)/2 binary SVMs.
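A minimal one-vs-rest sketch; the per-class scoring functions stand in for trained binary SVMs (e.g., signed distance to each hyperplane), whose training is outside the scope of this sketch:

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// One-vs-rest: score the feature vector with each class's binary SVM
// and return the index of the highest-scoring class.
public class OneVsRest {
    static int classify(double[] features, List<ToDoubleFunction<double[]>> scorers) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < scorers.size(); k++) {
            double s = scorers.get(k).applyAsDouble(features);
            if (s > bestScore) { bestScore = s; best = k; }
        }
        return best;
    }
}
```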