
1

Metadata Extraction Experiments with DTIC

Collections

Department of Computer Science, Old Dominion University

2/25/2005

Work in Progress

Metadata Extraction Project Sponsored by DTIC

2

Outline

- 2004 Project: problem statement, motivations, objectives, deliverables, tasks
- Background: digital libraries and OAI; metadata extraction
- System architecture
- Experiments: image PDF, normal (text) PDF, color PDF
- Status
- 2005: potential application to the DTIC production ingestion process

3

2004 Project: Problem Statement

Problems of legacy documents:

“…most paper documents, even when represented as images …, [are] in a dangerously inconvenient state: compared with encoded data, they are relatively illegible, unsearchable, and unbrowseable.”

Any information that is difficult to find, access, and reuse risks being trapped in a second-class status.

-- Henry S. Baird, “Digital Libraries and Document Image Analysis”, ICDAR 2003

4

2004 Project: Problem Statement (cont.)

Even after OCR, documents remain in a “dangerously inconvenient state”:

- The lack of metadata for these resources hampers their discovery and dissemination over the Web, and hampers interoperability with resources from other organizations.
- They lack logical structure, which is very useful for better preservation, discovery, and presentation.

5

2004 Project: Motivations for Metadata Extraction

- Using metadata helps resource discovery: using metadata in a company intranet may save about $8,200 per employee by reducing time spent searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
- Using metadata helps make collections interoperable via OAI-PMH.

6

2004 Project: Motivations for Metadata Extraction (cont.)

However, creating metadata manually for a large collection is expensive: it would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).

Automatic metadata extraction tools are essential to reduce this cost.

7

2004 Project: Motivations - Logical Structure

Converting a document into XML format with logical structure helps information preservation:
- Information in a document remains accessible, and the document can still be presented appropriately, even when the software that created it is no longer available.

Converting a document into XML format with logical structure helps information presentation:
- With different XSL stylesheets, an XML document can be presented differently, including to different devices such as web browsers, PDAs, etc. (See the sketch below.)
- It allows different access levels for different users; for example, registered users get full access to a document while guests see only a part of it, such as the introduction.

Converting a document into XML format with logical structure helps information discovery:
- It allows retrieval by logical component, for example searching only in the introduction.
- It allows special searches such as equation search.
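As a small illustration of the XSL point above, the standard javax.xml.transform API can render one XML document differently per stylesheet. The file names here are hypothetical; this is a minimal sketch, not project code:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class RenderReport {
        public static void main(String[] args) throws Exception {
            // Hypothetical stylesheets: one per target device.
            String xsl = (args.length > 0 && args[0].equals("pda"))
                    ? "report-pda.xsl" : "report-web.xsl";
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(xsl));
            // The same structured XML document, presented two different ways.
            t.transform(new StreamSource("report.xml"), new StreamResult("report.html"));
        }
    }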

8

2004 Project: Approach

Organizations need to move towards:
- Effective conversion of existing corpora into a DTD-compliant XML format
- Integration of modern authoring tools into a publication process that produces DTD-compliant XML

9

2004 Project: Objectives

Our objective is to automate the task of extracting metadata and basic structure from DTIC PDF documents:
- Replace the current manual process for entering incoming documents into the DTIC digital library
- Batch-process existing large collections for structural metadata

10

2004 Project: Deliverables

- A software package, written mostly in Java, that accesses PDF documents on a file system, extracts the metadata, and stores it on a local file system; validated against metadata already extracted manually for the selected PDF files.
- Viewer software, written in Java, to view and edit the extracted metadata.

11

2004 Project: Deliverables (cont.)

- A technical report presenting the results of the research, with documentation for using the software listed above.
- A feasibility report on extracting complex objects such as figures, equations, references, and tables from documents and representing them in a DTD-compliant XML format.
- Ingestion software, written in Java, to insert extracted metadata into the existing DTIC system.

12

Background

OAI and Digital Library

Metadata Extraction Approaches

13

Digital Library and OAI

Digital Library (DL):
- A DL is a network-accessible, searchable collection of digital information.
- A DL provides a way to store, organize, preserve, and share information.

Interoperability problem:
- DLs are usually created separately, using different technologies and different metadata schemas.

14

Open Archives Initiative (OAI)

- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework to provide interoperability among heterogeneous DLs.
- It is based on metadata harvesting: a service provider can harvest metadata from a data provider. (See the example request below.)
  - A data provider accepts OAI-PMH requests and provides metadata over the network.
  - A service provider issues OAI-PMH requests to get metadata and builds services on top of it.
- Each data provider can support its own metadata formats, but it must support at least the Dublin Core (DC) metadata set.
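For illustration, a harvester fetches Dublin Core records from a data provider with a request such as the following (the base URL is hypothetical; the verb and metadataPrefix parameters are defined by OAI-PMH):

    http://dl.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc

The data provider answers with an XML document containing one record element per item; the service provider parses these records and builds its services (search, browsing, etc.) over them.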

15

16

Dublin Core Metadata Set

- It defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- All fields are optional. (An example record follows below.)

http://dublincore.org/documents/dces/
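For illustration, a simplified Dublin Core record as it might appear inside an OAI-PMH response; the namespaces are the standard oai_dc ones, and the field values are invented, based on this presentation itself:

    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Metadata Extraction Experiments with DTIC Collections</dc:title>
      <dc:creator>Department of Computer Science, Old Dominion University</dc:creator>
      <dc:date>2005-02-25</dc:date>
      <dc:type>Text</dc:type>
    </oai_dc:dc>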

17

Metadata Extraction: Rule-based

Basic idea:
- Use a set of rules, based on human observation, that define how to extract metadata. For example, a rule may be “The first line is the title”. (A sketch follows below.)

Advantages:
- Straightforward to implement
- No need for training

Disadvantages:
- Lacks adaptability (works only for similar documents)
- Difficult to work with a large number of features
- Difficult to tune the system when errors occur, because rules are usually fixed
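A minimal sketch of how such a fixed rule might look in code; this is an invented example of the idea, not the project's implementation:

    import java.util.List;

    public class TitleRule {
        // Hypothetical fixed rule: "the first non-empty line is the title".
        public static String extractTitle(List<String> lines) {
            for (String line : lines) {
                if (!line.trim().isEmpty()) {
                    return line.trim();   // first non-empty line wins
                }
            }
            return null;                  // rule fails: no candidate found
        }
    }

The brittleness noted above is visible even here: any document whose title is not the first line breaks the rule, and fixing it means changing code.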

18

Metadata Extraction: Machine-Learning Approach

Learns the relationship between input and output from samples, and makes predictions for new data.

This approach has good adaptability, but it must be trained on samples.

Examples: HMM (Hidden Markov Model) and SVM (Support Vector Machine)

19

Hidden Markov Model - General

Overview:
- HMM was introduced by Baum in the late 1960s.
- HMM is the dominant technology for speech recognition.
- It is widely used in other areas such as DNA segmentation and gene recognition.
- HMMs have recently been used in information extraction:
  - Address parsing (Borkar 2001, etc.)
  - Name recognition (Klein 2003, etc.)
  - Reference parsing (Borkar 2001)
  - Metadata extraction (Seymore 1999, Freitag 2000, etc.)

20

Support Vector Machine - General

Overview:
- Introduced by Vapnik in the late 1970s
- Now receiving increasing attention
- Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
- A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html
- Also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003)

21

SVM - Metadata Extraction

Basic idea:
- Classes = metadata elements.
- Extracting metadata from a document = classifying each line (or block) into the appropriate class.
- For example, to extract the document title, classify each line to see whether it is part of the title or not.

Related work:
- Automatic Document Metadata Extraction Using Support Vector Machines (H. Han, 2003); an overall accuracy of 92.9% was reported.

22

System Architecture

[Diagram: Documents -> OCR/Converter -> Metadata Extractor -> Metadata, stored via JDBC in a database. The database serves an OAI Layer (request/response) and a Search Engine with a cache, behind a User Interface (query/results).]

23

System Architecture (cont.)

Main components:
- OCR/Converter: commercial OCR software is used to OCR image PDF files; normal PDF files are converted directly to XML.
- Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format.
- OAI Layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata into XML responses.
- Search Engine

24

Metadata Extraction - Rule-based

Expert system approach:
- Build a large rule base using a standard language such as Prolog
- Use an existing expert system engine (for example, SWI-Prolog)

Advantages:
- Can use an existing engine

Disadvantages:
- Building the rule base is time-consuming

[Diagram: Doc -> Parser -> Facts -> Expert System Engine (consulting a Knowledge Base) -> metadata]

25

Metadata Extraction - Rule-based (cont.)

Template-based approach:
- Classify documents into classes based on similarity
- For each document class, create a template: a set of rules
- Decouple rules from code: each template is kept in a separate file (a hypothetical sketch follows below)

Advantages:
- Easy to extend: for a new document class, just create a template
- Rules are simpler
- Rules can be refined easily

[Diagram: Doc1 -> template1, Doc2 -> template2, Doc3 -> template2; each document and its class template feed the Metadata Extraction module, which outputs metadata.]
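The slides do not show the template format itself; a plausible sketch of a per-class template file, with invented element and attribute names, might look like this:

    <!-- Hypothetical template for one document class, e.g. "nps".
         Every name and attribute below is invented for illustration. -->
    <template class="nps">
      <field name="title"   region="top"  font="largest" maxLines="3"/>
      <field name="creator" after="by"    pattern="[A-Z][a-z]+( [A-Z][a-z]+)+"/>
      <field name="date"    pattern="(19|20)[0-9][0-9]"/>
    </template>

Keeping such rules in a data file, rather than in code, is what makes adding a new document class cheap.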

26

Metadata Extraction - Machine Learning Approach

SVM feature extraction (a sketch follows below):
- Line-level features: line length, number of words in the line, percentage of capitalized words in the line, percentage of possible names in the line, etc.
- Word-level features: is an English word, is a person name, is a city name, is a state name, etc.

[Diagram: for classifying, Doc -> Feature Extraction (consulting a knowledge base) -> SVM Classifiers (using trained Models) -> Metadata; for learning, Tagged Doc -> SVM Learner -> Models.]
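A minimal sketch of the line-level features listed above; the tiny name lexicon stands in for the knowledge base, and the exact feature set is an assumption:

    import java.util.Set;

    public class LineFeatures {
        // Stand-in for the knowledge base of known person names.
        static final Set<String> NAME_LEXICON = Set.of("maly", "zubair", "smith");

        // Compute the line-level features named on the slide.
        public static double[] features(String line) {
            String[] words = line.trim().split("\\s+");
            int capitalized = 0, possibleNames = 0;
            for (String w : words) {
                if (!w.isEmpty() && Character.isUpperCase(w.charAt(0))) capitalized++;
                if (NAME_LEXICON.contains(w.toLowerCase())) possibleNames++;
            }
            return new double[] {
                line.length(),                         // line length
                words.length,                          // number of words in the line
                (double) capitalized / words.length,   // % of capitalized words
                (double) possibleNames / words.length  // % of possible names
            };
        }
    }

These feature vectors are what the SVM learner and classifiers consume.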

27

Experiments

- Image PDF
- Normal (text) PDF
- Color PDF

28

Performance Measure

For an individual metadata element:
- Precision = a/(a+c)
- Recall = a/(a+b)
- Accuracy = (a+d)/(a+b+c+d)

where, in the confusion matrix below, rows are the original labels and columns are the classifier's output:

                              Classified in class    Classified not in class
    Originally in class               a                        b
    Originally not in class           c                        d

For example, if a title classifier labels 8 true title lines correctly (a = 8), misses 2 (b = 2), and wrongly accepts 1 non-title line (c = 1), then precision = 8/9 ≈ 0.89 and recall = 8/10 = 0.80.

Overall accuracy is the ratio of the number of items classified correctly to the total number of items.

29

Experiments with image PDF

- OCR: OmniPage, PrimeOCR
- Metadata extraction: SVM, template-based
- Structure extraction & markup

30

Image PDF: OCR

ScanSoft OmniPage Pro 14.0
- Features: supports PDF input and standard XML output; automatically processes newly added documents in its watched folder.
- We developed a clean-up module to produce standard XML output.

PrimeOCR SDK
- Features: supports PDF input; does not support XML output.
- We wrote code against its API to process documents and output the results in our XML format.

31

Image/metadata extraction: SVM

Objective:
- Evaluate the performance of SVM on different data sets, to see how well SVM works for metadata extraction.

Data sets:
- Data Set 1: Seymore935
  - Downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html
  - 935 manually tagged document headers
  - First 500 used for training, the rest for testing
- Data Set 2: DTIC100
  - 100 PDF files selected from the DTIC website based on the Z39.18 standard
  - First pages OCRed and converted to text format
  - All 100 document headers manually tagged
  - First 75 used for training, the rest for testing

32

Image/metadata extraction: SVM (cont.)

- Data Set 3: DTIC33
  - A subset of DTIC100
  - 33 tagged document headers with identical layout
  - First 24 used for training, the rest for testing

[Diagram: the data sets ordered by heterogeneity, from least to most: DTIC33, DTIC100, Seymore935.]

33

Image/metadata extraction: SVM (cont.)

[Chart: overall accuracy of title, author, affiliation, and date extraction for the three data sets.]

34

Image/metadata extraction: SVM (cont.)

[Chart: precision (0%-100%) of Title, Creator, Affiliation, and Date extraction for the dtic33, dtic100, and seymore935 data sets.]

35

Experiments/Image/metadata: SVM (cont.)

[Chart: recall (0%-100%) of Title, Creator, Affiliation, and Date extraction for the dtic33, dtic100, and seymore935 data sets.]

36

Image/metadata extraction: Template

Objective:
- Evaluate the performance of our rule-based approach of defining a template for each class.

Experiment:
- Used data set DTIC100: 100 XML files with font-size and bold information
- Divided into 7 classes according to layout information
- For each class, a template was developed after inspecting the first one or two documents in the class, then applied to the remaining documents to obtain performance data

37

Image/metadata extraction: Template (cont.)

    Template   Documents   Element       Precision   Recall
    Afrl       5           Identifier    100%        100%
                           Type          100%        100%
                           Date          100%        100%
                           Title         100%        100%
                           Creator       100%        100%
                           Contributor   100%        100%
                           Publisher     100%        100%
    Arl        5           Identifier    100%        100%
                           Date          100%        100%
                           Title         100%        83.33%
                           Creator       75.00%      100%
    Edgewood   4           Identifier    100%        100%
                           Date          100%        100%
                           Title         100%        83.33%
                           Creator       85.71%      66.67%

38

Image/metadata extraction: Template (cont.)

    Template   Documents   Element       Precision   Recall
    Nps        15          Creator       100%        93.33%
                           Date          100%        96.67%
                           Title         100%        86.67%
    Usnce      5           Creator       100%        90.00%
                           Date          100%        100%
                           Title         100%        100%
    Afit       6           Title         100%        100%
                           Creator       100%        100%
                           Contributor   100%        100%
                           Identifier    100%        100%
                           Rights        100%        100%
    Text       33          Title         100%        100%
                           Creator       100%        100%
                           Contributor   100%        100%
                           Date          100%        100%
                           Type          100%        100%

39

Image/metadata extraction: Template (cont.)

We applied this approach to an enlarged data set, extracting metadata from the raw XML files obtained after OCR. We divided the 546 documents into 15 classes and used one template per document class.

Demo2 shows metadata extraction from the 546 documents with 15 classes.

40

Image/metadata extraction: Template (cont.) - Summary of metadata accuracy (demo2)

191 metadata XML files were manually tagged and compared with the metadata extracted by our engine. The results:

    Class             Element      Exact Match   Partial Match   Comments
    AFIT - 29 files   Title        48.28%        100.0%          The creator element picks up
                      Creator      3.45%         100.0%          extra text on the same line,
                      Identifier   93.10%        100.0%          hence the low exact-match score
    AFRL - 22 files   Title        18.18%        90.91%
                      Date         100.0%        100.0%
                      Creator      95.45%        95.45%
                      Identifier   68.18%        100.0%

41

Image PDF/Structure Markup

- Extract the document's logical structure and represent the document in XML using hard-coded rules.
- Classify lines as subtitles or not (by line length, layout information, etc.).
- Group subtitles into classes based on their features; each class represents a level in the final hierarchical structure (a sketch follows below).
- See the results for 2 PDF files at structure extraction (the metadata part was extracted by the template-based approach).
- A mock-up example shows how a marked-up document may look in the future.
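A sketch of the grouping step described above, assuming font size is the feature that distinguishes subtitle levels (a simplification; the slide also mentions other features):

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeSet;

    public class SubtitleLevels {
        // Map each distinct subtitle font size to a hierarchy level:
        // the largest font becomes level 1, the next largest level 2, etc.
        public static Map<Integer, Integer> levels(List<Integer> subtitleFontSizes) {
            TreeSet<Integer> distinct = new TreeSet<>(Comparator.reverseOrder());
            distinct.addAll(subtitleFontSizes);
            Map<Integer, Integer> level = new HashMap<>();
            int rank = 1;
            for (int size : distinct) level.put(size, rank++);
            return level;
        }
    }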

42

Text PDF

- Metadata extraction with the expert system approach
- Cover page detection: classify whether a page is a cover page or not
- SF298 form detection and processing: given a document, find its SF298 form page and extract information from the form

43

Text PDF/metadata extraction: Expert System

Objective:
- Can an expert system be used to recognize documents that do not fall into known categories?

Preliminary experiment:
- We used Prolog and the SWI-Prolog engine.
- We wrote rules for title extraction, encoded documents into Prolog data structures, and extracted titles from these documents.

Results:
- We succeeded in extracting the title from about 85% of randomly selected DTIC documents.

44

Text PDF/metadata extraction: Cover Page Detection

We wrote a program to classify whether a page is a “cover page” based on rules such as:
- A cover page contains fewer words and fewer lines
- Fewer words per line
- Has some lines in the second half of the page, etc.

Currently these rules are hard-coded (a sketch of the idea follows below). We tested the code on about 30 DTIC documents (see classification result):
- All cover pages were found
- 2 documents with no cover page were correctly identified
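A hedged sketch of such hard-coded rules; the thresholds are invented for illustration and are not the project's actual values:

    public class CoverPageRule {
        // Classify a page as a cover page from simple counts.
        // (The "lines in the second half of the page" rule would
        // additionally need layout coordinates, omitted here.)
        public static boolean isCoverPage(String[] lines, int totalWords) {
            if (lines.length == 0) return false;
            int wordsPerLine = totalWords / lines.length;
            boolean fewWordsAndLines = totalWords < 150 && lines.length < 25;
            boolean sparseLines = wordsPerLine < 8;
            return fewWordsAndLines && sparseLines;
        }
    }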

45

Text PDF/metadata extraction: SF298 form detection and processing

We downloaded about 1000 PDF files from the DTIC collection and wrote code to locate and process SF298 forms:
- Locate the SF298 form by matching special strings
- Detect feature changes within the form to determine whether a string is part of a field name or part of a field value
- Convert multi-line field names into a single line
- When a field value and a field name share a line, separate them
- Find the field name for a string: the field name with minimum distance, located above it (see the sketch below)

Demo6 shows the results of SF298 form processing on these 1000 PDF documents:
- 100% of SF298 pages were located (when present)
- About 95% of the 4 major fields were correctly identified*
- About 90% of up to 27 (maximum) fields were correctly extracted*

*Based on random sample validation
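A sketch of the “nearest field name above” association rule; the Box type and coordinate conventions are assumptions for illustration:

    import java.util.List;

    public class FieldMatcher {
        // Hypothetical layout box: (x, y) is the top-left corner; y grows downward.
        static class Box {
            String text; double x, y;
            Box(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
        }

        // Associate a value string with the nearest field name located above it.
        static Box nearestNameAbove(Box value, List<Box> fieldNames) {
            Box best = null;
            double bestDist = Double.MAX_VALUE;
            for (Box name : fieldNames) {
                if (name.y >= value.y) continue;   // keep only names above the value
                double dx = name.x - value.x, dy = name.y - value.y;
                double dist = Math.sqrt(dx * dx + dy * dy);
                if (dist < bestDist) { bestDist = dist; best = name; }
            }
            return best;
        }
    }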

46

Experiments with color PDFs

We downloaded 18 documents and explored both:
- OCR with OmniPage to produce RawXML, then our cleanup module to produce CleanXML
- OCR with PrimeOCR, modified to produce CleanXML directly

We applied the template approach, found 8 classes, and extracted the metadata correctly from 17 of the 18 documents.

47

Status

We have gauged the accuracy of the individual components of the architecture through experiments on the DTIC collection.

We have implementations of the metadata extraction modules:
- Template-based
- SVM
- HMM
- Rule-based expert system

48

Status (cont.)

We have:
- A feature set for expressing rules
- Templates for 15 classes of DTIC documents

We have an automated process:
- Manual step: OCR a set of PDF files into a directory
- Automated for all files: extract metadata; create an OAI-compliant XML metadata record; insert it into an OAI-compliant digital library

49

Status (cont.)

We have a Java editor for XML records.

We have OCR cleanup modules:
- Input: image PDF, text PDF
- OCR output: WordML, raw XML
- Cleanup output: CleanXML

We have a very preliminary prototype of the structure markup module.

50

Experiment summary

    Input       Step                   Data set   Result
    Image PDF   OCR: Omni              -          XML output
                OCR: PrimeOCR          -          API -> XML
                SVM                    DTIC100    85%-100%
                Template               DTIC600    90%-100%*
                Structure              -          basic OK
    Text PDF    Expert system          DTIC10     85%
                Cover page detection   DTIC30     100%
                SF298 location         DTIC1000   100%
                SF298, 4 fields        DTIC1000   95%-97%*#
                SF298, 27 fields       DTIC1000   88%-99%*#
    Color PDF   Template               DTIC20     95%; automatic file addition

    * partial    # based on a random sample of 20 for validation

51

Status (cont.)

We have a knowledge database obtained from analyzing the arc and DTIC collections:
- Authors (4 million strings from http://arc.cs.odu.edu)
- Organizations (79 from DTIC250)
- Universities (52 from DTIC250)

52

Proposal 2005-06

Goal: develop a computer-assisted production process and supporting software to:
- Extract metadata from color PDFs
- Insert records into the DTIC collection

Assumptions:
- On the order of 10,000 documents/year
- On the order of 100 document types

53

Proposal 2005-06: Needed additional software

- Analyze the existing DTIC collection and create a knowledge database of metadata (e.g., all known authors, organizations, ...)
- Create a software environment in which humans interact with system-generated analyses to correct metadata
- Create a voting module to predict the need for human intervention

54

Proposal 2005-06: Tasks

Phase 1 (5 months):
- Analyze the DTIC system environment
- Develop software modules

Phase 2 (4 months):
- Integrate software to insert records into the DTIC collection
- Train learning modules
- Create the initial set of templates for the rule module

55

Proposal 2005-06: Tasks (cont.)

Phase 3 (2 months):
- Observe humans in the actual process and gather data on real production
- Develop editors for templates and features, to handle new document types not covered by the system

56

Proposal 2005-06: Tasks (cont.)

Phase 4 (1 month):
- Monitored production run

57

Proposal 2005-06: Outcome

- We expect to be able to run the system (with developer support) such that only 20% of documents need human intervention.
- We expect that, where human intervention is needed, processing time will be reduced by 80%.

58

59

60

Metadata Extraction: Rule-based

Related work:
- Automated labeling algorithms for biomedical document images (Kim J., 2003)
  - Extracts metadata from the first pages of biomedical journals
  - Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles used for testing)
- Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000)
  - Extracts metadata from the U-Wash document corpus of 979 journal pages
  - Good results for some elements (e.g., page number: 90% recall, 98% precision) but poor results for others (abstract: 35% recall, 90% precision; biography: 80% recall, 35% precision)

61

2004 Project: Tasks

- Work with DTIC to identify the set of documents and the metadata of interest
- Develop software for metadata and structure extraction from the selected set of PDF documents
- Feasibility study for extracting complex objects and representing the complete document in XML

62

End

63

Summary

For image PDF documents, OCR software is very important. To automate the whole process, the OCR software must either support automatic processing or provide an API. We chose OmniPage and the PrimeOCR SDK based on our research.

Lessons learned:
- Different OCR software may use different formats; a desirable approach is to use an internal format.
- OCR software may output many formats, some of which are difficult to process; choosing which format to work with is important.
- Developing a good post-OCR processing tool is very important; in our experience, poor post-OCR processing can significantly degrade overall performance.

64

Summary (cont.)

For text PDF documents, to avoid OCR errors, we converted them to XML format (without OCR). Lesson learned: sometimes reordering the text strings is necessary, especially for SF298 forms.

We showed that SVM is a feasible way to extract metadata by applying it to several data sets. Lesson learned: the less heterogeneous the data set, the better the performance.

65

Summary (cont.)

Our template-based approach achieved high accuracy while keeping each class's template very simple. It also provided a way to extract different metadata from different document classes.

Lesson learned: it is challenging to make this scalable to a large collection when the number of classes is unknown.

66

Summary (cont.)

Our expert system approach aims to develop more general rules for extracting metadata from a large collection.

Lesson learned: building a rule base for a large heterogeneous collection is time-consuming. A more feasible way is to develop general rules for specific situations (for example, extracting the title from the cover page) and then combine them.

67

Summary (cont.)

We showed that it is feasible to classify whether a page is a “cover page” using a few simple rules.

We handled a special case, the SF298 form, by detecting it within a document and extracting information from it.

68

Conclusion and Future Work

Our template-based approach can provide high accuracy with a set of simple rules. In the future, we need to:
- Develop a tool to classify documents into classes, and to assign a new document to a class
- For documents that do not belong to a known class, either set them aside for users to define more templates, or process them with an expert system using general rules

We believe that integrating the machine-learning approaches will further improve the performance of our system.

To improve performance, a feedback loop should be implemented so that users can check the results and the system can learn from user actions.

69

[Diagram: envisioned architecture. Documents -> OCR -> Metadata Extractor -> Repository of Metadata, which serves an OAI Layer (request/response) and a Search Engine (query/results). A Structure Extractor, Complex Object Recognition, a Reference Processor, and Object Digitization feed a Repository of Digitized Objects, presented via XSL Transform.]

70

Future Work: Metadata Extraction

[Diagram: Doc -> OCR -> Doc after OCR, which feeds three extractors in parallel: the rule-based module (driven by a Template), an SVM Classifier (using trained Models), and an HMM (using trained Models). A Merger combines their outputs into the final Metadata. An Interactive Tool produces Tagged Text for training the learners.]

71

Hidden Markov Model - General

“Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series.”
-- Alan B. Poritz, Hidden Markov Models: A Guided Tour, ICASSP 1988

An HMM is a probabilistic finite state automaton:
- It transits from state to state
- It emits a symbol on each state visit
- The states are hidden

[Diagram: four connected states, A, B, C, D.]

Hidden Markov Model - General (cont.)

A Hidden Markov Model consists of:
- A set of hidden states (e.g., coin1, coin2, coin3)
- A set of observation symbols (e.g., H and T)
- Transition probabilities: the probability of moving from one state to another
- Emission probabilities: the probability of emitting each symbol in each state
- Initial probabilities: the probability of each state being chosen as the first state

73

HMM - Metadata Extraction

- A document is a sequence of words produced by some hidden states (title, author, etc.).
- The parameters of the HMM are learned from samples in advance.
- Metadata extraction then finds the most probable sequence of states (title, author, etc.) for a given sequence of words, as in the example and sketch below.

74

Example: given the header

    Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003

the decoder labels each word with its most probable state:

    Challenges in Building Federation ...   -> title
    Kurt Maly, Mohammad Zubair              -> author
    2003                                    -> date
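“Finding the most probable state sequence” is the Viterbi algorithm. A compact sketch over the model parameters defined two slides back; in practice the probabilities come from training, and a real implementation would work in log space to avoid underflow:

    public class Viterbi {
        // pi[s]: initial prob.; a[r][s]: transition prob. r->s; b[s][o]: emission prob.
        // Returns the most probable state sequence for the observations obs.
        public static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
            int n = obs.length, k = pi.length;
            double[][] v = new double[n][k];   // best probability of a path ending in state s at step t
            int[][] back = new int[n][k];      // backpointers for path recovery
            for (int s = 0; s < k; s++) v[0][s] = pi[s] * b[s][obs[0]];
            for (int t = 1; t < n; t++)
                for (int s = 0; s < k; s++)
                    for (int r = 0; r < k; r++) {
                        double p = v[t - 1][r] * a[r][s] * b[s][obs[t]];
                        if (p > v[t][s]) { v[t][s] = p; back[t][s] = r; }
                    }
            int[] path = new int[n];
            for (int s = 1; s < k; s++)        // pick the best final state
                if (v[n - 1][s] > v[n - 1][path[n - 1]]) path[n - 1] = s;
            for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
            return path;
        }
    }

In the header example above, each word is an observation and the decoded states are the element labels (title, author, date).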

75

HMM - Metadata Extraction

Related work:
- K. Seymore, A. McCallum, and R. Rosenfeld, Learning Hidden Markov Model Structure for Information Extraction.
- Result: an overall accuracy of 90.1% was reported.

76

Support Vector Machine - General

A binary classifier (classifies data into two classes):
- It represents data with pre-defined features.
- From the training samples, it finds the hyperplane with the largest margin separating the two classes.
- It classifies new data according to which side of the hyperplane they fall on.

[Figure: an SVM example classifying a line into two classes (title, not title) by two features: font size and line number (1, 2, 3, etc.). Each dot represents a line; red dots are titles, blue dots are not. The separating hyperplane and its margin are shown.]

77

Support Vector Machine - General (cont.)

Extension to a non-linear decision boundary:

[Figure: points in the input space are mapped into a higher-dimensional feature space, where a linear separating hyperplane can be found.]

78

Support Vector Machine - General (cont.)

Extension to multiple classes:
- One-vs-rest: each class gets a classifier (“in this class” vs. “not in this class”); positive training samples are the data in this class, negative samples are the rest; K binary SVMs for K classes.
- One-vs-one: each pair of classes gets a classifier (“in class one” vs. “in class two”); positive samples come from one class, negative samples from the other; K(K-1)/2 binary SVMs.

For example, with K = 4 classes, one-vs-rest trains 4 classifiers while one-vs-one trains 6. A prediction sketch follows below.
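A sketch of the one-vs-rest prediction rule; the per-class decision functions are assumed to come from a trained SVM library:

    import java.util.List;
    import java.util.function.Function;

    public class OneVsRest {
        // K binary decision functions, one per class; a higher score means the
        // input looks more like that class. Prediction picks the argmax class.
        public static int predict(List<Function<double[], Double>> classifiers, double[] x) {
            int best = 0;
            for (int k = 1; k < classifiers.size(); k++) {
                if (classifiers.get(k).apply(x) > classifiers.get(best).apply(x)) best = k;
            }
            return best;
        }
    }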