1 economic growth center digital library a project funded by the andrew w. mellon foundation ann...

1

Economic Growth CenterDigital Librarya project funded by the Andrew W. Mellon Foundation

Ann Green, Steve Citron-Pousty, and Marcia FordSocial Science Research ServicesYale ITS Academic Media & Technology

Sandy Peterson and Julie LindenYale Social Science Libraries and Information Services

Christopher UdryProfessor, Yale Dept of EconomicsDirector, Economic Growth Center and Dept of Economics

2

Goals of the grant:Improving access to statistical resources not born digital

Build a prototype archive of statistics not born digital.

Implement standard digitization practices and emerging metadata standards for statistical tables.

Document the costs and processes of creating a statistical digital library from print.

Build the collection based upon long range digital life cycle requirements.

Present the prototype digital library to the scholarly community for evaluation.

3

Research questions

What effect does online access to the statistical information have on scholarly use of the materials?

Are common digitization practices and standards suited to statistically-intensive documents?

What are the long term preservation requirements of the EGCDL?

What are the costs of producing high quality statistical tables with OCR and editing?

How scalable is this process, for what kinds of collections, and for what purposes?

4

Selecting the EGC

Why Economic Growth Center collection?• extraordinary wealth of statistical material in printed form • condition of paper is poor; preservation concerns• a range of access problems imposed not only by non-digital

physical formats but also by inadequate descriptions of the contents of the publications

Why Mexico?• Faculty connections, library strengths, image quality and

completeness of collection• Publisher: Instituto Nacional de Estadística, Geografía e

Informática (Mexico, INEGI) Anuarios Estatales

5

Digitization process

• Vendor review:• Part one: request for proposals (13 vendors)• Part two: production of samples and extended

proposals (6 vendors)• Part three: final review and budget evaluation• Part four: contract negotiation and processing

• Prepared and shipped 221 volumes in Fall ’03• All materials received back at Yale Feb ’04• Quality assessment period complete March

’04

6

Deliverables received: images and PDFs

• 300 dpi TIFFs of each page of each volume, including cover and back of volume (103,115 TIFF files; 460 gigabytes)Color TIFFs for color pages, black and white TIFFs

for pages with only black and white

• PDF image + text files of each chapter of each volume (5,607 files)Separate files for each subject chapter, front

matter, indices, and back matter

7

Deliverables received: statistical tables in Excel

• Excel tables (16,488) of demographic and economic tables for 1994, 1996, 1998, 2000

• Selected OCR’d statistical tables converted to Excel format

• DSI operators reviewed each table twice• Custom tagging done by DSI improved

ability to extract columns and rows into online database and build XML metadata

8

Quality assessment of PDFs

• Overall assessment was very good; minor problems on a very small number of pages

• Sampled 5% of the PDFs • subjective review of image:

• Tilting, cut off text, illegible characters, bleed through, color evaluation, noise

• Vendor did not produce PDFs exactly to our specification, had to reformat 216 files

9

Quality assessment of Excel tables

Checksums: Wrote script to find tables with ‘Total’

columns Wrote Excel macro to compare column

sums with Total values Excellent numeric transfer of numbers

from printVisual review

10

Automated Dublin Core metadata production for PDF and Excel files

• Defined metadata format for PDF and Excel documents• Dublin Core subset (title, date, format, identifier, source,

coverage, subject)

• Wrote scripts to produce individual metadata records for each PDF chapter and each Excel file• Matched chapter numbers with chapter titles: created

standard matching tables• Generated subject term for each chapter: created standard

tables of chapter titles to topic list• Generated series title list from library online catalog• Other text generated from file name and directory

information

12

First generation interface:Select, view, downloadssrs.yale.edu/egcdl

Features:• Select files by year, state, topic, and/or

type of file (PDF or Excel)• Reconstruct the full volume• Based upon Dublin Core metadata

14

Next generation interface: NESSTAR• Features:

• Search across tables• View tables, select columns and rows• Create graphs and charts• Download and extract

• Based upon the Data Documentation Initiative (DDI) metadata standard for statistical tables

• Uses statistical software similar to SPSS• Nesstar is in use by data archives in Europe,

World Bank, Health Canada; in review by ICPSR and The Roper Center

15

Metadata production and data publishing: Nesstar process

Script creates DDI XML file from Dublin Core record and CSV file from Excel table

Staff publish each pair of DDI XML file and CSV file using Nesstar ‘cube builder’ software

Some editing needed at this stage: Add ‘measure’ (what the table is measuring—

persons, events, pesos, etc) Add column header name

Publish metadata and table data to Nesstar server

20

Evaluation of Nesstar interface

No major advantage over Excel in terms of viewing or manipulating data

Metadata and table can’t be viewed together

Can’t search within specific fields (e.g. state, year, column headers, etc.)

De-contextualization of tables (search results don’t indicate what volumes the tables belong to)

Lack of flexibility in customization

22

Evaluation of Nesstar-produced metadata

Labor-intensive to create and edit; no batch processing

Have to interpret table elements (add column header and measure labels)

Some data lost (textual codes for missing data; footnotes within cells)

DDI records not valid; some deviations from DDI specification

23

Automated DDI production: script-based process

Scripts pulled elements from Excel files and Dublin Core records

No manual editing necessaryDDI records are validMetadata describes table “as is” without

our interpretation imposed on itChallenges with marking up hierarchical

tables in DDI

24

Search/browse UI for PDFs and Excel

Lucene index for keyword searching Includes

text from table titles, column and row labels, and footnotes

Can use accent marks or not in search terms; Lucene returns appropriate results set

27

Transforming paper to digital resources:Generations of table presentation

PaperImages of paperPDF surrogates for paper publicationsExcel files for downloadingNesstar online presentationKeyword search on PDF/Excel interface

28

What we are learning and documenting

• Costs and processes of digitizing paper and building a statistical digital library:• Scanning requirements (TIFFs, PDF/a, etc)• OCR of Spanish text • OCR of numbers into spreadsheets (zoned

scanning)• Quality assessment• Is it less expensive to key in the tables? What

is the ‘tipping point’?

29

Table complexity

Metadata complexity and volume: linked to table complexity

Text legibility: affects OCR accuracy

OCR accuracy: formatting

OCR accuracy: character

Complex

High Poor Low Low 1

2

3

Simple Low Good High High 4

30

Cost comparison of keying in the data

Sent sample tables to two vendors and asked for cost estimates for keying Varied widely – different pricing structures (per

1000 characters vs. per Kb) Both require shipping volumes (or

photocopies) overseas

31

Costs of digitizing Nigeria data

Many volumes have poor quality paper and print mimeographed copies of typewritten pages skewing, bleed through, strikeovers, text cut off

Goals of project Try to find “tipping point” for digitizing collection vs.

digitizing selection of tables Test whether lessons learned from relatively clean

tables apply to lower-quality print materials Determine how these materials would fit into EGCDL UI

Cost estimates Considering TIFF and PDF for all volumes, flagging good

quality volumes for automated conversion to Excel

32


• Digital Life Cycle considerations: • Formats and metadata standards• Long term home for digital assets and the

Rescue Repository• Exactly what assets do we preserve? • Questions about versioning• XML to Excel as potential preservation format

33


• Metadata production: extensive use of automated processes• Dublin core for file level metadata• DDI for table level metadata

• User interface development and Spanish character challenges• Nesstar implementation for aggregate data• Developed own UI for more flexibility, ability to include

PDFs and Excel tables

34


Scholarly use of the EGCDL: Finding and accessing tables

Great advantage in locating data content by the full text of the tables; value in online access to individual tables.

Integrate searching and access into existing resources

Are investments in online analysis and visualization worth it? Are Excel tables what faculty and students want?

35


Production of more tables: Do faculty value a collections based digitization

effort or is there more value in on demand service?

What other countries in the EGC print collection are of interest?

Cost comparisons of building full volume equivalents vs selected tables.

The process can be leveraged to facilitate production for other research and learning projects.

36

Contact information

[email protected]

[email protected]

[email protected]

Project web site:ssrs.yale.edu/egcdl

mailto:[email protected]



1 economic growth center digital library a project funded by the andrew w. mellon foundation ann...

Documents

excel excel tables

xml metadata slide

excel files

economic tables

matter slide

dublin core metadata

statistical digital

roper center slide