1
Economic Growth Center Digital Library
A project funded by the Andrew W. Mellon Foundation
Ann Green, Steve Citron-Pousty, and Marcia Ford
Social Science Research Services, Yale ITS Academic Media & Technology
Sandy Peterson and Julie Linden
Yale Social Science Libraries and Information Services
Christopher Udry
Professor, Yale Dept of Economics
Director, Economic Growth Center and Dept of Economics
2
Goals of the grant: Improving access to statistical resources not born digital
Build a prototype archive of statistics not born digital.
Implement standard digitization practices and emerging metadata standards for statistical tables.
Document the costs and processes of creating a statistical digital library from print.
Build the collection based upon long range digital life cycle requirements.
Present the prototype digital library to the scholarly community for evaluation.
3
Research questions
What effect does online access to the statistical information have on scholarly use of the materials?
Are common digitization practices and standards suited to statistically-intensive documents?
What are the long term preservation requirements of the EGCDL?
What are the costs of producing high quality statistical tables with OCR and editing?
How scalable is this process, for what kinds of collections, and for what purposes?
4
Selecting the EGC
Why the Economic Growth Center collection?
• Extraordinary wealth of statistical material in printed form
• Condition of paper is poor; preservation concerns
• A range of access problems imposed not only by non-digital physical formats but also by inadequate descriptions of the contents of the publications
Why Mexico?
• Faculty connections, library strengths, image quality and completeness of collection
• Publisher: Instituto Nacional de Estadística, Geografía e Informática (Mexico, INEGI) Anuarios Estatales
5
Digitization process
• Vendor review:
  • Part one: request for proposals (13 vendors)
  • Part two: production of samples and extended proposals (6 vendors)
  • Part three: final review and budget evaluation
  • Part four: contract negotiation and processing
• Prepared and shipped 221 volumes in Fall ’03
• All materials received back at Yale Feb ’04
• Quality assessment period complete March ’04
6
Deliverables received: images and PDFs
• 300 dpi TIFFs of each page of each volume, including cover and back of volume (103,115 TIFF files; 460 gigabytes)
  • Color TIFFs for color pages, black and white TIFFs for pages with only black and white
• PDF image + text files of each chapter of each volume (5,607 files)
  • Separate files for each subject chapter, front matter, indices, and back matter
7
Deliverables received: statistical tables in Excel
• Excel tables (16,488) of demographic and economic tables for 1994, 1996, 1998, 2000
• Selected OCR’d statistical tables converted to Excel format
• DSI operators reviewed each table twice
• Custom tagging done by DSI improved ability to extract columns and rows into online database and build XML metadata
8
Quality assessment of PDFs
• Overall assessment was very good; minor problems on a very small number of pages
• Sampled 5% of the PDFs
• Subjective review of image:
  • Tilting, cut-off text, illegible characters, bleed-through, color evaluation, noise
• Vendor did not produce PDFs exactly to our specification; we had to reformat 216 files
9
Quality assessment of Excel tables
Checksums:
• Wrote script to find tables with ‘Total’ columns
• Wrote Excel macro to compare column sums with Total values
• Excellent numeric transfer of numbers from print
Visual review
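The checksum idea above can be sketched in Python. This is only an illustration, not the project's script or Excel macro: it assumes each table row has a 'Total' column that should equal the sum of the other numeric columns in that row, and the column names, sample values, and tolerance are all hypothetical.

```python
from typing import Dict, List


def rows_failing_total_check(
    rows: List[Dict[str, float]],
    total_key: str = "Total",
    tolerance: float = 0.5,
) -> List[int]:
    """Return the indices of rows whose parts do not sum to the Total value.

    Assumption: every column other than `total_key` is a component of the total.
    Rows without a Total column are skipped.
    """
    failures = []
    for index, row in enumerate(rows):
        if total_key not in row:
            continue
        parts = sum(value for key, value in row.items() if key != total_key)
        if abs(parts - row[total_key]) > tolerance:
            failures.append(index)
    return failures


# Hypothetical sample table; the second row is deliberately wrong (98 + 101 = 199).
table = [
    {"Hombres": 120.0, "Mujeres": 130.0, "Total": 250.0},
    {"Hombres": 98.0, "Mujeres": 101.0, "Total": 190.0},
]
print(rows_failing_total_check(table))  # prints [1]
```

Rows flagged this way would still need a visual check against the page image, since a mismatch can come from the OCR or from an error in the printed source.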
10
Automated Dublin Core metadata production for PDF and Excel files
• Defined metadata format for PDF and Excel documents
  • Dublin Core subset (title, date, format, identifier, source, coverage, subject)
• Wrote scripts to produce individual metadata records for each PDF chapter and each Excel file
  • Matched chapter numbers with chapter titles: created standard matching tables
  • Generated subject term for each chapter: created standard tables of chapter titles to topic list
  • Generated series title list from library online catalog
  • Other text generated from file name and directory information
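A minimal sketch of how a record in the Dublin Core subset listed above might be generated from a file path and lookup tables. The directory layout, file naming pattern, lookup tables, and series title below are hypothetical stand-ins, not the project's actual conventions; only the Dublin Core element names and namespace are standard.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core element namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical matching tables: chapter number -> title, title -> topic.
CHAPTER_TITLES = {"03": "Población"}
CHAPTER_SUBJECTS = {"Población": "Demography"}
SERIES_TITLE = "Anuario Estadístico (hypothetical series title)"
FORMATS = {".pdf": "application/pdf", ".xls": "application/vnd.ms-excel"}


def dublin_core_record(path: str) -> ET.Element:
    """Build a record for a path assumed to look like
    <state>/<year>/<state>_<year>_ch<NN>.<ext> (a hypothetical layout)."""
    p = PurePosixPath(path)
    state, year = p.parts[-3], p.parts[-2]
    chapter = p.stem.rsplit("_ch", 1)[1]
    title = CHAPTER_TITLES[chapter]
    fields = [
        ("title", title),
        ("date", year),
        ("format", FORMATS[p.suffix.lower()]),
        ("identifier", p.stem),
        ("source", SERIES_TITLE),
        ("coverage", state),
        ("subject", CHAPTER_SUBJECTS[title]),
    ]
    record = ET.Element("record")
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = dublin_core_record("aguascalientes/1994/aguascalientes_1994_ch03.pdf")
print(ET.tostring(record, encoding="unicode"))
```

Keeping the chapter and subject lookups in plain tables mirrors the "standard matching tables" approach: the script stays generic and the tables carry the collection-specific knowledge.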
11
12
First generation interface: Select, view, download
ssrs.yale.edu/egcdl
Features:
• Select files by year, state, topic, and/or type of file (PDF or Excel)
• Reconstruct the full volume
• Based upon Dublin Core metadata
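As a rough illustration of selection driven by file-level metadata, the sketch below filters records on any combination of year, state, topic, and file type. The `FileRecord` class, its field names, and the sample records are hypothetical; the actual interface's implementation is not described in the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical flattened view of one file's metadata."""
    identifier: str
    year: str
    state: str
    topic: str
    file_type: str  # "PDF" or "Excel"


def select_files(
    records: Iterable[FileRecord],
    year: Optional[str] = None,
    state: Optional[str] = None,
    topic: Optional[str] = None,
    file_type: Optional[str] = None,
) -> List[FileRecord]:
    """Keep records matching every criterion that was supplied (None = any)."""
    wanted = {"year": year, "state": state, "topic": topic, "file_type": file_type}
    return [
        r for r in records
        if all(v is None or getattr(r, k) == v for k, v in wanted.items())
    ]


catalog = [
    FileRecord("ags_1994_ch03", "1994", "Aguascalientes", "Demography", "PDF"),
    FileRecord("ags_1994_ch03_t1", "1994", "Aguascalientes", "Demography", "Excel"),
    FileRecord("oax_1996_ch05", "1996", "Oaxaca", "Education", "PDF"),
]
print([r.identifier for r in select_files(catalog, year="1994", file_type="Excel")])
```

Selecting all PDF records for one state and year, then ordering them by chapter, would be one way to reconstruct a full volume from the same metadata.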
13
14
Next generation interface: NESSTAR
• Features:
  • Search across tables
  • View tables, select columns and rows
  • Create graphs and charts
  • Download and extract
• Based upon the Data Documentation Initiative (DDI) metadata standard for statistical tables
• Uses statistical software similar to SPSS
• Nesstar is in use by data archives in Europe, World Bank, Health Canada; in review by ICPSR and The Roper Center
15
Metadata production and data publishing: Nesstar process
Script creates DDI XML file from Dublin Core record and CSV file from Excel table
Staff publish each pair of DDI XML file and CSV file using Nesstar ‘cube builder’ software
Some editing needed at this stage:
• Add ‘measure’ (what the table is measuring: persons, events, pesos, etc.)
• Add column header name
Publish metadata and table data to Nesstar server
16
17
18
19
20
Evaluation of Nesstar interface
No major advantage over Excel in terms of viewing or manipulating data
Metadata and table can’t be viewed together
Can’t search within specific fields (e.g. state, year, column headers, etc.)
De-contextualization of tables (search results don’t indicate what volumes the tables belong to)
Lack of flexibility in customization
21
22
Evaluation of Nesstar-produced metadata
Labor-intensive to create and edit; no batch processing
Have to interpret table elements (add column header and measure labels)
Some data lost (textual codes for missing data; footnotes within cells)
DDI records not valid; some deviations from DDI specification
23
Automated DDI production: script-based process
Scripts pulled elements from Excel files and Dublin Core records
No manual editing necessary
DDI records are valid
Metadata describes table “as is” without our interpretation imposed on it
Challenges with marking up hierarchical tables in DDI
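A minimal sketch of script-based DDI generation, assuming the table title comes from the Dublin Core record and the column labels are read from the Excel file unchanged. It uses a few element names from the DDI Codebook 2.x vocabulary (`codeBook`, `stdyDscr`, `citation`, `titlStmt`, `titl`, `dataDscr`, `var`, `labl`), but it is a skeleton only: it has not been validated against the DDI schema, it omits the markup DDI provides for aggregate and hierarchical tables, and the `ID` and `name` values are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def ddi_skeleton(table_title: str, column_labels: List[str]) -> ET.Element:
    """Build a bare-bones DDI-style tree describing a table "as is"."""
    code_book = ET.Element("codeBook")

    # Study-level citation: just the table title in this sketch.
    study = ET.SubElement(code_book, "stdyDscr")
    citation = ET.SubElement(study, "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = table_title

    # One variable per column, labelled with the header text exactly as found.
    data = ET.SubElement(code_book, "dataDscr")
    for number, label in enumerate(column_labels, start=1):
        var = ET.SubElement(data, "var", {"ID": f"V{number}", "name": f"col{number}"})
        ET.SubElement(var, "labl").text = label
    return code_book


tree = ddi_skeleton("Población total por sexo", ["Municipio", "Hombres", "Mujeres", "Total"])
print(ET.tostring(tree, encoding="unicode"))
```

Because the labels are copied verbatim, no interpretive step (such as choosing a 'measure') is needed, which is what allows the process to run without manual editing.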
24
Search/browse UI for PDFs and Excel
Lucene index for keyword searching
• Includes text from table titles, column and row labels, and footnotes
Can use accent marks or not in search terms; Lucene returns the appropriate result set
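The accent-insensitive matching described above can be illustrated outside Lucene with Python's standard `unicodedata` module. This is not Lucene code and not the project's configuration; it only shows the general technique of folding accented characters to their base letters on both the query and the indexed text before comparing. The function names are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Lowercase the text and strip combining accent marks (á -> a, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, label: str) -> bool:
    """True if the folded query occurs anywhere in the folded label."""
    return fold_accents(query) in fold_accents(label)


print(matches("poblacion", "Población total por municipio"))  # prints True
print(matches("Población", "POBLACION TOTAL"))                # prints True
```

One side effect of this simple folding is that ñ and n become indistinguishable, which may or may not be desirable for Spanish-language searching.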
25
26
27
Transforming paper to digital resources: Generations of table presentation
• Paper
• Images of paper
• PDF surrogates for paper publications
• Excel files for downloading
• Nesstar online presentation
• Keyword search on PDF/Excel interface
28
What we are learning and documenting
• Costs and processes of digitizing paper and building a statistical digital library:
  • Scanning requirements (TIFFs, PDF/A, etc.)
  • OCR of Spanish text
  • OCR of numbers into spreadsheets (zoned scanning)
  • Quality assessment
  • Is it less expensive to key in the tables? What is the ‘tipping point’?
29
Table complexity
Level        Metadata complexity and volume   Text legibility          OCR accuracy:   OCR accuracy:
             (linked to table complexity)     (affects OCR accuracy)   formatting      character
1 (Complex)  High                             Poor                     Low             Low
2
3
4 (Simple)   Low                              Good                     High            High
30
Cost comparison of keying in the data
Sent sample tables to two vendors and asked for cost estimates for keying
• Estimates varied widely; different pricing structures (per 1000 characters vs. per Kb)
• Both require shipping volumes (or photocopies) overseas
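Quotes priced per 1000 characters and per Kb can be compared by converting both to a common unit. The sketch below assumes one byte per keyed character and 1 Kb = 1024 bytes; the rates and character count are made-up numbers for illustration, not the vendors' actual quotes.

```python
def rate_per_thousand_chars(rate: float, chars_per_unit: int) -> float:
    """Convert a price per pricing unit into a price per 1000 characters."""
    return rate * 1000 / chars_per_unit


def keying_cost(char_count: int, rate: float, chars_per_unit: int) -> float:
    """Estimated cost of keying `char_count` characters at the given rate."""
    return char_count / chars_per_unit * rate


# Hypothetical quotes: vendor A charges per 1000 characters, vendor B per Kb.
sample_chars = 50_000
vendor_a = keying_cost(sample_chars, rate=1.20, chars_per_unit=1000)
vendor_b = keying_cost(sample_chars, rate=1.10, chars_per_unit=1024)
print(f"Vendor A: {vendor_a:.2f}  Vendor B: {vendor_b:.2f}")
print(f"Vendor B per 1000 characters: {rate_per_thousand_chars(1.10, 1024):.4f}")
```

The unit difference alone is small (about 2.4%), so wide variation between estimates would have to come mainly from the rates themselves or from how each vendor counts characters in a table.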
31
Costs of digitizing Nigeria data
Many volumes have poor quality paper and print
• Mimeographed copies of typewritten pages
• Skewing, bleed-through, strikeovers, text cut off
Goals of project
• Try to find “tipping point” for digitizing collection vs. digitizing selection of tables
• Test whether lessons learned from relatively clean tables apply to lower-quality print materials
• Determine how these materials would fit into EGCDL UI
Cost estimates
• Considering TIFF and PDF for all volumes, flagging good quality volumes for automated conversion to Excel
32
What we are learning and documenting
• Digital Life Cycle considerations:
  • Formats and metadata standards
  • Long term home for digital assets and the Rescue Repository
  • Exactly what assets do we preserve?
  • Questions about versioning
  • XML to Excel as potential preservation format
33
What we are learning and documenting
• Metadata production: extensive use of automated processes
  • Dublin Core for file level metadata
  • DDI for table level metadata
• User interface development and Spanish character challenges
  • Nesstar implementation for aggregate data
  • Developed own UI for more flexibility, ability to include PDFs and Excel tables
34
What we are learning and documenting
Scholarly use of the EGCDL: Finding and accessing tables
Great advantage in locating data content by the full text of the tables; value in online access to individual tables.
Integrate searching and access into existing resources
Are investments in online analysis and visualization worth it? Are Excel tables what faculty and students want?
35
What we are learning and documenting
Production of more tables:
Do faculty value a collections-based digitization effort, or is there more value in an on-demand service?
What other countries in the EGC print collection are of interest?
Cost comparisons of building full volume equivalents vs selected tables.
The process can be leveraged to facilitate production for other research and learning projects.
36
Contact information
Project web site: ssrs.yale.edu/egcdl