Post on 11-Jan-2016

Digital Reformatting of Text

Aaron Choate

Digital Library Production Services

The University of Texas Libraries

From last time:

Calculating potential file size (no really… this time we got it!)

file size (bytes) = (height x width x bit-depth x dpi²) / 8 bits per byte
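As a rough sketch of the calculation above, the formula can be written as a small function. It assumes height and width are measured in inches and that the result is the uncompressed size; the function name and the letter-size example values are illustrative and are not from the slides.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed image size in bytes.

    Assumes height and width are in inches; dpi is squared because
    resolution applies to both dimensions.
    """
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


# Illustrative example: an 11 x 8.5 inch page, 24-bit color, 300 dpi.
size = file_size_bytes(11, 8.5, 24, 300)
print(f"{size / 1_000_000:.1f} MB (uncompressed)")
```

Halving the resolution to 150 dpi cuts the size to a quarter, because dpi enters the formula squared.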

Imaging: Benchmarking

Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.

Imaging: Benchmarking

Physical Type, size and presentation

Imaging: Benchmarking

Physical condition
• Darkening pages
• Fading ink
• Stains
• Bleed-through
• Uneven printing
• Fold lines
• Smearing

Imaging: Benchmarking

Document classification
• Simple text / printed line art
  • Distinct-edge-based representation
  • Bitonal?
• Manuscripts
  • Soft-edge-based
  • Grayscale / color
• Mixed material

Imaging: Benchmarking

Medium and support
• Support (paper, clay tablet, etc.)
  • Thin paper? (bleed-through)
• Medium (graphite pencil, inks, etc.)
  • Fading of ink
  • Variations in color or density

Imaging: Benchmarking

Tonal Representation

Imaging: Benchmarking

Color Appearance
• Is color reproduction necessary to the document’s meaning?
• What purpose does the color serve?
• How important is maintaining the color appearance?

Imaging: Benchmarking

Detail
• Printed text
  • Measure the height of the smallest lowercase letter that typifies the item or group of items.
• Manuscripts, line art
  • Measure the finest stroke width that must be represented and characterize the needed level of quality.

Imaging: Benchmarking

QI (Quality Index)
• Defines detail as character height
• Based on the ANSI/AIIM preservation microfilming standard for determining requirements for text legibility
• Defines a range from barely legible through excellent that maps to technical test targets

Imaging: Benchmarking

Line pairs

Excellent = 8 line pairs

Good = 5 line pairs

Marginal = 3.6 line pairs

Barely legible = 3.0 line pairs

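Imaging: Benchmarking

Digital QI

h = height of the smallest character, in mm (0.039 converts mm to inches)

Bitonal (only black pixels)

QI = (dpi x 0.039h) / 3

h = 3QI / (0.039 x dpi)

dpi = 3QI / (0.039h)

Tonal images (grayscale for printed text)

QI = (dpi x 0.039h) / 2

h = 2QI / (0.039 x dpi)

dpi = 2QI / (0.039h)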
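The QI formulas above can be sketched as two helper functions, one solving for QI and one for the required dpi. This assumes h is the character height in millimetres (0.039 being the approximate mm-to-inch factor); the function names and the 1 mm example are illustrative, not part of the slides.

```python
MM_TO_INCH = 0.039  # approximate conversion factor used in the QI formulas


def quality_index(dpi, h_mm, bitonal=True):
    """QI for a given resolution and character height (assumed to be in mm)."""
    divisor = 3 if bitonal else 2  # 3 for bitonal, 2 for tonal (grayscale)
    return dpi * MM_TO_INCH * h_mm / divisor


def required_dpi(qi, h_mm, bitonal=True):
    """Resolution needed to reach a target QI for a given character height."""
    divisor = 3 if bitonal else 2
    return divisor * qi / (MM_TO_INCH * h_mm)


# Illustrative example: 1 mm characters at the "excellent" level (QI = 8).
print(round(required_dpi(8, 1.0, bitonal=True)))   # bitonal capture
print(round(required_dpi(8, 1.0, bitonal=False)))  # grayscale capture
```

For the same target QI and character height, the tonal formula asks for two thirds of the resolution that the bitonal formula does.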

Text Capture

Methods
• Rekeying
• OCR

Accuracy …

Software

• ScanSoft: OmniPage Pro
• ABBYY: FineReader
• Adobe Acrobat
• …
• Prime Recognition: PrimeOCR

Encoding

XML vs SGML

SGML (Standard Generalized Markup Language) is the granddaddy of all markup languages.

XML is a subset of SGML intended to be the format for use on the Internet.

XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused as a result (table structures for layout, clear 1-pixel GIFs, etc.).
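To illustrate the point about structure rather than layout, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element names (letter, sender, date, body) and the sample content are made up for illustration and do not come from any particular DTD or schema.

```python
import xml.etree.ElementTree as ET

# A made-up document: the tags describe what the content is, not how it looks.
doc = """<letter>
  <sender>A. Writer</sender>
  <date when="1901-03-04">March 4, 1901</date>
  <body><p>Dear friend, ...</p></body>
</letter>"""

root = ET.fromstring(doc)
print(root.tag)                      # name of the root element
print(root.find("date").get("when"))  # machine-readable date attribute

# Unlike forgiving HTML browsers, an XML parser rejects markup that is
# not well-formed, such as an unclosed element.
try:
    ET.fromstring("<p>unclosed paragraph")
    well_formed = True
except ET.ParseError:
    well_formed = False
print("well-formed:", well_formed)
```

Because the markup names the parts of the text, software can pull out the sender or the date directly instead of guessing from layout.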

XML: DTDs vs Schemas

XML: TEI

Text Encoding Initiative

• Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.

XML: TEI

Levels of encoding

• Level 1: Fully Automated Conversion and Encoding

• Level 2: Minimal Encoding

• Level 3: Simple Analysis

• Level 4: Basic Content Analysis

• Level 5: Scholarly Encoding Projects

Character sets

Unicode

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Character sets: Unicode

Greek & Coptic
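As a small sketch of "a unique number for every character", the snippet below looks up the code point and official name of a few characters from the Greek and Coptic block (U+0370 to U+03FF) with Python's standard unicodedata module. The helper name describe and the choice of sample characters are illustrative.

```python
import unicodedata


def describe(ch):
    """Return (code point as U+XXXX, Unicode name, UTF-8 bytes in hex)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())


# Sample characters from the Greek and Coptic block, written as escapes
# so the source file itself stays plain ASCII.
samples = ["\u03b1", "\u03a9", "\u03e3"]

for ch in samples:
    code_point, name, utf8_hex = describe(ch)
    print(code_point, name, utf8_hex)
```

The code point is the same on every platform; only the byte encoding (UTF-8 here) is a separate choice made when the text is stored or transmitted.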

Software

• XMetaL
• oXygen
• Cooktop

Software

MetaE
