what publishers need to know about digitization

What publishers need to know about digitization Liza Daly Consultant, Threepress Consulting Inc. http://threepress.org / Thursday, November 13, 2008


DESCRIPTION

Webinar given on November 12, 2008 as part of an O'Reilly Tools of Change series on publishing and technology. More information on Liza Daly and threepress can be found at http://www.threepress.org/

TRANSCRIPT

Page 1: What publishers need to know about digitization

What publishers need to know about digitization

Liza Daly

Consultant, Threepress Consulting Inc.

http://threepress.org/

Thursday, November 13, 2008

Page 2: What publishers need to know about digitization

Software engineer and consultant specializing in web-based publishing applications

Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications

Online reference products for Oxford University Press and Columbia University Press

Current: ebook applications and consulting

Introduction: Liza Daly

Page 3: What publishers need to know about digitization

1. Digitization 101: from scanning to OCR to XML

2. Smart vendor selection

3. A gentle introduction to XML

4. I’ve got digital content: now what?

Introduction: What I’ll cover

Page 4: What publishers need to know about digitization

What we talk about when we talk about digitization

Turning printed content...

...or microfilm archives

...or documents in legacy systems

...into modern digital forms.

(sometimes starting from print is easier)

text

<text>


Page 5: What publishers need to know about digitization

Assume that we’re starting from a print archive.

(If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!)

Digitization 101


Page 6: What publishers need to know about digitization

Scan

From paper to digital images...


Page 7: What publishers need to know about digitization

OCR

...to digital text...


Page 8: What publishers need to know about digitization

XML

...to reusable markup.


Page 9: What publishers need to know about digitization

Digitization 101: Scanning

http://www.flickr.com/photos/heather-dietz/448629362/


Page 10: What publishers need to know about digitization

Digitization 101: Scanning

Scan

http://www.flickr.com/photos/heather-dietz/448629362/


Page 11: What publishers need to know about digitization

Digitization 101: Scanning methods

Destructive scanning

Pages are cut out of the binding and machine-fed into the scanner in batches.

(Imagine a huge office copier.)

The print copies that were scanned are normally destroyed.

Page 12: What publishers need to know about digitization

Non-destructive scanning

Pages kept in their original binding

Manual page-turning

Originals are returned to the source

Primarily for rare or historical works

Digitization 101: Scanning methods

Page 13: What publishers need to know about digitization

High-volume, non-destructive automated scanning also exists.

Digitization 101: Scanning methods

Page 14: What publishers need to know about digitization

Optical Character Recognition

OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors.

Common errors include wordsruntogether or speling mistakes.
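As a rough illustration of dictionary-assisted correction, the Python sketch below repairs the two error types named on this slide. It is not how any particular OCR product works: the tiny `DICTIONARY`, the function names, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical mini-dictionary. A real project would load a full word list
# plus any custom terms (proper names, jargon).
DICTIONARY = {"common", "errors", "include", "words", "run", "together",
              "or", "spelling", "mistakes"}


def split_run_together(token, dictionary):
    """Split a token into dictionary words, or return None if that fails."""
    if token in dictionary:
        return [token]
    for i in range(1, len(token)):
        left = token[:i]
        if left in dictionary:
            rest = split_run_together(token[i:], dictionary)
            if rest:
                return [left] + rest
    return None


def correct_token(token, dictionary):
    """Return a list of corrected (lower-cased) words for one OCR token."""
    word = token.lower()
    if word in dictionary:
        return [word]
    # First, look for a close spelling match in the dictionary.
    close = difflib.get_close_matches(word, sorted(dictionary), n=1, cutoff=0.8)
    if close:
        return [close[0]]
    # Next, check whether several words were run together.
    parts = split_run_together(word, dictionary)
    if parts:
        return parts
    # Otherwise leave the token unchanged for a human to review.
    return [token]


def correct_line(line, dictionary=DICTIONARY):
    words = []
    for token in line.split():
        words.extend(correct_token(token, dictionary))
    return " ".join(words)


print(correct_line("wordsruntogether or speling mistakes"))
# prints "words run together or spelling mistakes"
```

Tokens that match nothing are passed through untouched, which is where a custom dictionary or a human check comes in.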

Digitization 101: OCR

Page 15: What publishers need to know about digitization

OCR quality is sensitive to a number of factors.

Is the document in good condition with clear type?

Is the layout simple or complex?

Is a custom dictionary required for proper names or obscure terms?

Digitization 101: OCR

Page 16: What publishers need to know about digitization

This is easy.


Page 17: What publishers need to know about digitization

This is hard.


Page 18: What publishers need to know about digitization

http://timesmachine.nytimes.com/


Page 19: What publishers need to know about digitization

                 Better OCR          Worse OCR
Layout           Simple text         Multicolumn, sidebars
Vocabulary       Common              Specialized
Source quality   Clean and legible   Damaged, dirty or partial

Digitization 101: OCR

Page 20: What publishers need to know about digitization

Limitations and cautions:

Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries.

Tables and equations aren’t suitable for OCR.

A human check is always advisable.

Digitization 101: OCR

Page 21: What publishers need to know about digitization

If the goal of digitization is to make content findable on the web, the text needs to be correct.


Page 22: What publishers need to know about digitization

Scan the documents to convert them to digital files

Apply OCR to the scans to get computer-ready text

Convert the text into XML
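The last step can be sketched in a few lines of Python. The example assumes the OCR stage has already produced plain text with blank lines between paragraphs. The `<document>` root element and the `text_to_xml` name are made up for illustration, and a real conversion would target a published schema.

```python
import xml.etree.ElementTree as ET


def text_to_xml(ocr_text):
    """Wrap blank-line-separated paragraphs of plain text in <p> elements."""
    root = ET.Element("document")  # hypothetical root element name
    for block in ocr_text.split("\n\n"):
        # Re-join lines that were broken only because of the page width.
        paragraph = " ".join(block.split())
        if paragraph:
            ET.SubElement(root, "p").text = paragraph
    return ET.tostring(root, encoding="unicode")


sample = "This is a paragraph\nbroken across lines.\n\nThis is another paragraph."
print(text_to_xml(sample))
# prints "<document><p>This is a paragraph broken across lines.</p><p>This is another paragraph.</p></document>"
```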


Page 23: What publishers need to know about digitization

Digitization 101: XML

Not all digitization projects end with XML.

Why?


Page 24: What publishers need to know about digitization

[Chart: Characters-per-page versus digitization cost/time. Axis ticks: 1,000, 1,500, 2,000, 3,000+. Series: Machine OCR, Human-checked OCR, XML.]

Page 25: What publishers need to know about digitization

Vendor selection and costs


Page 26: What publishers need to know about digitization

Consider:

Quantity of material

Quality of the originals

Layout complexity

Vocabulary

But also:

Project management

Shipping

Heterogeneous content

Front/back matter & indexes


Page 28: What publishers need to know about digitization

Vendor tips

Send samples before considering any estimate

...and have the output evaluated.

Compare not just cost-per-page but estimated time.

Feel comfortable with their project management.

Check references!
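One simple way to evaluate sample output is to compare it with a hand-checked transcription of the same page. The Python sketch below uses the standard library's `difflib` similarity ratio as a rough stand-in for a formal character error rate. The function name and the sample strings are hypothetical.

```python
import difflib


def character_accuracy(vendor_text, reference_text):
    """Return a 0..1 similarity score between vendor output and a reference."""
    matcher = difflib.SequenceMatcher(
        None, reference_text, vendor_text, autojunk=False
    )
    return matcher.ratio()


# Hypothetical hand-checked text and two vendors' output for the same page.
reference = "Common errors include spelling mistakes."
vendor_a = "Common errors include spelling mistakes."
vendor_b = "Common errors include speling mistakes."

print(character_accuracy(vendor_a, reference))
print(character_accuracy(vendor_b, reference))
```

A perfect sample scores 1.0, and each OCR error lowers the score, so the same sample pages sent to several vendors can be compared on equal terms.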


Page 29: What publishers need to know about digitization

Should you partner?



Page 32: What publishers need to know about digitization

It’s too early to say whether Google Books is right for all publishers.

But you’re certainly giving up:

1. Control

2. Revenue share

3. Ownership


Page 33: What publishers need to know about digitization

Creative partnerships

Consider whether some of your backlist is public domain or can be released under a Creative Commons license.

Page 34: What publishers need to know about digitization

XML 101


Page 35: What publishers need to know about digitization

XML 101: What’s XML?

XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.


Page 36: What publishers need to know about digitization

XML 101: What’s XML?

Text with “markup” is an old idea.

This is a paragraph.¶This is another paragraph.


Page 37: What publishers need to know about digitization

XML 101: What’s XML?

XML just changes the symbols around.

<p>This is a paragraph.</p><p>This is another paragraph.</p>
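To show what the markers buy you, the Python sketch below reads the two paragraphs back out with the standard library's XML parser. The `<doc>` wrapper is an assumption added for the example, because an XML document needs a single top-level element.

```python
import xml.etree.ElementTree as ET

# The slide's two paragraphs, wrapped in a hypothetical <doc> root element.
markup = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"

root = ET.fromstring(markup)
paragraphs = [p.text for p in root.findall("p")]
print(paragraphs)
# prints ['This is a paragraph.', 'This is another paragraph.']
```

Because the paragraph boundaries are explicit, a program can find them without guessing from line breaks or indentation.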


Page 38: What publishers need to know about digitization

XML 101: What’s XML good for?

1. Everybody speaks it.

2. Once you have one kind of XML, it’s easy to turn it into another kind.


Page 39: What publishers need to know about digitization

When you decide to digitize to XML, you’ll need to pick what kind of XML you want.


Page 40: What publishers need to know about digitization

Kinds of XML


Page 41: What publishers need to know about digitization

Kinds of XML

DTD


Page 42: What publishers need to know about digitization

Kinds of XML

DTD Language


Page 43: What publishers need to know about digitization

Kinds of XML

DTD

Format

Language


Page 44: What publishers need to know about digitization

Kinds of XML

DTD

Format

Language

Schema


Page 45: What publishers need to know about digitization

Kinds of XML

DTD

Format

Language

XSD

Schema



Page 47: What publishers need to know about digitization

The schema defines the list of <tags> that appear in a document, and what they mean.

A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
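The difference in vocabulary is also why one kind of XML can be turned into another. The Python sketch below renames tags according to a lookup table. The `TAG_MAP` entries and the `<doc>` and `<article>` element names are hypothetical, and production conversions are more commonly written in XSLT and involve more than renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's tag names to another's.
TAG_MAP = {"doc": "article", "p": "para"}


def rename_tags(xml_text, tag_map=TAG_MAP):
    """Rewrite element names using tag_map, leaving unknown tags unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


converted = rename_tags("<doc><p>This is a paragraph.</p></doc>")
print(converted)
# prints "<article><para>This is a paragraph.</para></article>"
```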

XML 101: Schema vocabulary

Page 48: What publishers need to know about digitization

TEI

DocBook

METS/ALTO

PRISM

ePub

DAISY

Page 49: What publishers need to know about digitization

TEI

DocBook

METS/ALTO

PRISM

ePub

DAISY

XML

Page 50: What publishers need to know about digitization

XML 101: Choosing a schema

Books: DocBook, DAISY, ePub, TEI

Magazines/Newspapers: METS/ALTO, PRISM

Scholarly: TEI, MathML

Page 51: What publishers need to know about digitization

XML 101: DIY schemas

Creating your own schema should be a last resort.

Expensive to build and maintain.

High training and hiring costs.

Reduced opportunities for interoperability.

Regulatory compliance.



Page 53: What publishers need to know about digitization

[Chart: cost ($ to $$$) against schema complexity (Low to High)]

Complex schemas cost more...

...but also provide more opportunity for product development.

Page 54: What publishers need to know about digitization

Now what?


Page 55: What publishers need to know about digitization

Monetizing: XML conversion

XML


Page 56: What publishers need to know about digitization

Monetizing: XML conversion

XML web


Page 57: What publishers need to know about digitization

XML web


Page 58: What publishers need to know about digitization

web XML

Page 59: What publishers need to know about digitization

web UGC

Page 60: What publishers need to know about digitization

Remixing content

XML allows content to be distributed, altered, and recontextualized in unexpected ways.

http://flickr.com/photos/thomashawk/2492298772/

Page 61: What publishers need to know about digitization

Small Beer Press


Page 62: What publishers need to know about digitization

Questions?

Liza Daly

Threepress Consulting Inc.

+01 617 301 [email protected]