TRANSCRIPT

Digitizing Serialized Fiction
Kirk Hess
DH 2013 – July 17, 2013
[email protected]
Serialized Fiction in Farm Newspapers
• Libguide for Serialized Fiction in the Farm, Field and Fireside collection
• "Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser-known writers and even some long-time readers. This publishing model enabled literature to be disseminated to rural communities and expanded the bounds of American literary culture across geographic and socioeconomic lines."
Serialized Fiction in The Farmer's Wife
• The Farmer's Wife was published from 1897–1939; April 1906–April 1939 digitized in FFF
• "Many of the stories could be characterized as romance fiction designed to appeal to farm wives"
• Previously indexed in a practicum project; stored in a spreadsheet (link). Intended as a database with a way to link to existing articles.
Newspaper Digitization
• Select newspaper
• Create page images
  • Microfilmed? If not, film; if the film is bad, fix the film
  • Scan film
  • TIFF image, cropped, deskewed
• Article segmentation
  • Process TIFF to Olive specs
  • OCR text; article/ad/image segmentation
• Load into access system (Olive ActivePaper/Veridian)
Finding Serialized Fiction
Software doesn't make this easy to find:
• No metadata
• OCR problems with newsprint
• Articles span multiple issues, with no links between them
On the other hand…
• The text is there
• The images are there
• The articles are segmented
OCR Issues
• Only administrators can correct text
• A lot of errors, not a lot of people
• Manual process, not easily automatable
• Full text not visible
• Users expect correct text
• Demoed many solutions; coalesced around Omeka (http://omeka.org)
• Moving to Veridian in Fall 2014
Prototype Omeka/Scripto
• http://uller.grainger.illinois.edu/omeka/
• Workflow: http://hpnl.pbworks.com/w/page/53056034/Omeka%20instructions
• PM/Technical Lead (Kirk), 4 part time editors (Olivia, Matt, Shoshana, Carl)
• Completed project in ~ 4 months, 736 serials
Completed Story
• "The Mysterious McCorkles" by F. Roney Weir
• http://uller.grainger.uiuc.edu/omeka/items/show/20
TEI?
• Requires training; full annotations are a manual process, but lite TEI can be automatically generated from corrected text
• Has some advantages for scholars over plain text
• XTF example: http://uller.grainger.uiuc.edu:8080/xtf/search
• More McCorkles: http://uller.grainger.uiuc.edu:8080/xtf/view?docId=tei/TSF00013/TSF00013.xml&chunk.id=AR00300&toc.id=&brand=default
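The "lite TEI generated from corrected text" idea can be sketched in a few lines of stdlib Python. The element layout below is a guess at a minimal TEI shape (title statement plus paragraphs in a body), not the project's actual schema; the title and text are taken from the slides for illustration.

```python
# Sketch: wrap a corrected installment in minimal ("lite") TEI.
# Element choice is an illustrative minimal TEI shape, not the
# project's actual output format.

import xml.etree.ElementTree as ET

def lite_tei(title, author, paragraphs):
    """Build a minimal TEI document from corrected plain text."""
    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for p in paragraphs:
        ET.SubElement(body, "p").text = p
    return ET.tostring(tei, encoding="unicode")

doc = lite_tei("The Mysterious McCorkles", "F. Roney Weir",
               ["CHAPTER I.", "It was a fine morning on the farm."])
print(doc)
```

Because the markup is generated rather than hand-annotated, it needs no TEI training to produce, which is the trade-off the slide points at.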
Beyond the Berry Farm
• How can we prioritize work so the most important text is corrected first?
• Example: http://uller.grainger.uiuc.edu/omeka/items/show/6
  • Words: 2,876; spelling errors: 55; 98% accuracy
• Predictive solutions
• How can we identify serialized fiction without having to find it manually and put it in a spreadsheet?
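The accuracy figure on the previous slide (2,876 words, 55 errors, 98% accuracy) suggests one way to prioritize correction work: estimate per-item OCR accuracy by dictionary lookup and sort items by it. A minimal stdlib sketch, assuming a dictionary-based spell check; the tiny word list here is a stand-in for a real English lexicon:

```python
# Sketch: estimate OCR accuracy as the fraction of tokens found in a
# lexicon, so low-accuracy items can be queued for correction first.
# LEXICON is a toy stand-in; a real run would load a full dictionary.

import re

LEXICON = {"the", "mysterious", "chapter", "one", "to", "be",
           "continued", "berry", "farm"}

def ocr_accuracy(text, lexicon=LEXICON):
    """Fraction of alphabetic tokens present in the lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)

sample = "The mysterlous McCorkles. Chapter One. To be contlnued."
print(ocr_accuracy(sample))
```

Items scoring below some threshold (or high-value serials with low scores) would be pushed to the top of the editors' queue.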
Identifying Serialized Fiction
• Building a feature set
  • Common n-grams
    • Chapter (number/Roman numeral)
    • "To Be Continued"
    • "The End"
  • Topic/genre/theme (romance, children's stories, holidays, etc.)
  • Named entity extraction
  • Predictive solutions (Google API)
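The n-gram features listed above are easy to operationalize as regular expressions over the OCR text. A sketch; the patterns and feature names are illustrative, not the project's actual code:

```python
# Sketch: boolean features for the serial-fiction markers named on the
# slide: chapter headings, "To Be Continued", and "The End".

import re

CHAPTER_RE = re.compile(r"\bCHAPTER\s+([IVXLC]+|\d+)\b", re.IGNORECASE)
CONTINUED_RE = re.compile(r"\bTO BE CONTINUED\b", re.IGNORECASE)
THE_END_RE = re.compile(r"\bTHE END\b", re.IGNORECASE)

def serial_features(text):
    """Features suggesting an article is an installment of a serial."""
    return {
        "has_chapter": bool(CHAPTER_RE.search(text)),
        "to_be_continued": bool(CONTINUED_RE.search(text)),
        "the_end": bool(THE_END_RE.search(text)),
    }

print(serial_features("CHAPTER XIV. ... (To be continued.)"))
# -> {'has_chapter': True, 'to_be_continued': True, 'the_end': False}
```

These features could feed directly into the classifier discussed later, alongside topic and named-entity features.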
Topics
• Topic analysis (Latent Dirichlet Allocation), David Blei et al.
• A document contains a finite number of topics, and each word can be assigned to a topic
• Used MALLET (http://mallet.cs.umass.edu/)
• Example output:
Topic 10: Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread
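MALLET is a Java toolkit driven from the command line, but the technique it implements (LDA via collapsed Gibbs sampling) can be sketched in pure Python. This is a toy: the corpus, topic count, and iteration budget are illustrative, and real corpora need MALLET-scale implementations.

```python
# Toy collapsed Gibbs sampler for LDA: each word token gets a topic
# assignment, resampled conditioned on all other assignments. The
# top words per topic mimic MALLET's topic-keys output.

import random
from collections import Counter

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]  # topic per token
    ndk = [[0] * K for _ in docs]          # doc -> topic counts
    nkw = [Counter() for _ in range(K)]    # topic -> word counts
    nk = [0] * K                           # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove token, resample, re-add
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[w for w, _ in nkw[t].most_common(5)] for t in range(K)]

docs = [["butter", "milk", "corn", "chickens"],
        ["barney", "mercy", "marigold", "anne"],
        ["milk", "butter", "corn", "weather"]]
for t, words in enumerate(lda_gibbs(docs, K=2)):
    print(t, words)
```

With a fixed seed the sampler is deterministic; the farm-vocabulary and character-name documents tend to separate into distinct topics, much like Topic 10 above mixes farm-life words with character names from the serials.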
Network Analysis
• Topics and documents are nodes; document membership in a topic is an edge
• By generating a network graph (Gephi) we can see connections
• By using clustering algorithms, we can see clusters of documents around a topic
• Train a data-mining algorithm?
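Getting document-topic edges into Gephi can be as simple as emitting a Source,Target,Weight edge-list CSV, which Gephi imports directly. A stdlib sketch; the document IDs, topic labels, weights, and threshold below are all illustrative:

```python
# Sketch: turn document-topic proportions (e.g. from MALLET's
# doc-topics output) into a bipartite edge list CSV for Gephi.
# Documents and topics become nodes; proportions become edge weights.

import csv
import io

doc_topics = {  # doc -> {topic: proportion}; illustrative values
    "TSF00013": {"topic10": 0.61, "topic03": 0.22},
    "TSF00020": {"topic10": 0.48, "topic07": 0.35},
    "TSF00006": {"topic07": 0.70},
}

def edge_list_csv(doc_topics, threshold=0.2):
    """Emit Source,Target,Weight rows for edges above the threshold."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    for doc, topics in sorted(doc_topics.items()):
        for topic, weight in sorted(topics.items()):
            if weight >= threshold:
                writer.writerow([doc, topic, weight])
    return buf.getvalue()

print(edge_list_csv(doc_topics))
```

Thresholding keeps weak memberships out of the graph, so the clusters Gephi's layout and modularity algorithms reveal correspond to documents that genuinely share a topic.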
Named Entity Extraction
• Proper names interfere with LSA
• Manually generate a stop-word list? Lots of names to find!
• Programmatically find names: Stanford NLP Named Entity Recognizer
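The project used the Stanford Named Entity Recognizer (a Java tool) for this; as a stdlib stand-in, even a naive capitalization heuristic illustrates the idea of harvesting candidate names for a stop list automatically rather than by hand:

```python
# Sketch: harvest candidate proper names to add to a topic-model stop
# list. This naive heuristic (capitalized, non-sentence-initial tokens)
# is only a stand-in for a real NER tool like Stanford NER.

import re
from collections import Counter

def candidate_names(text):
    """Count capitalized tokens that are not sentence-initial."""
    counts = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        tokens = sentence.split()
        for tok in tokens[1:]:  # skip the sentence-initial word
            word = tok.strip(".,;:'\"()")
            if word.istitle() and len(word) > 1:
                counts[word] += 1
    return counts

text = ("Barney drove the wagon to town. Then Barney and Mercy "
        "churned butter while Marigold fed the chickens.")
print(candidate_names(text).most_common(3))
```

The highest-frequency candidates (here the serial's recurring characters) are exactly the names that would otherwise dominate topics, as Topic 10's Barney, Mercy, Marigold, and Anne show.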
NLTK
• Similar to the movie review sample: small subset of articles, Naïve Bayes classifier using NLTK, top 2,000 words
• >>> classifier.show_most_informative_features(5)
   contains(having) = True          fictio : nonfic = 1.9 : 1.0
   contains(plan) = True            fictio : nonfic = 1.9 : 1.0
   contains(growing) = True         fictio : nonfic = 1.9 : 1.0
   contains(entertaining) = True    fictio : nonfic = 1.9 : 1.0
   contains(home) = True            fictio : nonfic = 1.9 : 1.0
• High accuracy (> 0.95) but weak ratios
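The NLTK movie-review recipe referenced above pairs bag-of-words "contains(word)" features with a Naïve Bayes classifier. A self-contained stdlib sketch of the same technique; the toy corpus, labels, and smoothing are illustrative, not the project's data:

```python
# Sketch of the NLTK movie-review recipe in plain Python: Bernoulli
# Naive Bayes over "contains(word)" features, with add-one smoothing.

import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (word_set, label). Returns (priors, cond)."""
    label_counts = Counter(lbl for _, lbl in labeled_docs)
    vocab = set().union(*(words for words, _ in labeled_docs))
    word_counts = defaultdict(Counter)  # label -> word -> doc frequency
    for words, lbl in labeled_docs:
        for w in words:
            word_counts[lbl][w] += 1
    priors = {lbl: n / len(labeled_docs) for lbl, n in label_counts.items()}
    cond = {lbl: {w: (word_counts[lbl][w] + 1) / (label_counts[lbl] + 2)
                  for w in vocab} for lbl in label_counts}
    return priors, cond

def classify(words, priors, cond):
    """Pick the label maximizing log P(label) + sum of feature log-probs."""
    scores = {}
    for lbl in priors:
        s = math.log(priors[lbl])
        for w, p in cond[lbl].items():
            s += math.log(p if w in words else 1 - p)
        scores[lbl] = s
    return max(scores, key=scores.get)

train = [({"chapter", "continued", "romance"}, "fiction"),
         ({"chapter", "the", "end"}, "fiction"),
         ({"butter", "prices", "market"}, "nonfiction"),
         ({"weather", "corn", "market"}, "nonfiction")]
priors, cond = train_nb(train)
print(classify({"chapter", "romance"}, priors, cond))  # -> fiction
```

The "weak ratios" observation makes sense in this framing: with a small training set, even the most informative features only shift the odds modestly (1.9:1), yet many weak features together can still yield high overall accuracy.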
Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Direct access to index (Solr)
• Continue NLP research using the NLTK toolkit