TRANSCRIPT

Digitizing Serialized Fiction
Kirk Hess
DH 2013 – July 17, 2013
[email protected]
Serialized Fiction in Farm Newspapers
• Libguide for Serialized Fiction in the Farm, Field and Fireside collection
• "Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser-known writers and even some long-time readers. This publishing model enabled literature to be disseminated to rural communities and expanded the bounds of American literary culture across geographic and socioeconomic lines."
Serialized Fiction in The Farmer's Wife
• The Farmer's Wife was published from 1897–1939; April 1906–April 1939 digitized in FFF
• "Many of the stories could be characterized as romance fiction designed to appeal to farm wives"
• Previously indexed in a practicum project; stored in a spreadsheet (link). Intended as a database with a way to link to existing articles.
Newspaper Digitization
• Select newspaper
• Create page images
  • Microfilmed? If not, film; if the film is bad, fix the film
  • Scan film
  • TIFF image, cropped, deskewed
• Article segmentation
  • Process TIFF to Olive specs
  • OCR text; article/ad/image segmentation
• Load into access system (Olive ActivePaper/Veridian)
Finding Serialized Fiction
Software doesn't make this easy to find:
• No metadata
• OCR problems with newsprint
• Articles span multiple issues, with no links between them
On the other hand…
• The text is there
• The images are there
• The articles are segmented
OCR Issues
• Only administrators can correct text
• A lot of errors, not a lot of people
• Manual process, not easily automatable
• Full text not visible
• Users expect correct text
• Demoed many solutions; coalesced around Omeka (http://omeka.org)
• Moving to Veridian in Fall 2014
Prototype Omeka/Scripto
• http://uller.grainger.illinois.edu/omeka/
• Workflow: http://hpnl.pbworks.com/w/page/53056034/Omeka%20instructions
• PM/Technical Lead (Kirk), 4 part time editors (Olivia, Matt, Shoshana, Carl)
• Completed project in ~ 4 months, 736 serials
Completed Story
• "The Mysterious McCorkles" by F. Roney Weir
• http://uller.grainger.uiuc.edu/omeka/items/show/20
TEI?
• Requires training; full annotations are a manual process, but lite TEI can be automatically generated from corrected text
• Has some advantages for scholars over plain text
• XTF example: http://uller.grainger.uiuc.edu:8080/xtf/search
• More McCorkles: http://uller.grainger.uiuc.edu:8080/xtf/view?docId=tei/TSF00013/TSF00013.xml&chunk.id=AR00300&toc.id=&brand=default
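The "lite TEI generated from corrected text" idea can be sketched in a few lines of stdlib Python. The element layout below is a guess at a minimal TEI shape (title statement plus paragraphs in a body), not the project's actual schema; the title and text are taken from the slides for illustration.

```python
# Sketch: wrap a corrected installment in minimal ("lite") TEI.
# Element choice is an illustrative minimal TEI shape, not the
# project's actual output format.

import xml.etree.ElementTree as ET

def lite_tei(title, author, paragraphs):
    """Build a minimal TEI document from corrected plain text."""
    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for p in paragraphs:
        ET.SubElement(body, "p").text = p
    return ET.tostring(tei, encoding="unicode")

doc = lite_tei("The Mysterious McCorkles", "F. Roney Weir",
               ["CHAPTER I.", "It was a fine morning on the farm."])
print(doc)
```

Because the markup is generated rather than hand-annotated, it needs no TEI training to produce, which is the trade-off the slide points at.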
Beyond the Berry Farm
• How can we prioritize work so the most important text is corrected first?
• Example: http://uller.grainger.uiuc.edu/omeka/items/show/6
  • Words: 2,876; spelling errors: 55; 98% accuracy
• Predictive solutions
• How can we identify serialized fiction without having to find it manually and put it in a spreadsheet?
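The accuracy figure on the previous slide (2,876 words, 55 errors, 98% accuracy) suggests one way to prioritize correction work: estimate per-item OCR accuracy by dictionary lookup and sort items by it. A minimal stdlib sketch, assuming a dictionary-based spell check; the tiny word list here is a stand-in for a real English lexicon:

```python
# Sketch: estimate OCR accuracy as the fraction of tokens found in a
# lexicon, so low-accuracy items can be queued for correction first.
# LEXICON is a toy stand-in; a real run would load a full dictionary.

import re

LEXICON = {"the", "mysterious", "chapter", "one", "to", "be",
           "continued", "berry", "farm"}

def ocr_accuracy(text, lexicon=LEXICON):
    """Fraction of alphabetic tokens present in the lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)

sample = "The mysterlous McCorkles. Chapter One. To be contlnued."
print(ocr_accuracy(sample))
```

Items scoring below some threshold (or high-value serials with low scores) would be pushed to the top of the editors' queue.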
Identifying Serialized Fiction
• Building a feature set
  • Common n-grams
    • Chapter (number/Roman numeral)
    • "To Be Continued"
    • "The End"
  • Topic/genre/theme (romance, children's stories, holidays, etc.)
  • Named entity extraction
  • Predictive solutions (Google API)
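The n-gram features listed above are easy to operationalize as regular expressions over the OCR text. A sketch; the patterns and feature names are illustrative, not the project's actual code:

```python
# Sketch: boolean features for the serial-fiction markers named on the
# slide: chapter headings, "To Be Continued", and "The End".

import re

CHAPTER_RE = re.compile(r"\bCHAPTER\s+([IVXLC]+|\d+)\b", re.IGNORECASE)
CONTINUED_RE = re.compile(r"\bTO BE CONTINUED\b", re.IGNORECASE)
THE_END_RE = re.compile(r"\bTHE END\b", re.IGNORECASE)

def serial_features(text):
    """Features suggesting an article is an installment of a serial."""
    return {
        "has_chapter": bool(CHAPTER_RE.search(text)),
        "to_be_continued": bool(CONTINUED_RE.search(text)),
        "the_end": bool(THE_END_RE.search(text)),
    }

print(serial_features("CHAPTER XIV. ... (To be continued.)"))
# -> {'has_chapter': True, 'to_be_continued': True, 'the_end': False}
```

These features could feed directly into the classifier discussed later, alongside topic and named-entity features.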
Topics
• Topic analysis (Latent Dirichlet Allocation), David Blei et al.
• A document contains a finite number of topics, and each word can be assigned to a topic
• Used MALLET (http://mallet.cs.umass.edu/)
• Example output:
Topic 10: Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread
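MALLET is a Java toolkit driven from the command line, but the technique it implements (LDA via collapsed Gibbs sampling) can be sketched in pure Python. This is a toy: the corpus, topic count, and iteration budget are illustrative, and real corpora need MALLET-scale implementations.

```python
# Toy collapsed Gibbs sampler for LDA: each word token gets a topic
# assignment, resampled conditioned on all other assignments. The
# top words per topic mimic MALLET's topic-keys output.

import random
from collections import Counter

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]  # topic per token
    ndk = [[0] * K for _ in docs]          # doc -> topic counts
    nkw = [Counter() for _ in range(K)]    # topic -> word counts
    nk = [0] * K                           # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove token, resample, re-add
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[w for w, _ in nkw[t].most_common(5)] for t in range(K)]

docs = [["butter", "milk", "corn", "chickens"],
        ["barney", "mercy", "marigold", "anne"],
        ["milk", "butter", "corn", "weather"]]
for t, words in enumerate(lda_gibbs(docs, K=2)):
    print(t, words)
```

With a fixed seed the sampler is deterministic; the farm-vocabulary and character-name documents tend to separate into distinct topics, much like Topic 10 above mixes farm-life words with character names from the serials.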
Network Analysis
• Topics and documents are nodes; document membership in a topic is an edge
• By generating a network graph (Gephi) we can see connections
• By using clustering algorithms, we can see clusters of documents around a topic
• Train a data-mining algorithm?
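Getting document-topic edges into Gephi can be as simple as emitting a Source,Target,Weight edge-list CSV, which Gephi imports directly. A stdlib sketch; the document IDs, topic labels, weights, and threshold below are all illustrative:

```python
# Sketch: turn document-topic proportions (e.g. from MALLET's
# doc-topics output) into a bipartite edge list CSV for Gephi.
# Documents and topics become nodes; proportions become edge weights.

import csv
import io

doc_topics = {  # doc -> {topic: proportion}; illustrative values
    "TSF00013": {"topic10": 0.61, "topic03": 0.22},
    "TSF00020": {"topic10": 0.48, "topic07": 0.35},
    "TSF00006": {"topic07": 0.70},
}

def edge_list_csv(doc_topics, threshold=0.2):
    """Emit Source,Target,Weight rows for edges above the threshold."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    for doc, topics in sorted(doc_topics.items()):
        for topic, weight in sorted(topics.items()):
            if weight >= threshold:
                writer.writerow([doc, topic, weight])
    return buf.getvalue()

print(edge_list_csv(doc_topics))
```

Thresholding keeps weak memberships out of the graph, so the clusters Gephi's layout and modularity algorithms reveal correspond to documents that genuinely share a topic.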
Named Entity Extraction
• Proper names interfere with LSA
• Manually generate a stop-word list? Lots of names to find!
• Programmatically find names: Stanford NLP Named Entity Recognizer
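The project used the Stanford Named Entity Recognizer (a Java tool) for this; as a stdlib stand-in, even a naive capitalization heuristic illustrates the idea of harvesting candidate names for a stop list automatically rather than by hand:

```python
# Sketch: harvest candidate proper names to add to a topic-model stop
# list. This naive heuristic (capitalized, non-sentence-initial tokens)
# is only a stand-in for a real NER tool like Stanford NER.

import re
from collections import Counter

def candidate_names(text):
    """Count capitalized tokens that are not sentence-initial."""
    counts = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        tokens = sentence.split()
        for tok in tokens[1:]:  # skip the sentence-initial word
            word = tok.strip(".,;:'\"()")
            if word.istitle() and len(word) > 1:
                counts[word] += 1
    return counts

text = ("Barney drove the wagon to town. Then Barney and Mercy "
        "churned butter while Marigold fed the chickens.")
print(candidate_names(text).most_common(3))
```

The highest-frequency candidates (here the serial's recurring characters) are exactly the names that would otherwise dominate topics, as Topic 10's Barney, Mercy, Marigold, and Anne show.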
NLTK
• Similar to the movie review sample: small subset of articles, Naïve Bayes classifier using NLTK, top 2,000 words
• >>> classifier.show_most_informative_features(5)
   contains(having) = True          fictio : nonfic = 1.9 : 1.0
   contains(plan) = True            fictio : nonfic = 1.9 : 1.0
   contains(growing) = True         fictio : nonfic = 1.9 : 1.0
   contains(entertaining) = True    fictio : nonfic = 1.9 : 1.0
   contains(home) = True            fictio : nonfic = 1.9 : 1.0
• High accuracy (> 0.95) but weak ratios
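The NLTK movie-review recipe referenced above pairs bag-of-words "contains(word)" features with a Naïve Bayes classifier. A self-contained stdlib sketch of the same technique; the toy corpus, labels, and smoothing are illustrative, not the project's data:

```python
# Sketch of the NLTK movie-review recipe in plain Python: Bernoulli
# Naive Bayes over "contains(word)" features, with add-one smoothing.

import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (word_set, label). Returns (priors, cond)."""
    label_counts = Counter(lbl for _, lbl in labeled_docs)
    vocab = set().union(*(words for words, _ in labeled_docs))
    word_counts = defaultdict(Counter)  # label -> word -> doc frequency
    for words, lbl in labeled_docs:
        for w in words:
            word_counts[lbl][w] += 1
    priors = {lbl: n / len(labeled_docs) for lbl, n in label_counts.items()}
    cond = {lbl: {w: (word_counts[lbl][w] + 1) / (label_counts[lbl] + 2)
                  for w in vocab} for lbl in label_counts}
    return priors, cond

def classify(words, priors, cond):
    """Pick the label maximizing log P(label) + sum of feature log-probs."""
    scores = {}
    for lbl in priors:
        s = math.log(priors[lbl])
        for w, p in cond[lbl].items():
            s += math.log(p if w in words else 1 - p)
        scores[lbl] = s
    return max(scores, key=scores.get)

train = [({"chapter", "continued", "romance"}, "fiction"),
         ({"chapter", "the", "end"}, "fiction"),
         ({"butter", "prices", "market"}, "nonfiction"),
         ({"weather", "corn", "market"}, "nonfiction")]
priors, cond = train_nb(train)
print(classify({"chapter", "romance"}, priors, cond))  # -> fiction
```

The "weak ratios" observation makes sense in this framing: with a small training set, even the most informative features only shift the odds modestly (1.9:1), yet many weak features together can still yield high overall accuracy.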
Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Direct access to index (Solr)
• Continue NLP research using the NLTK toolkit