On the Books: Jim Crow and Algorithms of Resistance
TRANSCRIPT
Amanda Henley, Head of Digital Research Services, on behalf of the On the Books Project Team
Student Workers: Montana Eck, Julia Long, Ashley Mullikin, Siri Nallaparaju, Tim Oyeleke, and Jenna Patton
Project Team
Neil Byers, Graduate Assistant - Documentation and Content Developer
Lorin Bruckner, Data Visualization Services Librarian - Text Analysis and Visualization Expert
Sarah Carrier, North Carolina Research and Instructional Librarian - Special Collections Expert
Rucha Dalwadi, Research Assistant - Documentation and Content Developer
James Dick, Graduate Assistant (& Attorney) - Law review and QA/QC
María R. Estorino, AUL for Special Collections & Director of the Wilson Library - Executive Sponsor and Liaison to the Library Leadership Team
Grant Glass, Graduate Assistant - Text Analysis workflow
Amanda Henley, Head of Digital Research Services - PI and Project Lead
Hannah Jacobs, Graduate Assistant – Content Developer
Matt Jansen, Data Analyst - Text Analysis Expert and Statistician
Steve Segedy, Applications Analyst – Web developer
William Sturkey, Faculty Member of History - Disciplinary Scholar
Kimber Thomas, African American Studies Scholar
Nathan Kelber, Ithaka – Collaborator (former PI and Project Lead)
Funding
About On the Books
Project to make North Carolina legal history accessible as a text corpus.
100+ years of North Carolina public, private, and local session laws
Project Goals:
- Create a corpus of NC Session Laws from 1865/66 to 1967
- Identify discoverable NC segregation statutes during the Jim Crow era using text analysis
Motivated by a reference question:
Where do I find a list of NC Jim Crow laws?
Workflow & Processes
For creating Collection as Data
• Compile Volume List
• Download Images from Internet Archive
• Preprocess Images
  • Identify location of marginalia and paratextual information
  • Rotate as needed
  • Crop image to main text body
  • Add color-matched borders
  • Adjust images to optimize OCR
• OCR over 80,000 Images
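The crop and border steps above can be illustrated with a minimal sketch. It is not the project's code: it assumes a page is a small grid of grayscale values, that the crop box is already known, and that "color-matched" means filling the border with the median value of the cropped image's edge pixels. The function names `crop`, `edge_color`, and `add_border` are hypothetical.

```python
from statistics import median

def crop(image, top, left, bottom, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]

def edge_color(image):
    """Median of the outermost pixels (assumed meaning of 'color-matched')."""
    edge = (image[0] + image[-1]
            + [row[0] for row in image]
            + [row[-1] for row in image])
    return int(median(edge))

def add_border(image, width):
    """Pad the image on all four sides with its own edge color."""
    fill = edge_color(image)
    row_len = len(image[0]) + 2 * width
    padded = [[fill] * width + row + [fill] * width for row in image]
    blank_rows = [[fill] * row_len for _ in range(width)]
    return blank_rows + padded + [r[:] for r in blank_rows]

# Toy page: column 0 stands in for a marginal note (value 90),
# the rest is light paper (240) with a few dark text pixels (10).
page = [
    [90, 240, 240, 240, 240, 240],
    [90, 240,  10, 240, 240, 240],
    [90, 240, 240,  10, 240, 240],
    [90, 240, 240, 240, 240, 240],
]

body = crop(page, 0, 1, 4, 6)   # drop the marginal column
bordered = add_border(body, 1)  # one-pixel border in the paper color
```

On real scans the same idea would be applied with an imaging library, and the crop box would come from the step that locates marginalia.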
Marginalia and paratextual information were removed.
Unit of analysis is individual laws
Used pattern matching to split laws
Extensive post-split cleanup
Results:
• 53,218 chapters
• 297,000 sections
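As a rough illustration of splitting laws by pattern matching, the sketch below uses regular expressions on a made-up sample. The heading formats ("CHAPTER 12" on its own line, sections opening with "Section 1." or "Sec. 2.") are assumptions for the example, not the project's actual patterns, and `split_on` is a hypothetical helper.

```python
import re

# Assumed heading formats; real OCR text would need looser patterns and cleanup.
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+(\d+)[ \t]*$", re.MULTILINE)
SECTION_RE = re.compile(r"^(?:Section|Sec\.)[ \t]+(\d+)\.", re.MULTILINE)

def split_on(pattern, text):
    """Return (number, body) pairs, one per heading matched by `pattern`.

    Text before the first heading (such as a chapter title) is dropped.
    """
    matches = list(pattern.finditer(text))
    pieces = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pieces.append((int(m.group(1)), text[m.end():end].strip()))
    return pieces

# Invented sample text, not taken from the corpus.
sample = """CHAPTER 1
An act to incorporate a hypothetical town.
Section 1. The town is hereby incorporated.
Sec. 2. This act shall be in force from its ratification.
CHAPTER 2
An act concerning a hypothetical road.
Section 1. The road shall be maintained by the county.
"""

# Map each chapter number to its list of (section number, section text).
laws = {
    chapter: split_on(SECTION_RE, body)
    for chapter, body in split_on(CHAPTER_RE, sample)
}
```

OCR errors in headings (for example "CHAPTEB") would slip past patterns this strict, which is one reason post-split cleanup is needed.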
Parse and Annotate Laws
Text Analysis
Can we determine which laws are Jim Crow?
Requires a training set to teach the algorithm what is/is not a Jim Crow law.
Laws in the training set identified by experts:
• Pauli Murray
• Richard Paschal
• William Sturkey
• Kimber Thomas
Supervised Classification
• To identify the best model, 80% of the training set was used to train models, while 20% was used to assess precision.
• XGBoost model selected for highest precision.
• Incorporated the type of law (public, private) and the year.
• Output was probability of law being Jim Crow.
• 90% probable Jim Crow cutoff selected (conservative).
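The evaluation steps above (an 80/20 split, precision as the selection metric, and a 90% probability cutoff) can be sketched in plain Python. This is an illustration only: it does not train XGBoost or any other model, the labels and probabilities are invented, and the helper names `train_test_split`, `precision`, and `flag` are hypothetical.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle and hold out `test_fraction` of the items for assessment."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def precision(labels, predictions):
    """Share of laws flagged as Jim Crow that really are Jim Crow."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if p and y)
    pred_pos = sum(1 for p in predictions if p)
    return true_pos / pred_pos if pred_pos else 0.0

def flag(probabilities, cutoff=0.9):
    """Flag a law only when its predicted probability meets the cutoff."""
    return [p >= cutoff for p in probabilities]

# Invented held-out labels (1 = Jim Crow) and model probabilities.
held_out_labels = [1, 1, 0, 0, 1]
held_out_probs = [0.97, 0.93, 0.70, 0.40, 0.95]

strict = precision(held_out_labels, flag(held_out_probs, cutoff=0.9))
loose = precision(held_out_labels, flag(held_out_probs, cutoff=0.5))
```

In this toy data the 0.9 cutoff flags only true Jim Crow laws, while a 0.5 cutoff also flags one false positive. A high cutoff trades missed laws for fewer false alarms, which is the sense in which it is conservative.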
Analysis
Identified 905 Jim Crow Laws
141 identified by experts
411 identified by the model only
353 identified by the model and confirmed by an expert
Version 2 is Forthcoming
• Improved corpus - more accurately split chapters and sections
• Improved text analysis - more advanced workflow
• Identified additional Jim Crow laws
• Training set
onthebooks.lib.unc.edu