On the Books: Jim Crow and Algorithms of Resistance

Amanda Henley, Head of Digital Research Services
on behalf of the On the Books Project Team


Student Workers: Montana Eck, Julia Long, Ashley Mullikin, Siri Nallaparaju, Tim Oyeleke, and Jenna Patton

Project Team

Neil Byers, Graduate Assistant - Documentation and Content Developer

Lorin Bruckner, Data Visualization Services Librarian - Text Analysis and Visualization Expert

Sarah Carrier, North Carolina Research and Instructional Librarian - Special Collections Expert

Rucha Dalwadi, Research Assistant - Documentation and Content Developer

James Dick, Graduate Assistant (& Attorney) - Law review and QA/QC

María R. Estorino, AUL for Special Collections & Director of the Wilson Library - Executive Sponsor and Liaison to the Library Leadership Team

Grant Glass, Graduate Assistant - Text Analysis Workflow

Amanda Henley, Head of Digital Research Services - PI and Project Lead

Hannah Jacobs, Graduate Assistant - Content Developer

Matt Jansen, Data Analyst - Text Analysis Expert and Statistician

Steve Segedy, Applications Analyst - Web Developer

William Sturkey, Faculty Member of History - Disciplinary Scholar

Kimber Thomas, African American Studies Scholar

Nathan Kelber, Ithaka - Collaborator (former PI and Project Lead)

Funding

About On the Books

Project to make North Carolina legal history accessible as a text corpus.

100+ years of North Carolina public, private, and local session laws

Project Goals:

• Create a corpus of NC Session Laws from 1865/66 to 1967

• Identify discoverable NC segregation statutes from the Jim Crow era using text analysis

Motivated by a reference question:

Where do I find a list of NC Jim Crow laws?

Workflow & Processes
For creating Collections as Data

• Compile Volume List

• Download Images from Internet Archive

• Preprocess Images
    • Identify location of marginalia and paratextual information
    • Rotate as needed
    • Crop image to main text body
    • Add color-matched borders
    • Adjust images to optimize OCR

• OCR over 80,000 Images
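The cropping and border steps listed above can be illustrated with a small sketch. This is not the project's code: it assumes a toy grayscale page stored as a list of pixel rows, and the crop coordinates and helper names (`crop`, `background_color`, `add_border`) are hypothetical. A real pipeline would use an imaging library on the scanned pages.

```python
from statistics import median

def crop(image, top, left, bottom, right):
    """Return rows top..bottom-1 and columns left..right-1 of a grayscale image."""
    return [row[left:right] for row in image[top:bottom]]

def background_color(image):
    """Estimate the page color as the median of the pixels on the outer edge."""
    edge = list(image[0]) + list(image[-1])
    edge += [row[0] for row in image[1:-1]] + [row[-1] for row in image[1:-1]]
    return int(median(edge))

def add_border(image, width, color):
    """Pad the image on all four sides with `width` pixels of `color`."""
    new_width = len(image[0]) + 2 * width
    padded = [[color] * new_width for _ in range(width)]
    for row in image:
        padded.append([color] * width + list(row) + [color] * width)
    padded += [[color] * new_width for _ in range(width)]
    return padded

# Toy 5x6 "page": 200 is paper, 0 is ink, 90 is a marginal note in the last column.
page = [
    [200, 200, 200, 200, 200, 200],
    [200,   0,   0, 200, 200,  90],
    [200,   0, 200,   0, 200,  90],
    [200, 200,   0,   0, 200, 200],
    [200, 200, 200, 200, 200, 200],
]

# Hypothetical coordinates for the main text body (they exclude the marginal column).
body = crop(page, top=1, left=1, bottom=4, right=4)
bordered = add_border(body, width=1, color=background_color(page))
```

Matching the border to the page color, rather than padding with pure white or black, avoids introducing a hard edge that an OCR engine might read as a rule or character.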

Marginalia and paratextual information were removed.

• Unit of analysis is the individual law

• Used pattern matching to split laws

• Extensive post-split cleanup

Results:

• 53,218 chapters

• 297,000 sections
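As a rough illustration of splitting laws by pattern matching, the sketch below cuts OCR text into chapters and then sections with regular expressions. The heading patterns (`CHAPTER 1`, `SECTION 1.`, `SEC. 2.`), the function names, and the sample text are assumptions made for this example. Headings in the actual volumes vary by year and OCR quality, which is why post-split cleanup is needed.

```python
import re

# Hypothetical heading patterns; real volumes are far less regular.
CHAPTER_RE = re.compile(r"^CHAPTER\s+(\d+)\s*$", re.MULTILINE)
SECTION_RE = re.compile(r"^(?:SECTION|SEC\.)\s+(\d+)\.", re.MULTILINE)

def split_on(pattern, text):
    """Split text at each match of pattern; return (number, body) pairs."""
    matches = list(pattern.finditer(text))
    parts = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append((int(m.group(1)), text[m.end():end].strip()))
    return parts

def split_laws(volume_text):
    """Return {chapter_number: [(section_number, section_text), ...]}.

    Text between a chapter heading and its first section (the act's title)
    is dropped in this simplified version.
    """
    return {
        chapter: split_on(SECTION_RE, body)
        for chapter, body in split_on(CHAPTER_RE, volume_text)
    }

# Invented sample text, not taken from the corpus.
sample = """CHAPTER 1
An act to incorporate a hypothetical company.
SECTION 1. The first provision.
SEC. 2. The second provision.
CHAPTER 2
An act to amend a hypothetical charter.
SECTION 1. The only provision.
"""

laws = split_laws(sample)
```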

Parse and Annotate Laws

Text Analysis

Can we determine which laws are Jim Crow?

Requires a training set to teach the algorithm what is/is not a Jim Crow law.

Laws in the training set identified by experts:

• Pauli Murray• Richard Paschal• William Sturkey• Kimber Thomas

Supervised Classification

• To identify the best model, 80% of the training set was used to train models, while 20% was used to assess precision.

• XGBoost model selected for highest precision.

• Incorporated the type of law (public, private) and the year.

• Output was probability of law being Jim Crow.

• Conservative cutoff selected: laws with a predicted probability of 90% or higher were classified as Jim Crow.
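The evaluation steps above can be sketched without any modeling library. The example below shows only the 80/20 split and how precision is computed at a probability cutoff. It does not train XGBoost, and the held-out scores, function names, and random seed are invented for illustration.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def precision_at_cutoff(scored, cutoff=0.90):
    """Precision = correctly flagged laws / all laws flagged at or above the cutoff.

    `scored` is a list of (probability, is_jim_crow) pairs.
    Returns None when no law reaches the cutoff.
    """
    flagged = [label for prob, label in scored if prob >= cutoff]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)

# Stand-in for the expert-labeled training set: 100 hypothetical law IDs.
labeled_ids = list(range(100))
train, test = train_test_split(labeled_ids)

# Invented held-out scores: (model probability, expert label).
held_out = [
    (0.97, True), (0.93, True), (0.91, True), (0.85, False),
    (0.60, True), (0.40, False), (0.10, False),
]

strict = precision_at_cutoff(held_out, cutoff=0.90)  # 3 flagged, 3 correct
loose = precision_at_cutoff(held_out, cutoff=0.50)   # 5 flagged, 4 correct
```

In this toy data the 0.90 cutoff flags fewer laws but every flagged law is correct, while the 0.50 cutoff catches one more true Jim Crow law at the cost of a false positive. That trade-off is what makes a high cutoff the conservative choice.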

Analysis

Identified 905 Jim Crow Laws

141 identified by experts

411 identified by the model only

353 identified by the model and confirmed by an expert
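A quick check of how the three groups above add up to the total, using only the counts reported here (the variable names are illustrative):

```python
counts = {
    "identified by experts": 141,
    "identified by the model only": 411,
    "identified by the model and confirmed by an expert": 353,
}

total = sum(counts.values())

# Laws with some expert involvement: expert-identified plus expert-confirmed.
expert_involved = (
    counts["identified by experts"]
    + counts["identified by the model and confirmed by an expert"]
)

# Share of the total that rests on the model's prediction alone.
model_only_share = counts["identified by the model only"] / total

print(total, expert_involved, round(model_only_share, 3))
```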

Version 2 is Forthcoming

• Improved corpus - more accurately split chapters and sections

• Improved text analysis - more advanced workflow

• Identified additional Jim Crow laws

• Training set

onthebooks.lib.unc.edu
