Digital humanities: modeling semi-structured data from traditional scholarship

Tom Lippincott (tom@cs.jhu.edu)
NLP Fall 2019
Human Language Technology Center of Excellence
Center for Language and Speech Processing

Outline

Intro: A few thoughts on “Digital humanities”

Graphs and Autoencoders

Motivating study: Post-Atlantic Slave Trade

Another application: Authorship attribution of the Hebrew bible

Ongoing work


Intro: A few thoughts on “Digital humanities”

What is “digital humanities”?

Some responses:

• “an idea that will increasingly become invisible” -Stanford

• “a term of tactical convenience” -UMD

• “I don’t: I’m sick of trying to define it” -GMU

• “a convenient label, but fundamentally I don’t believe in it” -NYU

• “an unfortunate neologism” -Library of Congress


What is “digital humanities”?

Themes at DH2019

• Visualization

• Geographic information systems

• Social and ethical issues

• Education

• VR, maker spaces

• OCR

• Machine learning


Working definitions

Digital humanities
Traditional inquiries enabled by computational intelligence

Traditional scholar
Academic from a field that doesn’t typically employ quantitative methods (History, Literary Criticism, ...)

(Traditional) scholarly dataset
Data assembled by a traditional researcher in the field

Computational researcher
Designs and brings machine learning models to bear on datasets

Why is collaboration rare?

Traditional scholars have insight into the data

• Data is painstakingly gathered and coveted

• Hypotheses are subtle but not numerically evaluated

• May publish one or two papers during PhD, but dissertation is primary focus

Computational researchers can pair data with appropriate models

• Data is aggressively shared to encourage rigorous evaluation

• Tasks are often shallow and prespecified

• Publish multiple papers per year

Topic models: a success story

Widely used

• Low barrier to entry: everyone has “documents”

• Little expertise required to train

• Output easy to visualize and interpret

Widely abused

• Deceptively easy to use: it will always give you something

• You can always find “patterns”: confirmation bias abounds

• Older than some undergrads: LDA from early 2000s


A guiding challenge:

Can we leverage sophisticated modeling techniques without losing the advantages that popularized topic models, and without recreating some of the same bad community practices?

Aside: Traditional Scholars are Knowledge Workers

Financial analysts, investigative reporters . . .

• Concerned with specific domains

• Need to gather and understand datasets

• Construct and reason over knowledge bases

• Wide range of technical abilities

• The DH story is relevant to industry, government, etc


Graphs and Autoencoders


General relational data

[Figure: a graph whose nodes are two people (John, 23, M; Jane, 31, F), a business (Crazy Ray’s), and two cars (Honda, 25000; Ford, 35000)]

Common format: JSON

[
  {"name": "John", "age": 23, "gender": "M"},
  {"name": "Jane", "age": 31, "gender": "F"},
  {"business": "Crazy Ray's"},
  {"make": "Honda", "price": 25000, "owned_by": 0, "sold_by": 2},
  {"make": "Ford", "price": 35000, "owned_by": 1, "sold_by": 2}
]
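One way to read this JSON as a graph is sketched below. It assumes the relation fields are spelled "owned_by" and "sold_by" and that their integer values are positions in the list; the helper names build_edges and build_adjacency are made up for illustration and are not part of the talk.

```python
import json

# The slide's example records; relation fields hold list indices.
RECORDS = json.loads("""
[
  {"name": "John", "age": 23, "gender": "M"},
  {"name": "Jane", "age": 31, "gender": "F"},
  {"business": "Crazy Ray's"},
  {"make": "Honda", "price": 25000, "owned_by": 0, "sold_by": 2},
  {"make": "Ford", "price": 35000, "owned_by": 1, "sold_by": 2}
]
""")

# Assumption: these two field names are the ones that point at other records.
RELATION_FIELDS = ("owned_by", "sold_by")


def build_edges(records, relation_fields=RELATION_FIELDS):
    """Return (source_index, relation_name, target_index) triples."""
    edges = []
    for source, record in enumerate(records):
        for field in relation_fields:
            if field in record:
                edges.append((source, field, record[field]))
    return edges


def build_adjacency(records, edges):
    """Undirected neighbor sets, one per record index."""
    adjacency = {index: set() for index in range(len(records))}
    for source, _relation, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


EDGES = build_edges(RECORDS)
ADJACENCY = build_adjacency(RECORDS, EDGES)
```

Every other field (name, age, price, ...) stays on its entity as a plain attribute; only the index-valued fields become edges.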


Let’s design a model that naturally adapts to the data structure

Building-blocks for modeling relational data

Encoders, decoders, and autoencoders
Capture the entities and fields

Graph convolutional networks
Capture the relationships

Encoder

[Figure: some input data passes through hidden layers to a bottleneck, which feeds a task: classification, regression, generation, some combo ...]

Decoder

[Figure: hidden layers produce some output data]

Encoders and decoders are often paired

If the goal is to reconstruct the input, it’s an autoencoder

[Figure: a paired encoder and decoder, labeled “Studies”, “CliffNotes”, and “Recites”, with the annotation “These are the same”]

*coder summary

• An encoder transforms data into a fixed-length representation

• A decoder takes a fixed-length representation and generates data

• An autoencoder is an encoder and decoder working together to preserve data through a bottleneck
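The three bullets can be made concrete with a deliberately tiny sketch. It is not the model from the talk: it is a linear autoencoder with a one-number bottleneck, trained by plain gradient descent on invented 2-D points, and every name in it (DATA, encode, decode, train) is an assumption made for illustration.

```python
# Toy data: 2-D points that all lie on the line y = 2x, so one number
# (the bottleneck) is enough to describe each point.
DATA = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]


def encode(x, w):
    """Encoder: compress a 2-D point into a single bottleneck value."""
    return w[0] * x[0] + w[1] * x[1]


def decode(z, v):
    """Decoder: expand the bottleneck value back into a 2-D point."""
    return (v[0] * z, v[1] * z)


def reconstruction_loss(data, w, v):
    """Mean squared error between each point and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(data, steps=500, lr=0.01):
    """Plain gradient descent on the reconstruction loss."""
    w, v = [0.5, 0.5], [0.5, 0.5]  # fixed start so the run is repeatable
    for _ in range(steps):
        grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(x, w)
            x_hat = decode(z, v)
            err = (x_hat[0] - x[0], x_hat[1] - x[1])
            # d(loss)/dz, then the chain rule back to the encoder weights
            dz = 2 * (err[0] * v[0] + err[1] * v[1])
            for i in range(2):
                grad_v[i] += 2 * err[i] * z / len(data)
                grad_w[i] += dz * x[i] / len(data)
        w = [w[i] - lr * grad_w[i] for i in range(2)]
        v = [v[i] - lr * grad_v[i] for i in range(2)]
    return w, v


LOSS_BEFORE = reconstruction_loss(DATA, [0.5, 0.5], [0.5, 0.5])
W, V = train(DATA)
LOSS_AFTER = reconstruction_loss(DATA, W, V)
```

The training target is the input itself, which is what makes this an autoencoder rather than a classifier or regressor. Real encoders and decoders add nonlinear hidden layers and handle strings, categories, and so on.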


On to graph convolutions...

Normal 1D CNN

[Figure: a grid (image, text ...). Each position incorporates its “receptive field”. Repeat the process to expand the field.]

Graph convolutional network (GCN)

[Figure: graph nodes (e.g. entities) and adjacent nodes (related entities). Each node incorporates its neighbors. Info spreads according to the graph.]

GCN summary

• Extends CNNs from grids to graphs

• Information passes along edges

• Each GCN layer allows nodes to see one further “hop”
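The “one further hop per layer” point can be checked with a stripped-down sketch. It assumes the simplest possible aggregation, where each node averages its own features with its neighbors’; the learned weight matrix and nonlinearity of a real GCN layer are left out on purpose, and the names gcn_layer, PATH, and FEATURES are hypothetical.

```python
def gcn_layer(features, adjacency):
    """One simplified graph-convolution step.

    Each node's new feature vector is the mean of its own vector and its
    neighbors' vectors. (A real GCN layer would also apply a learned
    weight matrix and a nonlinearity; both are omitted here.)
    """
    updated = []
    for node, neighbors in enumerate(adjacency):
        group = [node] + list(neighbors)
        dim = len(features[node])
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return updated


# A path graph 0 - 1 - 2 - 3, with a signal placed only on node 0.
PATH = [[1], [0, 2], [1, 3], [2]]
FEATURES = [[1.0], [0.0], [0.0], [0.0]]

AFTER_1 = gcn_layer(FEATURES, PATH)
AFTER_2 = gcn_layer(AFTER_1, PATH)
AFTER_3 = gcn_layer(AFTER_2, PATH)
```

After one layer only node 1 has picked up the signal from node 0; node 2 needs two layers and node 3 needs three, mirroring how stacked CNN layers widen the receptive field on a grid.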


Modeling relational data

• Encoders, decoders, autoencoders

• Graph convolutional mechanism

• Combine these to match the data being modeled


Graph Entity Autoencoder (GEA)

[Figure: an entity with the fields John, 23, M. Component labels: RNN, Ident, Embed; Summarize; RNN, MLP-MSE, MLP-NLL.]

How can we use a trained model?

• Compute distance between two entities

• Find flat or hierarchical clusters of entities

• Generate likely value of missing field

• Detect an improbable value of a present field

• Observe response of one field to another
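The first two uses reduce to geometry on the bottleneck representations. The sketch below assumes each entity already has a fixed-length vector from a trained model; the three vectors are invented purely for illustration (they are not model output), and cosine_distance and most_similar are hypothetical helper names.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_similar(name, embeddings, k=1):
    """Names of the k entities closest to `name`, nearest first."""
    ranked = sorted(
        (cosine_distance(embeddings[name], vector), other)
        for other, vector in embeddings.items()
        if other != name
    )
    return [other for _distance, other in ranked[:k]]


# Invented vectors standing in for a trained model's entity representations.
EMBEDDINGS = {
    "Baltimore, Austin Woolfolk": [0.90, 0.10, 0.00],
    "Baltiomre, Austin Woolfolk": [0.88, 0.12, 0.01],
    "New Orleans, Leon Chabert": [0.10, 0.90, 0.20],
}

NEAREST = most_similar("Baltiomre, Austin Woolfolk", EMBEDDINGS)
```

Clustering works on the same pairwise distances. The remaining uses (filling in a missing field, flagging an improbable one) would go through the decoders instead and are not shown.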


Motivating study: Post-Atlantic Slave Trade

Shipping manifests

slave name   slave sex   slave age   owner name   journey date   vessel type
Willis       m           20          Amidu        1832/9/24      Schooner
Maria        f           19          Amidu        1832/09/24     Schooner

Fugitive notices

slave name   slave sex   escape date   escape location   owner name   notice reward   notice date
Davy         m           1795/10/15    Port Tobacco      Bourman      3 Pistoles      1796/02/21

Some numbers

• 45k manifest entries spanning five cities

• 11k fugitive notices from 70 gazettes

• 28k unique slave names

• 7k unique owner names

• Not big data, but thousands of studies like this at a research university!

Difficulties with data in the wild

• Unnormalized
  • People/places/things recorded many times
  • “What’s the age/height/sex distribution of escapees?”

• Noisy
  • Vessel type: Bark, Barke, BArque, Barque, Barques
  • Slave name: “Nelly’?, Nelly’s child”, “not visible”
  • Owner sex: 3k missing

• Underspecified entities
  • Majority of slaves have no last name
  • Can’t tell if two “Johns” are the same person

What might a historian want to do with this data?

• Follow one slave throughout their life

• Group owners according to the nature of their workforce

• Map out trade “ecosystems” of sellers, shippers, owners, etc.

• Determine what drove valuation in transactions and rewards

• Reconstruct slave families when there are no last names


Ask the traditional scholar to follow some simple guidelines when gathering data

Entities, field types, and relations

Traditional scholarly data

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 69: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Numbers

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 70: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Categories

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 71: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Strings

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 72: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

More complex fields

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 73: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Entities

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 74: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Entities

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 75: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Slave-to-owner

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

30

Page 76: Digital humanities: modeling semi-structured data from ... · Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott (tom@cs.jhu.edu) NLP Fall

Entities, field types, and relations

Vessel-to-voyage, slave-to-voyage

slave name    Jim
slave age     20
owner name    Jane
owner sex     F
vessel name   Uncas
vessel type   Brig
voyage date   6/2/1823
voyage dest   29.9,90.0
...           ...

Row 1


Entities, field types, and relations

Fewer assumptions

slave name    Jim
slave age     20
owner name    Jane
owner sex     F
vessel name   Uncas
vessel type   Brig
voyage date   6/2/1823
voyage dest   29.9,90.0
...           ...

Row 1


Example data point: one graph component


Train a GEA model . . .


Example insights looking at most-similar entities

Mistranscriptions

• Baltiomre, Austin Woolfolk ⇔ Baltimore, Austin Woolfolk
• New Orleans, William Wiliams ⇔ New Orleans, William Williams

Semantically-equivalent variants

• Baltimore, George Y. Kelso ⇔ Baltimore, Kelso & Ferguson
• New Orleans, Leon Chabert ⇔ Louisiana, Leon Chabert

Same slave transported multiple times

• Louisa, F, 16yo ⇔ Louisa, F, 17yo
• Waters, F, 14yo ⇔ Waters, F, 15yo
• Kesiah, F, 20yo ⇔ Kesiah, F, 22yo
• Taylor, F, 15yo ⇔ Taylor, F, 16yo
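
As a rough illustration of what "looking at most-similar entities" can mean in practice, the sketch below ranks entities by cosine similarity between their learned vectors. The three-dimensional vectors are hand-made placeholders rather than output of the trained model, and the function name `most_similar` is hypothetical.

```python
import numpy as np

def most_similar(names, vectors, query, k=1):
    """Return the k entities whose vectors are closest (by cosine) to `query`."""
    vecs = np.asarray(vectors, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = names.index(query)
    sims = unit @ unit[q]
    sims[q] = -np.inf  # never return the query itself
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]

# Placeholder embeddings (assumption: invented for illustration only).
names = [
    "Baltiomre, Austin Woolfolk",
    "Baltimore, Austin Woolfolk",
    "New Orleans, Leon Chabert",
    "Louisiana, Leon Chabert",
]
vectors = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.01],
    [0.00, 0.20, 0.95],
    [0.05, 0.25, 0.90],
]

nearest = most_similar(names, vectors, "Baltiomre, Austin Woolfolk")
```

With real embeddings, a very high similarity between two differently spelled records is a prompt for a human to check the source document, not an automatic merge.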


Another application: Authorship attribution of the Hebrew bible


Transmission of a text: the “Documentary Hypothesis”


Hypothesis as pointers into document structure

Sources: Jehovist, Elohist, Redactor

Document structure, top to bottom:

Bible
Book 1, Book 2
Chapter 1, Chapter 2
Verse 1, Verse 2
Word 1, Word 2
Morph 1, Morph 2
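
One way to picture "pointers into document structure" is a small tree for the text plus a mapping from each source to the nodes it is credited with. The sketch below is only an illustration under that assumption: the `Node` class, the helper names, and the verse assignments are all hypothetical and carry no claim about the actual hypothesis.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One unit of the document hierarchy (book, chapter, verse, word, ...)."""
    label: str
    children: list = field(default_factory=list)

    def add(self, label):
        child = Node(label)
        self.children.append(child)
        return child

def resolve(root, path):
    """Follow a tuple of labels down from the root and return that node."""
    node = root
    for label in path:
        node = next(c for c in node.children if c.label == label)
    return node

def leaves(node):
    """All lowest-level units underneath a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

# Tiny placeholder hierarchy using the labels from the slide.
bible = Node("Bible")
chapter = bible.add("Book 1").add("Chapter 1")
for verse_label in ("Verse 1", "Verse 2"):
    verse = chapter.add(verse_label)
    verse.add("Word 1")
    verse.add("Word 2")

# Hypothetical pointers: which source each verse is attributed to.
hypothesis = {
    "Jehovist": [("Book 1", "Chapter 1", "Verse 1")],
    "Elohist": [("Book 1", "Chapter 1", "Verse 2")],
}

# Flatten the pointers into (source, unit) pairs usable as labeled examples.
examples = [
    (source, leaf.label)
    for source, paths in hypothesis.items()
    for path in paths
    for leaf in leaves(resolve(bible, path))
]
```

Because a pointer can target any level of the tree, the same representation covers a source credited with a whole chapter and one credited with a single word.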


Assume the hypothesis, see how various models and features learn it as a supervised classification problem


Thomas Mendenhall: The Characteristic Curves of Composition


Mosteller and Wallace: Inference in an Authorship Problem

The Federalist papers

• 85 articles written by Hamilton, Madison, and Jay

• 12 are unattributed

• Frequency analysis of function words pointed to Madison as the author
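
To make "frequency analysis of function words" concrete, here is a minimal sketch that profiles each text by its rate per 1,000 tokens for a short word list and picks the closest known profile. This nearest-profile comparison is a simplification chosen for illustration; Mosteller and Wallace's actual analysis was a Bayesian one. The word list, the author labels, and the toy sentences are all invented.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption, not the study's list).
FUNCTION_WORDS = ["upon", "while", "whilst", "on", "by", "to"]

def rates(text, vocab=FUNCTION_WORDS):
    """Occurrences per 1,000 tokens for each word in `vocab`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [1000 * counts[w] / total for w in vocab]

def closest_author(disputed, known):
    """Name of the known author whose rate profile is nearest to `disputed`."""
    target = rates(disputed)

    def distance(author):
        profile = rates(known[author])
        return sum((a - b) ** 2 for a, b in zip(target, profile))

    return min(known, key=distance)

# Invented toy texts standing in for each author's known writing.
known = {
    "author_a": "upon the whole it depends upon the people while they wait",
    "author_b": "whilst the people wait on the states they rely on the union",
}
guess = closest_author("whilst the states rely on the people", known)
```

Function words are attractive here because their rates depend little on topic, so differences are more likely to reflect habit than subject matter.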


Back to the Documentary Hypothesis

Problems

• The “authors” are also editors, redactors, synthesizers ... they interact in context-dependent ways

• There is no predefined segmentation into “articles”

• We know that more than function words matter (e.g. the name of God)

Solutions

• Limit vocabulary to words that are used frequently by all authors

• Employ a GCN to exploit the document structure
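
The vocabulary restriction in the first solution can be sketched as a simple filter: keep a word only if every author uses it at least some minimum number of times. The threshold `min_count`, the function name, and the English toy tokens are assumptions made for illustration.

```python
from collections import Counter

def shared_vocabulary(tokens_by_author, min_count=2):
    """Words that every author uses at least `min_count` times."""
    counts = {a: Counter(toks) for a, toks in tokens_by_author.items()}
    if not counts:
        return []
    common = set.intersection(*(set(c) for c in counts.values()))
    return sorted(
        w for w in common if all(c[w] >= min_count for c in counts.values())
    )

# Toy token lists per source label (placeholders, not real text).
tokens_by_author = {
    "J": "and the lord said and the man".split(),
    "E": "and god said and the the".split(),
    "P": "and the and the god created".split(),
}
vocab = shared_vocabulary(tokens_by_author)
```

In this toy input only "and" and "the" survive, while words that some source never uses, or uses only once, are dropped.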


GEA predicts the author slightly better . . .

Model   F-score
LR      41.39
MLP     47.45
GEA     48.60

Confusion matrix (rows: gold, columns: guess)

Gold    J    E    P   1D   2D   nD    R    O
J     100    8    7    0    0    0    3    0
E      22   53    8    0    0    0    0    0
P      13    5   77    0    1    0    4    0
1D      2    0    2    7    1    0    0    0
2D      2    2    1    0    5    0    0    0
nD      0    0    0    1    0    0    0    0
R       3    3   11    0    0    0   33    0
O       2    0    1    0    0    0    1    0
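
For readers who want to work with the confusion matrix directly, the sketch below derives accuracy and per-class F1 from it, assuming rows are gold labels and columns are guesses. The slide does not say which model the matrix belongs to or how the reported F-scores were averaged, so this is not presented as a reproduction of those numbers.

```python
import numpy as np

labels = ["J", "E", "P", "1D", "2D", "nD", "R", "O"]

# Assumption: rows are gold labels, columns are guesses.
cm = np.array([
    [100,  8,  7, 0, 0, 0,  3, 0],
    [ 22, 53,  8, 0, 0, 0,  0, 0],
    [ 13,  5, 77, 0, 1, 0,  4, 0],
    [  2,  0,  2, 7, 1, 0,  0, 0],
    [  2,  2,  1, 0, 5, 0,  0, 0],
    [  0,  0,  0, 1, 0, 0,  0, 0],
    [  3,  3, 11, 0, 0, 0, 33, 0],
    [  2,  0,  1, 0, 0, 0,  1, 0],
])

true_pos = np.diag(cm)
gold_totals = cm.sum(axis=1)    # how often each label is the gold answer
guess_totals = cm.sum(axis=0)   # how often each label is guessed

# F1 = 2*TP / (gold + guessed); defined as 0 when a class never occurs.
denom = gold_totals + guess_totals
f1 = np.divide(2 * true_pos, denom, out=np.zeros(len(labels)), where=denom > 0)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()
per_class = dict(zip(labels, f1.round(3)))
```

The per-class view makes the pattern in the matrix explicit: the frequent sources J, E, P and R are recovered reasonably well, while nD and O are never guessed at all.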


Error analysis

Sentiment and in-context word senses

• “wife” shows up as polygamous in older but monogamous in newer sources

• Redactor’s positive view of Aaron+Moses, violent story of rebellion

Narrative continuity

• Abraham and Isaac story thought to originally end with sacrifice, changed by the Redactor

• “it was the season for grapes” (travel and geographic locations) “They broke off some grapes.”


Ongoing work


Visualizing results


Other applications


Thanks!

Quick plug: come to David Mimno’s talk!

• Nov. 15 at noon (Hackerman B17)

• CS Professor at Cornell

• Rare CS faculty working in DH (topic modeling)

References

• Embedding Multimodal Relational Data for Knowledge Base Completion, Singh et al., 2018

• Inductive Representation Learning on Large Graphs, Hamilton et al., 2017

• Semi-Supervised Classification with Graph Convolutional Networks, Kipf et al., 2016

• Reducing the Dimensionality of Data with Neural Networks, Hinton et al., 2006