demystifying digital humanities: winter 2014 workshop #2: programming on the whiteboard

Winter 2014: Session #2: Programming on the Whiteboard

(Paige Morgan, Sarah Kremen-Hicks, Brian Gutierrez)

Previously, at DMDH...

•The work of creating usable data

•Forms that this data might take:

•markup language

•spreadsheets

Workshop #2

•Caveat Curator (challenges of working with data)

•Programming on the whiteboard, i.e., conceptualizing the specific steps that you need to take to accomplish your goals

Why this focus on data?

•Understanding your data, and your intended actions, is a key skill for working with any programming language or platform.

•This is true whether you are the programmer or whether you are working with professional programmers.

Programming languages are like human languages in that they both have phrases, patterns, and rules.

Programming languages are unlike human languages in that they aren’t for communicating with people.

They are also unlike human languages in that every programming utterance does something, i.e., causes an action to occur.

You can get used to patterns – even unfamiliar ones.

The shift is in getting used to thinking in terms of every single action.

Our subject matter today is all actions that you’ll need to think about before you work with...

Image: Josh Lee, @wtrsld, via Twitter, January 2014.

Even when you’re just experimenting, you need to prep your data.

You may know your dataset in detail already, from your research -- but your computer is concerned with different levels of detail.

Becoming aware of those levels of detail is not only helpful for your project ideas...

...it’s also a useful skill for working with programming languages (where a stray /> or ; can break your program/website).

Caveat Curator

Data only works if your computer can read it.

But my data is just text!

(Isn’t that easy?)

(Remember, your computer is fairly stupid.)

Formatted text is often full of hidden characters that your computer can’t parse correctly.

The┘re┘sÜlt ís that yoÜr te┘xt

might come┘ oÜt looking

like┘this

whe┘n yoÜ ope┘n it in a

programming e┘nvironme┘nt.

So you need to convert it to plain text (without any of the fancy details encoded in MS Word fonts).
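A minimal sketch of what "convert it to plain text" could look like in Python, assuming the trouble comes from word-processor punctuation (curly quotes, long dashes, non-breaking spaces) or from text read with the wrong encoding. The function name `to_plain_text` and the replacement table are illustrative assumptions, not part of the workshop materials.

```python
import unicodedata

# Hypothetical replacement table: typographic characters that word
# processors insert, mapped to plain ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
    "\u00a0": " ",   # non-breaking space
}

def to_plain_text(text: str) -> str:
    """Replace common typographic characters, then normalize the rest."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    # NFKC folds compatibility characters (such as ligatures) into plain forms.
    return unicodedata.normalize("NFKC", text)

sample = "\u201cIsn\u2019t that easy?\u201d"
print(to_plain_text(sample))

# How garbling can arise: bytes written as UTF-8 but read as Latin-1.
garbled = "r\u00e9sultat".encode("utf-8").decode("latin-1")
print(garbled)
```

The second half only demonstrates one way text becomes garbled. The usual remedy is to reopen the file with the encoding it was actually saved in.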

But even that can produce unexpected errors.

Maybe you want to work with sailing data and ports of call:

The ship you’re interested in leaves the Ivory Coast for St. Helena...

But when you create your map, you get this:

The latitude/longitude coordinate is the significant datum.

The city name is just the human-readable component.

Each datum needs to be unique.

Figuring out what sort of unique configuration will work best involves at least some experimentation.
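One way to see why the coordinate, not the name, has to be the unique datum is sketched below. The two records and their rounded coordinates are rough illustrative values, and the field names are assumptions made for this example.

```python
# Two different places can share one human-readable name.
# Coordinates are rough, rounded values for illustration only.
records = [
    {"name": "St. Helena", "lat": -15.95, "lon": -5.7, "note": "South Atlantic island"},
    {"name": "St. Helena", "lat": 38.5, "lon": -122.5, "note": "town in California"},
]

# Keyed by name: the second record silently overwrites the first.
by_name = {r["name"]: r for r in records}

# Keyed by coordinate pair: each datum stays distinct.
by_coordinate = {(r["lat"], r["lon"]): r for r in records}

print(len(by_name))        # one entry survives
print(len(by_coordinate))  # both entries survive
```

Which key works best for a given project still takes experimentation; a coordinate pair is only one candidate.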

To experiment effectively, you’ll want to keep careful records.

If you develop categories of information, you’ll want to keep a record of what each category means, and what its limits are.
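A record of categories can be as simple as a small codebook kept alongside the data. The sketch below assumes a single hypothetical category, `genre`, with made-up allowed values; the structure is an illustration rather than a prescribed format.

```python
# Hypothetical codebook: each category records its meaning and its limits.
CODEBOOK = {
    "genre": {
        "meaning": "Kind of visual/cultural event held at a place",
        "allowed": {"gallery", "play", "other"},
    },
}

def within_limits(category: str, value: str) -> bool:
    """Return True if the value falls inside the limits recorded for the category."""
    return value in CODEBOOK[category]["allowed"]

print(within_limits("genre", "play"))
print(within_limits("genre", "opera"))
```

Writing the limits down in one place makes it easier to notice when an experiment quietly stretches a category.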

Cleaning and structuring your data is a foundational issue that changes depending on the format in which your data is available.

What if your data is crowdsourced?

You can require a particular format for submissions.

You can even put programmatic limits on the formats available for submission.

But in the end, you’re still going to need to scrub and/or format.
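The sketch below illustrates both halves of that point: a programmatic limit on a submission format, and the scrubbing that is still needed afterwards. The four-digit-year rule, the field names, and the sample submissions are all invented for illustration.

```python
import re

def scrub(value: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(value.split())

def accept(submission: dict) -> bool:
    """Programmatic limit (assumed rule): the year must be exactly four digits."""
    return re.fullmatch(r"\d{4}", scrub(submission["year"])) is not None

# Invented sample submissions.
submissions = [
    {"place": "  Vauxhall ", "year": "1800"},
    {"place": "vauxhall", "year": "c. 1800"},
]

accepted = [s for s in submissions if accept(s)]

# Even accepted text needs scrubbing before values can be compared.
places = {scrub(s["place"]).title() for s in submissions}

print(len(accepted))
print(places)
```

The format rule rejects the second submission, but only scrubbing reveals that both submissions name the same place.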

This is true even for data from supposedly reputable sources, like government or media organizations.

Example: Doctor Who Villains dataset

http://tinyurl.com/doctorwhovillains

This step is no fun!

But it’s absolutely necessary.

What does a baby computer call his father? “Data.”

Break!

Working with “little data”:

GIS and the Spatial Turn

GIS technology has paved the way for analyzing qualitative data associated with cultural experiences.

“A good map is worth a thousand words, cartographers say, and they are right: because it produces a thousand words: it raises doubts, ideas. It poses new questions, and forces you to look for new answers.” (Moretti 1998, 3–4)

Literary texts are filled with subjective spatial data: an author's or character's articulation of geographically located dwellings, urban and rural landscapes, and performance spaces.

Project: Mapping William Wordsworth's Conspicuous Consumption in The Prelude

(Brian R. Gutierrez)

Objective: to map the visual culture events referenced in Wordsworth’s autobiographical poem The Prelude (as well as the ones not referenced)

Problem to solve: Prove that literary galleries, specifically John Boydell’s “Shakespeare Gallery,” shaped the dramaturgical choices in the only play written by Wordsworth. He reads Shakespeare not through a personal copy of the plays, but through the visual and performative texts of that time.

Data: place-names, indirect references, and all non-referenced visual cultural events

Access to data: Project Gutenberg, digital archive of British newspapers and periodicals

What to do with that data?

Map it!!

First data set: Literary spatial articulations

Wordsworth mentions the following place names and references:

"Oh wondrous power of words, how sweet they are / According to the meaning which they bring-- / Vauxhall and Ranelagh, I then had heard / Of your green groves and wilderness of lamps, / Your gorgeous ladies, fairy cataracts, / And pageant fireworks" (119-125)

"Half-rural Sadler's Wells" (267)

First, I need to know what and where these places were in order to identify them as spatial data.

Ex: Vauxhall and Ranelagh

Second, if I'm interested in visual cultural experiences, I need to identify what kind of event occurred there: gallery, play, etc.

Third, how would I access the data? Answer: place-names in a book are not under any copyright.

However, if I wanted to display sections of the text when a viewer clicks on a place name, then I would have to think about copyright. The text is on Project Gutenberg, so that's covered.

Fourth, I would have to locate any indirect reference to visual cultural phenomena.

Ex: Wordsworth mentions two actresses by name: Mary Robinson and Sarah Siddons.

Since I cannot map a person, I need to investigate which plays they were in, and at which theaters, during that moment of his life (it's an autobiography).

Fifth, I need to research what special events were occurring at other places he mentions. For that, I look to The Times (newspapers) and various periodicals.

Sixth, because I am going to create a map using ArcGIS, I need to put my data in an Excel spreadsheet so that the program can read it.
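As a rough illustration of this step, the sketch below writes the place names and line references quoted earlier into CSV, a plain-text table format that spreadsheet programs can open. Using CSV in place of an Excel file is an assumption, as are the column names. The genre and coordinate columns are left blank because they still have to be researched.

```python
import csv
import io

# Assumed column names, one per category of information.
FIELDS = ["place_name", "poem_lines", "genre", "latitude", "longitude"]

# Place names and line references come from the quoted passages;
# the remaining columns are deliberately left empty.
rows = [
    {"place_name": "Vauxhall", "poem_lines": "119-125"},
    {"place_name": "Ranelagh", "poem_lines": "119-125"},
    {"place_name": "Sadler's Wells", "poem_lines": "267"},
]

# Write to an in-memory buffer; a real project would open a file instead,
# e.g. open("places.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

One row per place and one column per category keeps each datum in its own cell, which is the shape mapping software generally expects.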

What is the relationship between the data?

Analyze the qualitative data

Humanist skill = DHumanist skill

Programming on the whiteboard involves looking at the categories of information, and thinking about how they interact.

Categories

•Place names

•Poetic lines

•Genre of visual/cultural event

•Spatial data (latitude/longitude)
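A small sketch of how two of these categories can interact: the same passage of the poem may refer to more than one place, so grouping place names by line reference shows which places Wordsworth mentions together. The variable names are illustrative; the pairs come from the passages quoted earlier.

```python
from collections import defaultdict

# (place name, line reference) pairs taken from the quoted passages.
references = [
    ("Vauxhall", "119-125"),
    ("Ranelagh", "119-125"),
    ("Sadler's Wells", "267"),
]

# Group place names under the passage that mentions them.
places_by_passage = defaultdict(list)
for place, passage in references:
    places_by_passage[passage].append(place)

for passage, places in places_by_passage.items():
    print(passage, places)
```

The same grouping could be run on genre or coordinates once those categories are filled in.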

Return to the source of the original data (the literary text) to examine how the author is describing these phenomena.

Why use ArcGIS?

Benefits of ArcGIS

•It allows the overlay of historical maps

•Trainings were available and accessible (through DHSI and UW courses)

•As a software program, ArcGIS is established enough to be considered robust

•Available through the UW software suite

Disadvantages of ArcGIS

•Available only for PCs

• Proprietary file format (even if input data is open-access, the end result is not)

•Available only on an annual subscription model (and prohibitively expensive for scholars without campus-granted access)

In Franco Moretti’s Atlas of the European Novel 1800-1900 (1998), he calls for a “literary geography,” predicated on the creation of “readerly maps” and the use of those maps as analytical tools.

Caveats?

The pursuit of mapping data may exclude complex social spaces (e.g., gendered domestic environments).

Caveats?

Cartographical representations should not be divorced from their primary texts.

Project: Visualizing Prosody

(Sarah Kremen-Hicks)

x / |x /|xx / | x / |x /

Sir Walter Vivian all a summer's day

/ x | / x | x / | x / | x /

Gave his broad lawns until the set of sun

Marking up a poem for metrical scansion is encoding it with data.

What can a computer do with that data?

Computers are good at counting things – like iambs.

Is it possible to predict deviations from a metrical norm based on author or lyric classification?

Will authors show a tendency for particular types of metrical substitution?

Prepping the Data

•For proof of concept, start with one author (Alfred, Lord Tennyson)

•Get Tennyson’s poems from Project Gutenberg

•Hand-mark representative poems for prosody

Programming on the Whiteboard

What should the computer do?

Computer tasks

•Count feet per line

•Recognize | as a foot boundary

•Recognize carriage return as a line boundary

•Supply foot boundaries at beginning/end of lines

•Count the number of areas contained within foot boundaries for each line
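The tasks above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names are invented, and it assumes the scansion marks for each verse line sit on their own line of text.

```python
def split_feet(scansion_line: str) -> list[str]:
    """Split one line of scansion marks into feet at each | boundary."""
    # Supply foot boundaries at the beginning and end of the line,
    # so that every foot is enclosed by a pair of | marks.
    bounded = "|" + scansion_line.strip().strip("|") + "|"
    # Keep only the areas between boundaries, removing spaces inside each foot.
    return ["".join(foot.split()) for foot in bounded.split("|")[1:-1]]

def feet_per_line(scansion: str) -> list[int]:
    """Treat each line break as a line boundary and count the feet per line."""
    return [len(split_feet(line)) for line in scansion.splitlines() if line.strip()]

scansion = "x / |x /|xx / | x / |x /\n/ x | / x | x / | x / | x /"
print(split_feet("x / |x /|xx / | x / |x /"))
print(feet_per_line(scansion))
```

Each whiteboard bullet maps onto one small operation, which is why it pays to spell the steps out before choosing a language.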

These steps involve recognizing each metrical foot as a unit that contains particular accentual-syllabic data.

x / |x /|xx / | x / |x /

Sir Walter Vivian all a summer's day

Computer tasks, cont’d.

•Identify the most common number of feet per line

•Supply a report on lines (by number) that deviate

•Calculate rate of deviation/adherence

•Mode = paradigm
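A minimal sketch of these tasks, assuming the feet have already been counted for each line. The function name and the sample counts are invented for illustration.

```python
from collections import Counter

def line_length_report(feet_counts: list[int]) -> dict:
    """Take the mode as the paradigm and report the lines that deviate from it."""
    # most_common(1) returns the mode; if two counts tie, the one seen first wins.
    norm = Counter(feet_counts).most_common(1)[0][0]
    # Line numbers start at 1, as they would in a printed poem.
    deviating = [n for n, count in enumerate(feet_counts, start=1) if count != norm]
    return {
        "norm": norm,
        "deviating_lines": deviating,
        "deviation_rate": len(deviating) / len(feet_counts),
    }

# Invented counts: four lines, one of them a foot short.
report = line_length_report([5, 5, 4, 5])
print(report)
```

The adherence rate is simply one minus the deviation rate.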

After recognizing the foot as a unit, the computer can calculate what patterns of data each foot contains.

Computer tasks, cont’d.

•Identify the most common foot type

•Identify markings within foot boundaries

•Compare markings to foot dictionary to identify type
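A foot dictionary can be a plain lookup table. The sketch below assumes that x marks an unstressed syllable and / a stressed one; the table covers only a handful of common feet and the names are illustrative.

```python
# Assumed convention: x = unstressed syllable, / = stressed syllable.
FOOT_DICTIONARY = {
    "x/": "iamb",
    "/x": "trochee",
    "xx/": "anapest",
    "/xx": "dactyl",
    "//": "spondee",
    "xx": "pyrrhic",
}

def identify(foot: str) -> str:
    """Look up the markings inside one foot, ignoring any spaces."""
    return FOOT_DICTIONARY.get("".join(foot.split()), "unknown")

line = "x / |x /|xx / | x / |x /"
foot_types = [identify(foot) for foot in line.split("|")]
print(foot_types)
```

Anything missing from the table comes back as "unknown", which flags feet that need a human decision.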

These tasks identify each line as a unit composed of one or more feet.

x / |x /|xx / | x / |x /

Sir Walter Vivian all a summer's day

(iambic pentameter with third foot anapestic substitution)

Still more computing tasks!

•Identify the most common foot type within a poem

•Supply a report on feet (by line and foot number) that deviate

•Calculate rate of deviation/adherence

•Mode = paradigm
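Once every foot has a type, the same mode-and-deviation logic applies one level down. The sketch below assumes the two sample lines have already been identified foot by foot, following the markings shown on the earlier slide; the variable names are illustrative.

```python
from collections import Counter

# Foot types for the two sample lines, following the slide's markings.
poem = [
    ["iamb", "iamb", "anapest", "iamb", "iamb"],
    ["trochee", "trochee", "iamb", "iamb", "iamb"],
]

# The most common foot type in the poem is taken as the paradigm.
all_feet = [foot for line in poem for foot in line]
norm = Counter(all_feet).most_common(1)[0][0]

# Report each deviating foot by line number and foot number (both start at 1).
deviations = [
    (line_no, foot_no, foot)
    for line_no, line in enumerate(poem, start=1)
    for foot_no, foot in enumerate(line, start=1)
    if foot != norm
]
deviation_rate = len(deviations) / len(all_feet)

print(norm)
print(deviations)
print(deviation_rate)
```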

Just as the feet contain patterns, the lines contain patterns that can be analyzed as well.

Still more computing tasks!

•Report on types of deviations arranged by most to least common

•Information should include location (line/foot number), as well as prevalence of substitution type
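A sketch of such a report, assuming deviations have already been collected as (line number, foot number, foot type) entries. The three entries correspond to the sample lines' markings; the report layout is an invented illustration.

```python
from collections import defaultdict

# (line number, foot number, substituted foot type)
deviations = [(1, 3, "anapest"), (2, 1, "trochee"), (2, 2, "trochee")]

# Collect the locations of each substitution type.
locations = defaultdict(list)
for line_no, foot_no, foot_type in deviations:
    locations[foot_type].append((line_no, foot_no))

# Arrange from most to least common, with prevalence as a share of all deviations.
total = len(deviations)
report = sorted(
    (
        {"type": t, "count": len(locs), "share": len(locs) / total, "locations": locs}
        for t, locs in locations.items()
    ),
    key=lambda entry: entry["count"],
    reverse=True,
)

for entry in report:
    print(entry)
```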

Deviations and their placement within each line and each poem should display certain patterns unique to each author (I hope!)

Current status: I’m investigating using the Natural Language Toolkit to tokenize each foot and to establish syllables, feet, and lines as a unique hierarchy.

Applicable Values

•Iterative development

•Failure as valuable

•Collaboration

If you are thinking about your data, and the tasks that you need to accomplish, then it’s easier to determine what sort of language or platform your project needs.

There are countless tutorials, online courses, etc., for almost any programming language or platform.

(We’re giving you a cheat sheet, too, and http://www.dmdh.org is your friend. So is Google.)

Learning them can be a slow process, especially at first.

However, knowing what tasks you’re working towards makes it easier to understand the purpose of the introductory lessons.

It’s also easier to think about how the first rules you learn for any language or platform might affect your goals.

And now, it’s your turn...

For this activity, we recommend that you pair up, or form small groups to work together.

Group Activity

•What do you need to do with your data?

•What units might that data exist in?

•What categories do you need to create?

•What relationships need to exist between the units and categories?

Spring Workshops!

•Project Ideation and Development

•April 5th and April 26th (advance registration for DMDH participants at the end of Winter Quarter)

DMDH content is developed by Paige Morgan, Sarah Kremen-Hicks, and Brian Gutierrez, with generous support from the Simpson Center for the Humanities at the University of Washington.

Content is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Please contact Paige at [email protected] with questions.
