demystifying digital humanities: winter 2014 workshop #2: programming on the whiteboard
DESCRIPTION
Slides for the second workshop on programming in digital humanities through the University of Washington's Demystifying Digital Humanities project.
TRANSCRIPT
Winter 2014: Session #2
Programming on the Whiteboard
(Paige Morgan, Sarah Kremen-Hicks, Brian Gutierrez)
Previously, at DMDH...
•The work of creating usable data
•Forms that this data might take:
•markup language
•spreadsheets
Workshop #2
•Caveat Curator (challenges of working with data)
•Programming on the whiteboard, i.e., conceptualizing the specific steps that you need to take to accomplish your goals
Why this focus on data?
•Understanding your data, and your intended actions, is a key skill for working with any programming language or platform.
•This is true whether you are the programmer or whether you are working with professional programmers.
Programming languages are like human languages in that they both have phrases, patterns, and
rules.
Programming languages are unlike human languages in
that they aren’t for
communicating with people.
They are also unlike human
languages in that every programming utterance does something, i.e.,
causes an action to occur.
You can get used to patterns – even unfamiliar
ones.
The shift is in getting used to thinking in
terms of every single action.
Our subject matter today is all actions that you’ll need to think about before you work with...
Image: Josh Lee, @wtrsld, via Twitter, January 2014.
Even when you’re just experimenting, you need to prep your
data.
You may know your dataset in detail already, from your research -- but your
computer is concerned with
different levels of detail.
Becoming aware of those levels of
detail is not only helpful for your project ideas...
...it’s also a useful skill for working with programming
languages (where a stray /> or ; can break your program/website).
Caveat Curator
Data only works if your computer can
read it.
But my data is just text!
(Isn’t that easy?)
(Remember, your computer is fairly
stupid).
Formatted text is
often full of text your
computer can’t parse correctly.
The┘re┘sÜlt ís that yoÜr te┘xt
might come┘ oÜt looking
like┘this
whe┘n yoÜ ope┘n it in a
programming e┘nvironme┘nt.
So you need to
convert it to plain text.
(without any of the fancy details encoded in MS Word fonts.)
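A minimal sketch of how this garbling happens, assuming the text was saved as UTF-8 and then opened as Windows-1252 (one common mismatch among several). The sample phrase and the variable names are ours, chosen for illustration:

```python
# A phrase with an accented letter and a "smart" apostrophe,
# as a word processor might produce it.
original = "Brontë’s café"

# Save it as UTF-8, then (wrongly) read the bytes back as Windows-1252.
raw_bytes = original.encode("utf-8")
garbled = raw_bytes.decode("cp1252")
print(garbled)

# Reading the same bytes with the matching encoding recovers the text.
repaired = raw_bytes.decode("utf-8")

# Optionally, swap smart punctuation for plain ASCII equivalents.
PLAIN = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
plain = repaired.translate(PLAIN)
print(plain)
```

The point is that the bytes never changed. Only the computer's assumption about how to read them did.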
But even that can produce unexpected
errors.
Maybe you want to work with sailing data and ports of
call:
The ship you’re interested in leaves the Ivory Coast for
St. Helena...
But when you create your map, you get
this:
The latitude/longitude coordinate is the significant datum.
The city name is just the human-readable
component.
Each datum needs to be unique.
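A small sketch of why the name alone is not enough, assuming a hypothetical gazetteer in which two real places share the name "St. Helena". The coordinates are approximate, and the field names are ours:

```python
# Hypothetical gazetteer: two different places, one shared name.
# Coordinates are approximate and for illustration only.
gazetteer = [
    {"name": "St. Helena", "region": "South Atlantic island", "lat": -15.97, "lon": -5.71},
    {"name": "St. Helena", "region": "California, USA", "lat": 38.51, "lon": -122.47},
]

# Keyed by name: the second entry silently overwrites the first.
by_name = {}
for place in gazetteer:
    by_name[place["name"]] = place

# Keyed by coordinates: each place keeps its own entry.
by_coords = {(place["lat"], place["lon"]): place for place in gazetteer}

print(len(by_name))    # 1
print(len(by_coords))  # 2
print(by_name["St. Helena"]["region"])  # California, USA
```

A lookup by name returns whichever entry was stored last, which may not be the port your ship actually visited.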
Figuring out what sort of
unique configuration will work best involves at least some
experimentation.
To experiment effectively, you’ll want to keep careful
records.
If you develop categories of
information, you’ll want to keep a
record of what each category means, and what its limits
are.
Cleaning and structuring your
data is a foundation issue that changes, depending on the
available format of your data.
What if your data is crowdsourced?
You can require a particular format for
submissions
You can even put programmatic limits
on the formats available for submission
But in the end, you’re still going to need to scrub and/or
format.
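As a sketch of what programmatic limits might look like, here is a check on one submitted row. The field names (place, date, lat, lon), the date format, and the sample rows are all assumptions made for illustration:

```python
from datetime import datetime

def validate_submission(row):
    """Return a list of problems found in one submitted row (empty if none)."""
    problems = []
    if not row.get("place", "").strip():
        problems.append("place is empty")
    try:
        datetime.strptime(row.get("date", ""), "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("date is not in YYYY-MM-DD format")
    try:
        lat = float(row.get("lat", ""))
        lon = float(row.get("lon", ""))
    except (TypeError, ValueError):
        problems.append("coordinates are not numbers")
    else:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("coordinates are out of range")
    return problems

# Hypothetical submissions.
good = {"place": "Example Port", "date": "1805-03-07", "lat": "10.5", "lon": "-20.25"}
bad = {"place": " ", "date": "March 7, 1805", "lat": "north", "lon": "-20.25"}

print(validate_submission(good))  # []
print(validate_submission(bad))
```

A row can pass every check and still be wrong (a real-looking coordinate for the wrong town), which is why the scrubbing step does not go away.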
This is true even for data from supposedly
reputable sources, like government or
media organizations.
Example: Doctor Who Villains dataset
http://tinyurl.com/doctorwhovillains
This step is no fun!
But it’s absolutely necessary.
What does a baby computer call his father? “Data.”
Break!
Working with “little data”:
GIS and the Spatial Turn
GIS technology has paved the way for analyzing qualitative data associated with cultural experiences
“A good map is worth a thousand words, cartographers say, and
they are right: because it produces a thousand words: it
raises doubts, ideas. It poses new questions, and forces you
to look for new answers.”
(Moretti 1998, 3–4)
Literary texts are filled with
subjective spatial data: an author or
character's articulation of geographically
located dwellings, urban and rural
landscapes, as well as performance spaces
Project: Mapping William Wordsworth's
Conspicuous Consumption in The
Prelude
(Brian R. Gutierrez)
Objective: to map the visual culture events referenced in Wordsworth’s autobiographical poem The Prelude (as well as the ones not referenced)
Problem to solve: Prove that literary galleries, specifically John Boydell’s “Shakespeare Gallery,” shaped the dramaturgical choices in the only play written by Wordsworth. He reads Shakespeare not through a personal copy of the play, but through the visual and performative texts of the time
Data: place-names, indirect references,
and all non-referenced visual cultural events
Access to data: Project Gutenberg, digital archive of British newspapers and periodicals
What to do with that data?
Map it!!
First data set: Literary spatial articulations
Wordsworth mentions the following place names and references:
"Oh wonderous power of words, how sweet they are / According to the meaning which they bring-- / Vauxhall and Ranelagh, I then had heard / Of your green groves and wilderness of lamps, / Your gorgeous ladies, fairy cataracts, / And pageant fireworks" (119-125)
"Half-rural Sadler's Wells" (267)
First, I need to know what and where these places were in order to identify them as
spatial data
Ex: Vauxhall and Ranelagh
Second, if I'm interested in visual cultural experiences, I need to identify what kind of event occurred there: gallery, play, etc.
Third, how would I access the data? Answer: place-names in a book are not under any copyright.
However, if I wanted to display passages from the text when a viewer clicks on a place name, I would have to think about copyright. The text is on Project Gutenberg, so that's covered.
Fourth, I would have to locate any indirect reference to visual cultural phenomena.
Ex: Wordsworth mentions two actresses by name: Mary Robinson and Sarah Siddons.
Since I cannot map a person, I need to investigate which plays they were in and at which theaters during that moment of his life (it's an autobiography)
Fifth, I need to research what special events were occurring at other places he mentions. For that, I
look to The Times (newspapers) and various
periodicals.
Sixth, because I am going to create a
map using ArcGIS, I need to put my data
in an Excel spreadsheet so that it can be read by the
program.
What is the relationship between
the data?
Analyze the qualitative data
Humanist skill = DHumanist skill
Programming on the whiteboard involves
looking at the categories of
information, and thinking about how they interact.
Categories
•Place names
•Poetic lines
•Genre of visual/cultural event
•Spatial data (latitude/longitude)
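One way to sketch these categories as spreadsheet columns, assuming a CSV file (which spreadsheet programs such as Excel can open). The column names and genre labels are our assumptions, and the coordinates are rough modern approximations added only to fill the column. The place names and line numbers come from the passages quoted earlier:

```python
import csv
import io

# Hypothetical column names, one per category.
fieldnames = ["place_name", "poetic_lines", "event_genre", "latitude", "longitude"]

# Genre labels and coordinates are illustrative assumptions.
rows = [
    {"place_name": "Vauxhall", "poetic_lines": "119-125",
     "event_genre": "pleasure garden", "latitude": 51.49, "longitude": -0.12},
    {"place_name": "Ranelagh", "poetic_lines": "119-125",
     "event_genre": "pleasure garden", "latitude": 51.49, "longitude": -0.15},
    {"place_name": "Sadler's Wells", "poetic_lines": "267",
     "event_genre": "theatre", "latitude": 51.53, "longitude": -0.11},
]

# Write to an in-memory buffer; swap in open("places.csv", "w", newline="")
# to produce a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row is one place and each column is one category, so the relationships between categories are visible before any mapping software is involved.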
Return to the source of original data—the
literary text—to examine how the
author is describing these phenomena
Why use ArcGIS?
Benefits of ArcGIS
•It allows the overlay of historical maps
•Trainings were available and accessible (through DHSI and UW courses)
•As a software program, ArcGIS is established enough to be considered robust
•Available through the UW software suite
Disadvantages of ArcGIS
•Available only for PCs
•Proprietary file format (even if input data is open-access, the end result is not)
•Available only on an annual subscription model (and prohibitively expensive for scholars without campus-granted access)
In Franco Moretti’s Atlas of the European
Novel 1800-1900 (1998), he calls for
a “literary geography,”
predicated on the creation of “readerly maps” and the use of
those maps as analytical tools.
Caveats?
The pursuit of mapping data may exclude complex
social spaces (e.g., gendered domestic environments)
Caveats?
Cartographical representations should not be
divorced from their primary texts
Project: Visualizing Prosody
(Sarah Kremen-Hicks)
x / |x /|xx / | x / |x /
Sir Walter Vivian all a summer's day
/ x | / x | x / | x / | x /
Gave his broad lawns until the set of sun
Marking up a poem for metrical scansion is encoding it with
data.
What can a computer do with that data?
Computers are good at counting things –
like iambs.
Is it possible to predict deviations from a metrical norm based on author or
lyric classification?
Will authors show a tendency for
particular types of metrical
substitution?
Prepping the Data
•For proof of concept, start with one author (Alfred, Lord Tennyson)
•Get Tennyson’s poems from Project Gutenberg
•Hand-mark representative poems for prosody
Programming on the Whiteboard
What should the computer do?
Computer tasks
•Count feet per line
•Recognize | as a foot boundary
•Recognize carriage return as a line boundary
•Supply foot boundaries at beginning/end of lines
•Count the number of areas contained within foot boundaries for each line
These steps involve recognizing each metrical foot as a unit that contains
particular accentual-syllabic data.
x / |x /|xx / | x / |x /
Sir Walter Vivian all a summer's day
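The first set of tasks can be sketched as follows, assuming the hand-marked scansion is typed as plain text with x for unstressed, / for stressed, and | between feet. The function names are ours, and this is one possible approach rather than the project's actual code:

```python
def split_feet(scansion_line):
    """Split one line of scansion marks into feet, using | as the foot boundary."""
    # The start and end of the string act as the outer foot boundaries,
    # so they do not need to be typed in by hand.
    feet = [foot.replace(" ", "") for foot in scansion_line.split("|")]
    # Drop empty areas left by a stray leading or trailing bar.
    return [foot for foot in feet if foot]

def split_lines(scansion_text):
    """Treat each line break as a line boundary and split every line into feet."""
    return [split_feet(line) for line in scansion_text.splitlines() if line.strip()]

scansion = "x / |x /|xx / | x / |x /"
feet = split_feet(scansion)
print(feet)       # ['x/', 'x/', 'xx/', 'x/', 'x/']
print(len(feet))  # 5
```

Stripping the spaces inside each foot means uneven hand-typed spacing does not change the result.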
Computer tasks, cont’d.
•Identify the most common number of feet per line
•Supply a report on lines (by number) that deviate
•Calculate rate of deviation/adherence
•Mode = paradigm
After recognizing the foot as a unit, the
computer can calculate what patterns of data each foot contains.
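A sketch of the counting tasks, using the two marked lines from the slides plus one four-foot line that we invented so there is something to report as a deviation. Variable names are ours:

```python
from collections import Counter

scansion_lines = [
    "x / |x /|xx / | x / |x /",     # from the slides
    "/ x | / x | x / | x / | x /",  # from the slides
    "x / | x / | x / | x /",        # hypothetical line, added for illustration
]

def count_feet(line):
    """Count the non-empty areas between | foot boundaries."""
    return len([foot for foot in line.split("|") if foot.strip()])

counts = [count_feet(line) for line in scansion_lines]

# Mode = paradigm: the most common number of feet per line.
paradigm = Counter(counts).most_common(1)[0][0]

# Report deviating lines by line number (starting from 1).
deviating = [number for number, count in enumerate(counts, start=1) if count != paradigm]

adherence = (len(counts) - len(deviating)) / len(counts)

print(counts)     # [5, 5, 4]
print(paradigm)   # 5
print(deviating)  # [3]
print(round(adherence, 2))  # 0.67
```

If two line lengths tie for most common, `most_common` returns the one seen first, so a real project would need to decide how to handle ties.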
Computer tasks, cont’d.
•Identify the most common foot type
•Identify markings within foot boundaries
•Compare markings to foot dictionary to identify type
These tasks identify each line as a unit composed of one or
more feet.
x / |x /|xx / | x / |x /
Sir Walter Vivian all a summer's day
(iambic pentameter with third foot anapestic substitution)
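The foot dictionary could be sketched as a simple lookup table. The table below covers only six common feet and is an assumption about what such a dictionary might contain:

```python
# A minimal foot dictionary: x = unstressed, / = stressed.
FOOT_TYPES = {
    "x/": "iamb",
    "/x": "trochee",
    "//": "spondee",
    "xx": "pyrrhic",
    "xx/": "anapest",
    "/xx": "dactyl",
}

def classify_line(scansion_line):
    """Return the foot type for each foot in one line of scansion marks."""
    feet = [foot.replace(" ", "") for foot in scansion_line.split("|")]
    return [FOOT_TYPES.get(foot, "unknown") for foot in feet if foot]

line = "x / |x /|xx / | x / |x /"
print(classify_line(line))
# ['iamb', 'iamb', 'anapest', 'iamb', 'iamb']
```

Any marking not in the table comes back as "unknown", which flags feet that need a human decision or a bigger dictionary.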
Still more computing tasks!
•Identify the most common foot type within a poem
•Supply a report on feet (by line and foot number) that deviate
•Calculate rate of deviation/adherence
•Mode = paradigm
Just as the feet contain patterns, the
lines contain patterns that can be analyzed as well.
Still more computing tasks!
•Report on types of deviations arranged by most to least common
•Information should include location (line/foot number), as well as prevalence of substitution type
Deviations and their placement within each line and each poem should display certain patterns
unique to each author (I hope!)
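Putting the pieces together, here is a sketch of the deviation report, run on the two marked lines from the slides. The foot dictionary and all names are our assumptions, and two lines are far too few to say anything about an author:

```python
from collections import Counter

FOOT_TYPES = {"x/": "iamb", "/x": "trochee", "//": "spondee",
              "xx": "pyrrhic", "xx/": "anapest", "/xx": "dactyl"}

def classify_line(scansion_line):
    """Return the foot type for each foot in one line of scansion marks."""
    feet = [foot.replace(" ", "") for foot in scansion_line.split("|")]
    return [FOOT_TYPES.get(foot, "unknown") for foot in feet if foot]

poem = [
    "x / |x /|xx / | x / |x /",     # Sir Walter Vivian all a summer's day
    "/ x | / x | x / | x / | x /",  # Gave his broad lawns until the set of sun
]

typed_lines = [classify_line(line) for line in poem]

# Mode = paradigm: the most common foot type in the poem.
foot_counts = Counter(foot for line in typed_lines for foot in line)
norm = foot_counts.most_common(1)[0][0]

# Every deviation, with its location as (line number, foot number, type).
deviations = [
    (line_no, foot_no, foot)
    for line_no, line in enumerate(typed_lines, start=1)
    for foot_no, foot in enumerate(line, start=1)
    if foot != norm
]

# Substitution types, most to least common.
ranking = Counter(foot for _, _, foot in deviations).most_common()

print(norm)        # iamb
print(deviations)  # [(1, 3, 'anapest'), (2, 1, 'trochee'), (2, 2, 'trochee')]
print(ranking)     # [('trochee', 2), ('anapest', 1)]
```

The same report run over many poems would give the per-author counts that the prediction question depends on.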
Current status: I’m investigating using the Natural Language Toolkit to tokenize each foot; and to
establish syllables, feet, and lines as a unique hierarchy.
Applicable Values
•Iterative development
•Failure as valuable
•Collaboration
If you are thinking about your data, and the tasks that you need to accomplish, then it’s easier to determine what sort
of language or platform your project
needs.
There are countless tutorials, online courses, etc., for
almost any programming language or platform.
(We’re giving you a cheat sheet, too; and http://www.dmdh.org is
your friend. So is Google.)
Learning them can be a slow process,
especially at first.
However, knowing what tasks you’re working towards makes it
easier to understand the purpose of the
introductory lessons.
It’s also easy to think about how the first rules you learn for any language or platform might affect
your goals.
And now, it’s your turn...
For this activity, we recommend that you pair up, or form
small groups to work together.
Group Activity
•What do you need to do with your data?
•What units might that data exist in?
•What categories do you need to create?
•What relationships need to exist between the units and categories?
Spring Workshops!
•Project Ideation and Development
•April 5th and April 26th (advance registration for DMDH participants at the end of Winter Quarter)
DMDH content is developed by Paige Morgan, Sarah Kremen-Hicks, and Brian Gutierrez, with generous support from the Simpson Center
for the Humanities at the University of Washington.
Content is available under a Creative Commons Attribution-NonCommercial 3.0 Unported
License.
Please contact Paige at [email protected] with questions.