Teaching Subtitles as Historiographic Research
Kevin L. Ferguson
I see this question on Twitter: “Someone shouting, ‘We got company!’ is a classic action
movie cliché. Does anyone know the earliest film to use that line?” My immediate thought: a
perfect question for a digital humanities approach to cinema and media studies! My students
would know how to answer this, and they would also know how the question needs clarification.
I often see semi-serious questions about film history like this on Twitter: what are the
anti-feminist semiotics of pizza versus Chinese takeout as they coded female protagonists in 1980s
popular films? What television show was the first to begin with the now-formulaic “Previously
on . . .” or “Last week on . . .”? Questions like these address themselves to a new media studies
informed by digital humanities’ search-oriented promise of direct access to large datasets, one
with significant pedagogical opportunities to teach students the difficult skill of refining suitable
research questions. On Twitter, users answer these questions with educated guesses based on
their film history knowledge, but the answers are often little more than stabs at trivia, relying on
traditional forms of knowledge-making that require either accessing a store of previously
remembered information or searching reference works like IMDb or Wikipedia.
While questions about action clichés or movie pizza might seem to be idle ones, the
subtleties in answering them get students to think about fundamental problems with traditional
media studies’ reliance on the personal investigative mode and how a corpus-driven approach
might be able not only to augment traditional methods but also to do things that traditional
methods cannot. For example, one person proposed Han Solo in Star Wars (dir. George Lucas,
1977) as an example of the line “we got company,” while acknowledging it was unlikely that this
was the first appearance. This guess relies on personal memory and intuition, which we want to
encourage in students, while at the same time recognizing the limits of even a lifetime of
watching films. While Han Solo’s might arguably be the most memorable or significant or ironic
use of the cliché, this claim is of a different order than an answer to the objective historical
question posed. Thus, we might still teach students to answer traditional questions like “how has
the action genre been defined historically?,” but we can now better address those questions
with digital methods that answer specific research questions like “how has the color palette of
action movie posters evolved over time?,” “with what frequency have industry publications used
the word ‘action’ to categorize types of films?,” or “do action film screenplays have a notably
different structure than other genres?” In this way, students can get a richer sense of the kind of
film historiography described by Robert Sklar, who argues that “the practice of historiography is
fundamentally dialogic. Historical discourse is constantly being transformed through historians’
commentaries and critiques on the work of other past and present historians.”1 Thus, students
can see how film histories are written and rewritten by new digital methods and techniques.
Subtitles
One way students in my digital humanities courses are taught to pose and answer such
media history questions is by treating film and television subtitles as a textual archive for
research. Using free tools such as the command line and Voyant, students manipulate a large,
readily available corpus of film and television subtitles from OpenSubtitles.org (which, as of
September 2015, claimed to host 3,441,809 subtitles). Media texts are subtitled for a variety of reasons:
closed-captioning for hearing-impaired spectators, translations of foreign or accented speech,
and for use in public places where audio is muted or hard to hear. While celluloid, VHS, and
early DVD releases may have included subtitles that were irreversibly “burned” onto the moving
image itself, contemporary practice for digital television and video distribution is to include
subtitles as a separate track alongside the image, which viewers can turn on or off. A third
possibility, increasingly common online, is the “softsub,” a separate text, XML, or HTML file with
1 Robert Sklar, “Does Film History Need a Crisis?” Cinema Journal 44.1, 2004, 134.
subtitles specially marked with a timecode so as to align with specific digital releases or rips of
film and television shows. Softsub file formats like .srt, .sub, or .usf are popular since viewing
them is as simple as dropping them into the same folder as the video file. Unlike the encoded
graphics subtitles on commercial DVD or Blu-ray releases, which require special software to
extract, softsubs can be quickly converted to plain text and thus machine-read in ways that
are already familiar to digital humanists. In fact, the .srt files commonly found on
OpenSubtitles.org are already plain text and so require very little work to manage. Here is a
sample of the .srt file produced by the program SubRip, which uses optical character recognition
(OCR) to process on-screen graphics subtitles from a DVD. Note the subtitle number, timecode
in hour:minute:second,millisecond format, and the displayed text from the classic 80s horror film
A Nightmare on Elm Street.
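For reference, an .srt file follows this layout; the timecodes and dialogue below are invented for illustration, not the actual lines shown in the figure:

```
1
00:01:12,083 --> 00:01:15,500
An invented first subtitle,
wrapped onto two lines.

2
00:01:16,250 --> 00:01:18,900
A second invented subtitle.
```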
The separation of the subtitle from the image track creates a new paratext for media
scholars to investigate digitally. As with any paratext, we should not confuse these with the
actual film or television soundtrack. Some differences for students to keep in mind when
working with subtitles: we are not looking at a script or screenplay with actors’ names, locations,
or stage directions; we are not looking at a transcription that includes sound effects (as with
closed-captioning); we have no sense of nuance (dialogue spoken ironically or contextualized
with hand or facial gestures); and there is no soundtrack to signal emotional content.
Furthermore, subtitles are restricted by conventions that fit their purpose: they must abbreviate
spoken dialogue to fit on screen (using two lines of about 40 characters each), they must
indicate when it is unclear who is speaking, they use ellipses to show when a character’s
speech lasts more than two lines, and they must indicate off-screen speech. But even with
these limitations, subtitles offer a “good enough,” readily available solution to a large set of
otherwise unanswerable research questions.
Method
There are some simple web interfaces for searching subtitles: IMDb allows for searching
a limited database of movie quotations, Subzin offers a larger database, and even appending
“srt” to a Google search returns a set of useful results. But because of their relatively small
selection and user-generated nature, these sites mainly confirm for students how important it is
to consider the source, quality, and limitations of their corpus. The next step is to ask students
to identify and prepare their own corpus using OpenSubtitles.org, for example the subtitles of 2–
3 seasons of a television show; the films of a particular director or writer; or a representative set
of genre, historical, or national films. Here are [three such sample subtitle corpora] [link]: twenty-
five episodes of the first season of the television show The Golden Girls, a dozen films written
by Nora Ephron, and a dozen horror films produced between 1980 and 1989.
Following advice on a [handout like this one] [link], students download the appropriate
subtitle files, use the command line to clean out the timecodes and rename the files, and create
a compressed folder of their corpus for later analysis. For beginning text analysis, I ask students
to use the program Voyant,2 a set of textual analysis tools that can be accessed either via the
internet or through a browser-based server downloaded for offline use. I point students to
Voyant’s own excellent tutorial workshop documentation. As an example, loading season 1 of
The Golden Girls into Voyant’s default skin reveals a number of windows showing information
about word count, word frequencies, which words and texts (in this case, episodes) are most
unique and which most common, and sparklines showing the trends of individual words over
time (in this case, over the season).
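The command-line cleaning step mentioned earlier can be sketched as follows. This is only one plausible approach (the demo file, regular expression, and output names are mine, not necessarily the handout’s):

```shell
# Make a throwaway .srt file to demonstrate on (invented content).
printf '1\n00:00:52,400 --> 00:00:55,800\nPicture it: Sicily, 1922.\n\n' > demo.srt

# For every .srt file, drop the index and timecode lines, keeping only
# the dialogue text in a matching .txt file.
for f in *.srt; do
  grep -v -E '^[0-9]+[[:space:]]*$|-->' "$f" > "${f%.srt}.txt"
done

cat demo.txt   # prints: Picture it: Sicily, 1922.
```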
2 There is an improved beta version of Voyant 2.0 here: beta.voyant-tools.org
Users can click on any word to see it revealed in each tool or can search for specific words. For
example, we might wonder if the four central characters Rose, Blanche, Dorothy, and Sophia
tended to be named equally in the show or if some characters were mentioned more than others
in individual episodes or over the course of the season. The “Word Trends” sparkline in the
upper right shows us at a glance that while Sophia is consistently but rarely mentioned, the
other three tend to dominate individual episodes, with exceptions like episode 8 where all four
are relatively unmentioned. We might next use the Bubblelines tool to see what these trends
look like in individual episodes, noting for instance how the overlap of names marks moments of
conversation, interpersonal conflict, or off-screen speech, as in conflict-heavy episode 5 “The
Triangle.”
Voyant works well to allow students to visually wrangle the otherwise invisible dialogue
from media texts. However, for broader questions about media history, like the ones I began
with, we need a much larger corpus. For this, students can work directly with the English-
language subtitle corpus files from here: about 310,000 gzipped XML files in a 6.5GB folder.3
This can be unwieldy to work with, but since the corpus is organized in subfolders by year,
portions of it can be assigned. Students might use the command line to search for phrases or
words, e.g.:
find . -iname '*.gz' -exec zgrep "got company" {} +
3 Jörg Tiedemann, “Parallel Data, Tools and Interfaces in OPUS” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC, 2012).
which returns results that look something like:
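Roughly speaking, because many files are searched at once, zgrep prefixes each match with the file’s path, and since the corpus folders are named by year, the hits read in rough chronological order. The lines below are an invented sketch of that shape, not real results:

```
./1940/12345/0123456.xml.gz: ... we got company ...
./1943/67890/0789012.xml.gz: ... we got company ...
```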
Ignoring the duplicate entries caused by multiple subtitles of different video releases, we see
results for our search phrase in chronological order. No doubt your local computer scientist can
suggest other ways to find phrases in files using the command line.
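One such alternative: zgrep passes its options through to grep, so the search can be narrowed to particular decades via the year-named subfolders and reduced to a list of matching files. The demo folder below is invented to show the shape of the result:

```shell
# Create a tiny stand-in for one year-folder of the corpus (invented content).
mkdir -p 1940/demo
printf 'Looks like we got company.\n' | gzip > 1940/demo/sample.xml.gz

# -i: case-insensitive match; -l: print only the names of matching files.
find ./194* -iname '*.gz' -exec zgrep -il "got company" {} +
# prints: ./1940/demo/sample.xml.gz
```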
And the answer to the “we got company” question? Maybe Jack Conway’s 1940 Boom
Town, a Clark Gable–Spencer Tracy comedy, but students quickly discover that we first need to
redefine what “action genre” even means before we are able to speak with any certainty. While
“we got company” is a cliché of action movies today, there are numerous early examples of the
phrase’s ambiguous use that force students to pay attention to the changing context of spoken
dialogue. For instance, one early use of the phrase is in Applause (dir. Rouben Mamoulian,
1929):
Come on in, boys and girls, we got company.
Gee, Kitty, the baby looks great.
Oh, isn’t it cute?
It looks just like Kitty.
[embed video?: https://youtu.be/7jZCNg6_ulk?t=7m48s ]
This is certainly not an action film and “we got company” does not have the same euphemistic
overtones as in later uses, but the menace of this scene is apparent and later the film in fact
parses the meaning of the word “company”:
Just looking for somebody to talk to.
Sailors don’t generally have much trouble finding company.
Oh, I don’t want the kind of company you mean.
Listen, all sailors ain’t a bunch of bums.
Finding an answer to what seems like a straightforward question instead requires us to consider
even more fundamental questions, like what defines the action genre, when do phrases become
clichés, or why is the “first occurrence” privileged in historiography. In this way, using freely
available and simple computing tools allows students to focus on the process and method of
archival research: while there is an important aspect of investigative play with the tools, the
emphasis is on refining a research question through experimentation.
Bio
Kevin L. Ferguson is Assistant Professor at Queens College, City University of New
York, where he directs Writing at Queens and teaches digital humanities, film adaptation,
college writing, and contemporary literature. His forthcoming book, Eighties People: New Lives
in the American Imagination (Palgrave), examines new cultural figures in the American 1980s.
His next book, Aviation Cinema, tracks the history of flight as it is mirrored in a century of
popular film. His film writing has appeared or is forthcoming in Cinema Journal, Camera
Obscura, [in]Transition, Criticism, Jump Cut, Scope, The Journal of Medical Humanities, The
Journal of Dracula Studies, Bright Lights Film Journal, and the collection Satanic Panic: Pop-
Cultural Paranoia in the 1980s. His current digital humanities + media studies work argues for a
“digital surrealism” by using scientific image analysis software to treat cinema texts as slice-
based volumes that transform the dimension of time into a third spatial dimension.