dmds winter workshop 2 slides
TRANSCRIPT
Last week...
•The work of creating usable data
•Forms that this data might take:
•Markup languages
•Tables/spreadsheets (MySQL & relational databases)
•Graph databases (RDF/Linked Open Data)
This week:
•Caveat Curator (challenges of working with data)
•Programming on the Whiteboard, i.e., conceptualizing the specific steps that you need to take to accomplish your goals
Goals/Takeaways
•A better understanding of the workflow for dealing with data
•Greater ability to talk about what you’re trying to do
Why this focus on data?
•Understanding your data, and your intended actions, is a key skill for developing any digital project (big or small).
•You may have one big project – but your data may support several small/intermediary projects.
Programming languages (and digital apps) are
like human languages in that they both have
phrases, patterns, and rules.
They are also unlike human languages in
that every programming
utterance does something, i.e., causes
an action to occur.
You may know your dataset in detail
already, from your research -- but your
computer is concerned with different levels of
detail.
...it’s also a useful skill for working with
programming languages.
(where a stray /> or ; can break your program/website)
The┘re┘sÜlt ís that yoÜr te┘xt
might come┘ oÜt looking
like┘this
whe┘n yoÜ ope┘n it in a
programming e┘nvironme┘nt.
So you need to convert it to plain text
(without any of the fancy formatting details encoded in MS Word fonts).
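The conversion to plain text can be sketched in code. This is a minimal illustration (the replacement table is an example, not an exhaustive list of what a word processor embeds): it swaps common "smart" characters for their plain-text equivalents so a programming environment won't choke on them.

```python
# Minimal sketch: normalize common word-processor characters to plain text.
# The table below is illustrative, not exhaustive.
REPLACEMENTS = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
    "\u00a0": " ",   # non-breaking space
}

def to_plain_text(text: str) -> str:
    """Replace each 'fancy' character with its plain equivalent."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text
```

In practice you would also save the file with an explicit encoding (UTF-8) from your editor; the function above only handles the character substitutions.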
Figuring out what sort of unique
configuration will work best involves
at least some experimentation.
If you develop categories of
information, you’ll want to keep a record of what each category means, and what its
limits are.
Cleaning and structuring your data is a foundational task, and it changes depending on the available format of your data.
This is true even for data from supposedly reputable sources, like government or media
organizations.
If you are thinking about your data, and the tasks
that you need to accomplish, then it’s
easier to determine what sort of language or
platform your project needs.
There are countless tutorials, online courses,
etc., for almost any programming language or
platform.
(You can also ask for a Sherman Centre consultation to figure out what you need to learn.)
However, knowing what tasks you’re working
towards makes it easier to understand the
purpose of the introductory lessons.
It’s also easy to think about how the first rules
you learn for any language or platform
might affect your goals.
Pseudocode
•Used by programmers to break down a complex task into manageable steps
•Easily adaptable for use by non-programmers
Pseudocode Example (Visible Prices)
• Computer has a file that contains prices from different texts.
• Computer must know that each price amount is connected with an object, and with a bibliographical record.
• Users can input a price amount, and computer will retrieve all objects that match the price, and display them to the user, along with bibliographical information.
• (More complex): Computer is able to retrieve prices linked with certain categories (clothing, food, etc.)
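The pseudocode above can be sketched as a small program. Everything here is invented for illustration (the field names, the sample records, and the `lookup` function are hypothetical, not the actual Visible Prices implementation): it shows how a price query, and the more complex category filter, might map onto code.

```python
# Hypothetical sketch of the Visible Prices pseudocode.
# Sample records are placeholders, not real project data.
PRICES = [
    {"amount": 5, "object": "a pair of gloves", "source": "sample text A", "category": "clothing"},
    {"amount": 5, "object": "a week's lodging", "source": "sample text B", "category": "housing"},
    {"amount": 12, "object": "a leg of mutton", "source": "sample text C", "category": "food"},
]

def lookup(amount, category=None):
    """Return every (object, bibliographical source) pair matching a price,
    optionally filtered by category (the 'more complex' step above)."""
    return [
        (p["object"], p["source"])
        for p in PRICES
        if p["amount"] == amount and (category is None or p["category"] == category)
    ]
```

Each bullet of the pseudocode corresponds to a piece of the sketch: the file of prices becomes `PRICES`, the price-to-object-to-record connection becomes the dictionary fields, and the user query becomes `lookup`.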
Social work + social change
• Recent history of academic social work in Canada; 1960s onward
• Interested in the ways in which academic social work has attempted to advance justice-oriented social change projects, and how political, cultural, and theoretical shifts have influenced this type of disciplinary imagination and work
• Related to disciplinary boundaries and methods and orthodoxies, and the social role of universities
MARC Record, back end

<?xml version="1.0"?>
<record xmlns="http://www.loc.gov/MARC21/slim"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
  <leader>00000cam a000001</leader>
  <controlfield tag="001">468966</controlfield>
  <controlfield tag="008">710913s1968vaub000 0 eng c</controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">a68007753 </subfield>
  </datafield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">Virginia. Univ. Libr.</subfield>
    <subfield code="b">eng</subfield>
    <subfield code="c">DLC</subfield>
    <subfield code="d">OCLCQ</subfield>
    <subfield code="d">CLU</subfield>
    <subfield code="d">OCLCO</subfield>
    <subfield code="d">OCLCF</subfield>
    <subfield code="d">OCLCQ</subfield>
  </datafield>
  <datafield tag="043" ind1=" " ind2=" ">
    <subfield code="a">n-us-va</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="4">
    <subfield code="a">HV98.V8</subfield>
    <subfield code="b">C46</subfield>
  </datafield>
  <datafield tag="082" ind1=" " ind2=" ">
    <subfield code="a">361/.9/755</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Cepuran, Joseph.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Public assistance and child welfare:</subfield>
    <subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">[Charlottesville]</subfield>
    <subfield code="b">Institute of Government, University of Virginia,</subfield>
    <subfield code="c">1968.</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">vii, 120 pages</subfield>
    <subfield code="c">28 cm</subfield>
  </datafield>
  <datafield tag="336" ind1=" " ind2=" ">
    <subfield code="a">text</subfield>
    <subfield code="b">txt</subfield>
    <subfield code="2">rdacontent</subfield>
  </datafield>
  <datafield tag="337" ind1=" " ind2=" ">
    <subfield code="a">unmediated</subfield>
    <subfield code="b">n</subfield>
    <subfield code="2">rdamedia</subfield>
  </datafield>
  <datafield tag="338" ind1=" " ind2=" ">
    <subfield code="a">volume</subfield>
    <subfield code="b">nc</subfield>
    <subfield code="2">rdacarrier</subfield>
  </datafield>
  <datafield tag="504" ind1=" " ind2=" ">
    <subfield code="a">Includes bibliographical references.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Public welfare</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Child welfare</subfield>
    <subfield code="x">Government policy</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="7">
    <subfield code="a">Child welfare</subfield>
    <subfield code="x">Government policy.</subfield>
    <subfield code="2">fast</subfield>
    <subfield code="0">(OCoLC)fst00854729</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="7">
    <subfield code="a">Public welfare.</subfield>
    <subfield code="2">fast</subfield>
    <subfield code="0">(OCoLC)fst01083250</subfield>
  </datafield>
  <datafield tag="651" ind1=" " ind2="7">
    <subfield code="a">Virginia.</subfield>
    <subfield code="2">fast</subfield>
    <subfield code="0">(OCoLC)fst01204597</subfield>
  </datafield>
  <datafield tag="710" ind1="2" ind2=" ">
    <subfield code="a">University of Virginia.</subfield>
    <subfield code="b">Institute of Government.</subfield>
  </datafield>
</record>
MARC 21 Format
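Because MARCXML is ordinary XML, a few lines of code can pull fields like the title (tag 245) or subjects (tag 650) out of a record. This sketch uses Python's standard library on a small fragment of the record above; a real project would more likely use a dedicated MARC library, and the `field_values` helper here is an illustration, not a standard API.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# A fragment of the record shown above, inlined for the example.
record_xml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Public assistance and child welfare:</subfield>
    <subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Public welfare</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
</record>"""

def field_values(record, tag, code):
    """Collect subfield values for a given MARC tag/subfield code."""
    return [
        sf.text
        for df in record.findall(f"{MARC_NS}datafield[@tag='{tag}']")
        for sf in df.findall(f"{MARC_NS}subfield[@code='{code}']")
    ]

record = ET.fromstring(record_xml)
title = " ".join(field_values(record, "245", "a") + field_values(record, "245", "b"))
subjects = field_values(record, "650", "a")
```

This is the sense in which structured "back end" data supports things to count: once the subject and title fields are addressable, tallying them across thousands of records is a loop.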
Things to count
• Social problems: child abuse, unemployment, inequality
• Concepts: mental hygiene, non-voluntary clients, culture of poverty, consciousness raising, privilege
• Sub-populations: immigrants, unwed mothers, the oppressed
• Institutions: work houses, shelters, detention, the non-profit industrial complex
• Interventions: motivational interviewing, case management, urban planning, life skills education, community organizing
• Types of social work: case work, radical social work, community development, clinical social work
• Social work books in Canadian libraries
Objective
To analyze the diagnostic chapters of five volumes of the DSM to determine whether the referring expressions used therein are dehumanizing and if so, determine if the usages have changed over time.
Problems
Traditional discourse analysis is done by hand and can be very time consuming.
Volumes of the DSM range from 494 pages to 991 pages.
Solutions
• Digital corpus analysis – computer software (R)
• Faster
• More efficient
• Can handle large amounts of data at once
• Data has to be prepared before it is ready to be used for digital analysis.
Preparation of Data
• Physical data must be converted to digital medium
• Steps:
– Permission
– Scanning to PDF
– OCR PDF
– Convert OCRed PDF to plain text
– Clean plain text
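The preparation steps can be sketched as a pipeline. All the function names and bodies below are placeholders (real scanning and OCR use external tools such as a sheet-fed scanner and OCR software like Tesseract, not pure Python); the point is only that each stage takes the previous stage's output.

```python
# Hypothetical pipeline of the preparation steps; names and bodies are
# placeholders standing in for external tools and manual work.
def scan_to_pdf(volume):
    return f"{volume}.pdf"                  # physical book -> PDF of page images

def ocr_pdf(pdf):
    return pdf.replace(".pdf", ".ocr.pdf")  # page images -> searchable text layer

def to_plain_text(ocr):
    return ocr.replace(".ocr.pdf", ".txt")  # drop layout/fonts, keep the words

def clean_text(txt):
    return txt                              # fix OCR errors (by hand or regex)

def prepare(volume):
    """Run one volume through every preparation stage in order."""
    return clean_text(to_plain_text(ocr_pdf(scan_to_pdf(volume))))
```

Writing the workflow down this way (even as pseudocode) makes it clear which stages are automated and which, like permission and cleaning, are human work.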
Permission
• Digital e-book copies of DSM are not available for any of the versions
• American Psychiatric Association holds copyright and is VERY protective
Scanning DSMs
• Physical copies of the DSMs need to be scanned into a digital format (in this case PDF)
• PDFs need to be converted to a text format that a computer can read, edit, and work with
OCR PDFs
Clean Plain Text Files
• Once you have OCRed plain text files, you need to make sure they are accurate
– Computers are only as good as their input
• If the data input is messy, the analysis will be messy
• Made files consisting of only the chapters for analysis
• Checked for and fixed any remaining OCR/Scanning errors
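OCR cleanup of this kind is often done with search-and-replace rules. The substitutions below are examples of the sort of fixes involved (rejoined hyphenations, collapsed whitespace, a character-confusion fix), not the actual list used in this project.

```python
import re

# Illustrative cleanup of common OCR artifacts; the rules are examples,
# not the project's actual fix list.
OCR_FIXES = [
    (r"-\n(\w)", r"\1"),              # rejoin words hyphenated across line breaks
    (r"[ \t]+", " "),                 # collapse runs of spaces/tabs
    (r"\bdiagn0sis\b", "diagnosis"),  # 0-for-o substitution (example)
]

def clean_ocr(text: str) -> str:
    """Apply each fix rule in order to the raw OCR output."""
    for pattern, repl in OCR_FIXES:
        text = re.sub(pattern, repl, text)
    return text.strip()
```

Rules like these catch systematic errors; a final read-through still catches the one-off mistakes.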
Now To The Project
• Come up with a list of referring expressions based on a visual scan through the DSMs
• Use R to narrow down the list to only the most frequent
– Narrows 10K+ unique words to a handful
• Use R to pull out all sentences with the terms in question
– Narrows down ~19K sentences to 655 for individual analysis
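The two narrowing steps above are standard corpus operations. The project used R; the sketch below is an equivalent Python illustration on a tiny invented corpus: first count word frequencies to find the common referring expressions, then extract every sentence containing a term of interest.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; the real corpus was the DSM chapters.
corpus = (
    "The patient reported symptoms. The patient improved. "
    "The subject was interviewed. Treatment continued."
)

# Step 1: word frequencies, to find the most frequent referring expressions.
words = re.findall(r"[a-z]+", corpus.lower())
freq = Counter(words)

# Step 2: pull out every sentence containing a term of interest.
sentences = re.split(r"(?<=[.!?])\s+", corpus)
hits = [s for s in sentences if "patient" in s.lower()]
```

On the real corpus, the same two moves reduce tens of thousands of words and sentences to a set small enough for hand analysis.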
Benefits of Digital Analysis
• This project still used some manual analysis
• Using digital technologies and corpora sped things up considerably
• Made it easier to break down large corpus to manageable parts
• Now have a corpus on which to do other projects in the future: prep-work already done
Working with Digital Corpora with R
• Pros
– Free and cross-platform
– Powerful, efficient, fast
– Capable of working with VERY large datasets
– Subsequent projects can be much faster, as code can be saved and built on or recycled
• Cons
– Code-based, command-line-style interface
– GIGO (garbage in, garbage out)
– Depending on project, input data may need substantial preparation
Summary
• Overall, 75% of the time spent doing this project was prepping the data
• Project took only 3-4 months to do, part-time
• Corpus analyzed totalled 1.08 million words from the DSM-III through the DSM-5
• Future projects based on this corpus will be much faster to do as well
• Digital technologies made this project feasible
• Project was much faster than if done by hand
Key skill
•Thinking flexibly about your data (and potential project)
•Are there portions of your dataset that could be extracted for use in a particular tool?
•How can you adjust your data in order to show it to people (and be better able to talk/write/present about your research interests)?
Group Activity
•What do you need to do with your data? (share, aggregate, combine…?)
•What units might that data exist in?
•What categories do you need to create?
•What connections need to exist between the units and the categories?
Next steps
•What’s the smallest version of your dataset possible? (useful for testing out tools)
•Possible tools to examine (as ways of presenting your data):
• Omeka (http://www.omeka.net)
• Scalar (http://scalar.usc.edu)
• Simile (http://www.simile-widgets.org)
• Google Fusion Tables (https://support.google.com/fusiontables/answer/2571232)
SCDS support for data wrangling
•Consultations: http://www.tinyurl.com/scds-consult
•Colloquium slots (opportunities to talk through your project plans with a supportive audience)
•Graduate fellowships (workspace and greater access to SCDS staff expertise)