integrated publishing and archiving of journals in...
TRANSCRIPT
![Page 1: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/1.jpg)
mPachIntegrated Publishing and Archiving
of Journals in HathiTrustSeth Johnson, Bryan Smith, & Kevin S. Hawkins
Michigan Publishing
![Page 2: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/2.jpg)
Overview
1. Overview of mPach, a package of tools for publication of born-digital journals in HathiTrust
2. Introduction to mPach’s Prepper interface
3. Technical discussion of mPach’s Norm utility for converting Word DOCX files to JATS XML
![Page 3: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/3.jpg)
Michigan Publishing
Michigan Publishing is the primary academic publisher of University of Michigan and is based in the University Library.
Michigan Publishing has long used a system called DLXS as its primary platform for online content, but we need an architecture that will scale better in order for us to continue to grow.
www.publishing.umich.edu
![Page 4: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/4.jpg)
What is HathiTrust?
Partnership of research libraries around the world
Shared digital repository certified to be preservation-quality with over 11 million digitized volumes (nearly 500 terabytes of data)
www.hathitrust.org
![Page 5: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/5.jpg)
Publishers and Archives
Publishers require flexibility to innovate. But archives need stability.
HathiTrust provides us with an infrastructure in which to provide long-term preservation and discoverability while allowing for innovative services to be built on top.
![Page 6: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/6.jpg)
Main design principle
Archiving happens as a byproduct of publication rather than after the fact.
![Page 7: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/7.jpg)
JATS and mPach
JATS was selected because of the increasing coalescence of the publishing industry around this open, non-proprietary standard.
Publishing (“blue”) tag set works for born-digital literature, with a constrained set of tags to render, unlike “green”. But unlike “orange” it also includes important metadata elements (in <front>).
![Page 8: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/8.jpg)
mPach overview
www.lib.umich.edu/mpach
![Page 9: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/9.jpg)
Prepper
Dashboard for administering a journal and putting manuscripts through the production process
Guides the conversion process from DOCX to JATS (using Norm)
Ruby on Rails application
![Page 10: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/10.jpg)
Prepper Article Prep (1 of 8)
![Page 11: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/11.jpg)
Prepper Article Prep (2 of 8)
![Page 12: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/12.jpg)
Prepper Article Prep (3 of 8)
![Page 13: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/13.jpg)
Prepper Article Prep (4 of 8)
![Page 14: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/14.jpg)
Prepper Article Prep (5 of 8)
![Page 15: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/15.jpg)
Prepper Article Prep (6 of 8)
![Page 16: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/16.jpg)
Prepper Article Prep (7 of 8)
![Page 17: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/17.jpg)
Prepper Article Prep (8 of 8)
![Page 18: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/18.jpg)
Article View in HathiTrust
![Page 19: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/19.jpg)
Journal View in HathiTrust
![Page 20: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/20.jpg)
Norm
Converts DOCX to JATS XML:
1. Parse DOCX XML
2. Internal Representation and Mapping
3. Create JATS XML and assets
![Page 21: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/21.jpg)
Norm Usage
Stand-alone command-line application
Input: DOCX or ODT file
Output:document_name.zip/
document_name.xml (JATS)assets/
image_1.pngimage_2.png
![Page 22: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/22.jpg)
Word Styles and Norm
![Page 23: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/23.jpg)
Norm Transformation ProcessGiven:
Word document
Configuration specifying:● Word styles corresponding to each JATS element● Parents for each JATS element● Appropriate section (head, body, back) for each JATS
element
![Page 24: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/24.jpg)
Step 1: Transform data into internal representationCreate empty array for each section (front, body, back)
For each element in DOCX body:● Find style and contents of element● Determine which JATS element (configuration)● Determine which section (configuration)● Append tuple [JATS element, content, style] to section's
array
![Page 25: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/25.jpg)
DOCX XML with Word Style<w:body>
<w:p>
<w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr>
<w:r>
<w:t>Color variability and body size of larvae of two</w:t>
</w:r>
<w:r>
<w:rPr><w:i/></w:rPr>
<w:t>Epomis</w:t>
</w:r>
<w:r>
...
![Page 26: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/26.jpg)
Norm configuration mappings (default.cfg)
[ FRONT ]
ArticleTitle = article-title
[ FRONT-PARENTS ]
article-title = title-group
title-group = article-meta
article-meta = front
![Page 27: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/27.jpg)
Sample internal representation: article title
Title: Color variability and body size of larvae of two Epomis species (Coleoptera, Carabidae) in Israel, with a key to the larval stages
In Norm’s internal representation:('article-title',
[('Color vari...of two', None, None),
('Epomis', ['i'], None)
('(Coleoptera...stages', None, None)],
'ArticleTitle')
![Page 28: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/28.jpg)
Step 2: Render JATS output from internal representationCreate empty Document Object Model (DOM) tree
For each section (front, body, back):● Add node for section to tree● For each tuple for section (see step 1):
○ Create node for JATS element tuple○ Find parent for element (configuration)○ Attach node to parent
Marshall output to XML.
![Page 29: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/29.jpg)
An article title in JATS
<article>
<front>
<title>
<article-meta>
<title-group>
<article-title>
Color variability and body size of
larvae of two <i>Epomis</i> species
(Coleoptera, Carabidae) in Israel,
with a key to the larval stages
</article-title>
![Page 30: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/30.jpg)
Future Plans for Norm
The <body> of the article is where we’re seeing the most feature creep, making configuration and styles increasingly complicated.
Options:1. “Norm lite” for the front, another tool (meTypeset) for the body2. Norm to handle both front and body, refactor of the codebase needed
![Page 31: Integrated Publishing and Archiving of Journals in HathiTrustjats.nlm.nih.gov/jats-con/2013/presentations/hawkins2013.pdf · Ruby on Rails application. Prepper Article Prep (1 of](https://reader035.vdocuments.net/reader035/viewer/2022063013/5fcc49adc3d10e32f02253fe/html5/thumbnails/31.jpg)
www.lib.umich.edu/mpach