MediaX (Jan 2013) -- PKP XML Parsing

Left to Their Own Devices: Automating XML Parsing and Rendering for Scholarly Publishing, by Alex Garnett & John Willinsky, Public Knowledge Project

Page 1: MediaX (Jan 2013) -- PKP XML Parsing

Left to Their Own Devices: Automating XML Parsing and Rendering for Scholarly Publishing

Alex Garnett & John Willinsky

Public Knowledge Project

Page 2: MediaX (Jan 2013) -- PKP XML Parsing

What do we want? XML Publishing!

• When do we want it? 2004 would’ve been nice…

• We’ve known the value of properly marked up documents for a few decades now.

– Unfortunately, this entails hours of marking.

• Open-source publishers on limited budgets can’t afford the outsourcing or the grad students that normally make this possible.

– And that’s not such a great precedent anyhow.

Page 3: MediaX (Jan 2013) -- PKP XML Parsing

The Public Knowledge Project

• Developers of Open Journal Systems & Open Monograph Press

– Open source software to support open access publishing.

– http://pkp.sfu.ca

• Our userbase happens to include many such small publishers, who publish almost exclusively in PDF, given its ease.

Page 4: MediaX (Jan 2013) -- PKP XML Parsing

Nice things that PDF doesn’t have

• Well-structured text mining & indexing

• Rendering in different formats (e.g. mobile)

• Embedded dynamic content

• Citation parsing and lookup

• Reliable metadata

• So why are we still using it, again?

Page 5: MediaX (Jan 2013) -- PKP XML Parsing

XML Publishing Workflows

• Are complex and underdocumented, requiring lots of manual labour, since no author will ever write in XML, and only a small fraction will use Markdown or LaTeX or some other text format that’s easy to transform, and most automated parsing tools are in deplorable condition anyhow, rant rant rant, despite the fact that there are many very good piecemeal tools available at different stages of these workflows. We put some of them together.

Page 6: MediaX (Jan 2013) -- PKP XML Parsing
Page 7: MediaX (Jan 2013) -- PKP XML Parsing

Toolchain

• External Services:

– LibreOffice: document conversion

– pdfx: fuzzy parsing

– ParsCit: fuzzy citation parsing

– citeproc/CSL: citation transformation
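
As a rough illustration of the first stage in this toolchain, here is a minimal Python sketch that builds a headless LibreOffice conversion command. It is not the project's own code: the function names, the `submission.docx` filename, and the choice of PDF as the target format are assumptions made for the example, since the slides do not say which formats the pipeline converts between. The `soffice --headless --convert-to ... --outdir ...` invocation is LibreOffice's standard command-line conversion interface.

```python
from pathlib import Path
from typing import List

# Stage order as listed on the slide; the descriptions are the slide's own.
STAGES = [
    ("LibreOffice", "document conversion"),
    ("pdfx", "fuzzy parsing"),
    ("ParsCit", "fuzzy citation parsing"),
    ("citeproc/CSL", "citation transformation"),
]


def libreoffice_convert_command(source: Path, out_dir: Path, target: str = "pdf") -> List[str]:
    """Build (but do not run) a headless LibreOffice conversion command.

    Assumes the `soffice` binary is on PATH. The default target format is an
    assumption for illustration only.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(source),
    ]


def converted_path(source: Path, out_dir: Path, target: str = "pdf") -> Path:
    """Expected output location: same file stem, new extension, inside out_dir."""
    return out_dir / (source.stem + "." + target)


if __name__ == "__main__":
    cmd = libreoffice_convert_command(Path("submission.docx"), Path("out"))
    print(" ".join(cmd))
    # To actually run the conversion: subprocess.run(cmd, check=True)
```

The command is built as a list rather than a shell string so that filenames with spaces pass through safely if it is later handed to `subprocess.run`. The later stages (pdfx, ParsCit, citeproc/CSL) are listed only by name here because the slides give no detail on how they are invoked.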

Page 8: MediaX (Jan 2013) -- PKP XML Parsing

Future Work

• After incorporating upstream changes from pdfx (fixing punctuation & non-English languages), we’re aiming to have an OJS plugin by March.

• OMP will follow soon after.

• By the end of our initial funding period in June, we’ll have a source release (without pdfx) and plan to be supporting a set of OJS/OMP users.

Page 9: MediaX (Jan 2013) -- PKP XML Parsing

Future Work not done by us

• Collaborators at Heidelberg University are working on a WYSIWYG in-browser XML editor for manually revising article formatting.

• The University of Michigan’s mPach system will add ePub generation and HathiTrust ingest.

• CrossRef will be contributing functionality to look up, verify, and link parsed citations.

Page 10: MediaX (Jan 2013) -- PKP XML Parsing

Thanks

• Damion Dooley, our primary developer

• Steve Pettifer and the University of Manchester for allowing us to use pdfx

• Juan Alperin and the rest of the PKP team for their support and earlier work

• Alf Eaton from the NLM for stylesheets

• MediaX for funding this project

Page 11: MediaX (Jan 2013) -- PKP XML Parsing

Questions?

• If you want to use our service for document preparation right now, contact me (Alex) at [email protected].

• We’ll have a stable version available by the end of January (probably free with registration).

• OJS/OMP integration and standalone release (without pdfx) coming soon!