mediax (jan 2013) -- pkp xml parsing

Left to Their Own Devices: Automating XML Parsing and Rendering for Scholarly Publishing

Alex Garnett & John Willinsky

Public Knowledge Project

What do we want? XML Publishing!

• When do we want it? 2004 would’ve been nice…

• We’ve known the value of properly marked up documents for a few decades now

– Unfortunately, this entails hours of marking.

• Open-source publishers on limited budgets can’t afford the outsourcing or the grad students that normally make this possible

– And that’s not such a great precedent anyhow.

The Public Knowledge Project

• Developers of Open Journal Systems & Open Monograph Press

– Open source software to support open access publishing.

– http://pkp.sfu.ca

• Our userbase happens to include many such small publishers, who publish almost exclusively in PDF, given its ease.


Nice things that PDF doesn’t have

• Well-structured text mining & indexing

• Rendering in different formats (e.g. mobile)

• Embedded dynamic content

• Citation parsing and lookup

• Reliable metadata

• So why are we still using it, again?

XML Publishing Workflows

• Are complex and underdocumented, requiring lots of manual labour, since no author will ever write in XML, and only a small fraction will use Markdown or LaTeX or some other text format that’s easy to transform, and most automated parsing tools are in deplorable condition anyhow, rant rant rant, despite the fact that there are many very good piecemeal tools available at different stages of these workflows. We put some of them together.

Toolchain

• External Services:

– LibreOffice: document conversion

– pdfx: fuzzy parsing

– ParsCit: fuzzy citation parsing

– citeproc/CSL: citation transformation
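As a rough illustration of how these services might be chained, here is a minimal Python sketch. It is not the PKP implementation. The only real interface it touches is LibreOffice's headless `--convert-to` command line, and it only builds that command without running it. The stage functions `fake_structure_parse` and `fake_citation_parse`, and the XML they emit, are hypothetical stand-ins for pdfx and ParsCit, invented here purely to show the shape of a staged pipeline.

```python
from pathlib import Path
from typing import Callable, List

# A stage takes the document in its current form and returns the next form.
Stage = Callable[[str], str]


def libreoffice_convert_command(source: Path, out_dir: Path,
                                target: str = "pdf") -> List[str]:
    """Build (but do not run) a headless LibreOffice conversion command."""
    return [
        "soffice", "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(source),
    ]


def run_pipeline(document: str, stages: List[Stage]) -> str:
    """Pass the document through each stage in order."""
    for stage in stages:
        document = stage(document)
    return document


# Hypothetical stand-in for a structure parser such as pdfx.
# The element names are placeholders, not real pdfx output.
def fake_structure_parse(text: str) -> str:
    return f"<article><body>{text}</body></article>"


# Hypothetical stand-in for a citation parser such as ParsCit.
def fake_citation_parse(xml: str) -> str:
    return xml.replace("</article>", "<ref-list/></article>")


if __name__ == "__main__":
    print(libreoffice_convert_command(Path("paper.docx"), Path("out")))
    print(run_pipeline("Hello", [fake_structure_parse, fake_citation_parse]))
```

Keeping each service behind a plain function makes it possible to swap one out (for example, producing a release without pdfx) without touching the rest of the chain.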

Future Work

• After incorporating upstream changes from pdfx (fixes for punctuation and non-English languages), we’re aiming to have an OJS plugin by March.

• OMP will follow soon after.

• By the end of our initial funding period in June, we’ll have a source release (without pdfx) and plan to be supporting a set of OJS/OMP users.

Future Work not done by us

• Collaborators at Heidelberg University are working on a WYSIWYG in-browser XML editor for manually revising article formatting.

• The University of Michigan’s mPach system will add ePub generation and HathiTrust ingest.

• CrossRef will be contributing functionality to look up, verify, and link parsed citations.

Thanks

• Damion Dooley, our primary developer

• Steve Pettifer and the University of Manchester for allowing us to use pdfx

• Juan Alperin and the rest of the PKP team for their support and earlier work

• Alf Eaton from the NLM for stylesheets

• MediaX for funding this project

Questions?

• If you want to use our service for document preparation right now, contact me (Alex) at [email protected].

• We’ll have a stable version available by the end of January (probably free with registration)

• OJS/OMP integration and standalone release (without pdfx) coming soon!

