![Page 1: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/1.jpg)
Digitisation at Scale: Automating the mass acquisition of digitised content
IS&T Archiving Conference, Washington, April 2016
Dave Thompson, Digital Curator, Wellcome Library
![Page 2: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/2.jpg)
The Wellcome Library
• Part of Wellcome Collection, an astonishing public venue in London developed by the Wellcome Trust, where people can learn more about medicine through the ages & across cultures
• Five-year plan for transforming the Wellcome Library.
![Page 3: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/3.jpg)
Driver for digitisation
• To make our collections available to anyone, anywhere, we are digitising as much of our physical collection as we can, for both our website and the websites of other organisations. We are also digitising and hosting collections from partners that complement our holdings
Transforming the Wellcome Library: 2009-2014. http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/
![Page 4: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/4.jpg)
The problem
• How to scale systems & processes to deliver on our ambition
• How to design & build new high-volume systems & processes for: acquisition, storage, processing, access
• How to manage volumes of data during creation/acquisition
![Page 5: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/5.jpg)
Process design – sources of content
[Diagram: labels only, layout not recoverable]
Systems: Goobi (METS/OCR); Preservica
Sources: In-house; Institutions; Contractors; Harvesting; Grey literature
Formats & delivery: TIFF or JP2; HD & ftp; Auto harvesting of JP2 & DMD
Processing: Normalises TIFF to JP2; Jpylyzer validates JP2; Manual; Automatic
People: Ingest Officer / Digital Curator; Snagging
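The TIFF/JP2 branch named in the diagram labels can be illustrated with a minimal sketch. It sniffs a delivered file's leading bytes and picks a next step. The step names (`normalise-to-jp2`, `validate-jp2`, `snagging`) and the functions are hypothetical, not Goobi's or Preservica's actual configuration, and a signature check is only a cheap pre-filter, not a substitute for the full validation Jpylyzer performs.

```python
from pathlib import Path

# File signatures ("magic bytes") for the two formats named on the slide.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")  # JPEG 2000 signature box
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def sniff_format(header: bytes) -> str:
    """Return 'jp2', 'tiff' or 'unknown' from the first bytes of a file."""
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(TIFF_SIGNATURES):
        return "tiff"
    return "unknown"


def next_step(path: Path) -> str:
    """Pick the (hypothetical) next workflow step for one delivered file."""
    with path.open("rb") as fh:
        kind = sniff_format(fh.read(12))
    # Anything that is neither TIFF nor JP2 goes to a person for snagging.
    return {
        "tiff": "normalise-to-jp2",
        "jp2": "validate-jp2",
    }.get(kind, "snagging")
```

Routing unknown files to a person rather than rejecting them matches the diagram's idea that snagging is a human task.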
![Page 6: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/6.jpg)
The approach
• (Re)Use/develop existing systems where possible, e.g. bibliographic system Sierra, Preservica EE repository
• Identify where new systems would be required, e.g. workflow middleware
• Take a practical approach & accept that it would be iterative, learning as we go
![Page 7: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/7.jpg)
The solution was to use Goobi
![Page 8: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/8.jpg)
Why Goobi?
• Dedicated to digitisation
• Flexibility & process control
• Adaptable & scalable
• Vendor expertise/support
http://www.inspirelancs.org.uk/interested-in-volunteering-family-carers-volunteers-wanted/
![Page 9: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/9.jpg)
Role of Goobi
• Role of Goobi is overall management & tracking of processes
• Initiates ingest into our DAM, Preservica
• Reporting & statistics
![Page 10: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/10.jpg)
Role of humans
• Working at volume did not imply more staff; it implied efficiency
• Also implied automation
• Human work was focussed on tasks machines couldn't do
http://planetivy.com/gaming/25273/natural-selection-2-gaming-evolution-in-action/
![Page 11: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/11.jpg)
System & process design
• High volume doesn’t imply use of many systems
• Requires design to be as simple as possible, with as few moving parts as possible
• Processes need to be efficient & scalable, human as well as system
http://www.nivenswealthstrategies.com/keeping-it-simple/
![Page 12: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/12.jpg)
Partnership for scalable digitisation
• Relationship with Internet Archive, which is digitising our Library content
• High-volume, long-term project
• Content harvested from Internet Archive website & processed automatically
• Dedicated Goobi process for fully automated harvesting
![Page 13: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/13.jpg)
Harvesting from Internet Archive
Content processed automatically, including creation of METS & ALTO.
Goobi has a ‘repository’ of IA identifiers for searching/harvesting.
Goobi harvests data from Internet Archive website.
Content available in the player.
Content stored in Preservica.
DDS creates JSON for the player & pre-caches some content.
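The harvesting loop described on this slide can be sketched as a small queue of identifiers. This is an illustration under stated assumptions, not the Goobi process itself. `HarvestQueue`, `metadata_url` and `download_url` are hypothetical names, and the URL patterns are archive.org's public metadata and download paths, which may not be what the Wellcome harvester used. The `fetch` callable is supplied by the caller, so the sketch makes no network requests on its own.

```python
from dataclasses import dataclass, field
from urllib.parse import quote

IA_BASE = "https://archive.org"


def metadata_url(identifier: str) -> str:
    """URL of the item-level metadata record for one identifier."""
    return f"{IA_BASE}/metadata/{quote(identifier, safe='')}"


def download_url(identifier: str, filename: str) -> str:
    """URL of one file belonging to an item."""
    return f"{IA_BASE}/download/{quote(identifier, safe='')}/{quote(filename)}"


@dataclass
class HarvestQueue:
    """A stand-in for the 'repository' of identifiers awaiting harvest."""
    pending: list = field(default_factory=list)
    done: dict = field(default_factory=dict)

    def add(self, identifier: str) -> None:
        # Skip identifiers that are already queued or already processed.
        if identifier not in self.pending and identifier not in self.done:
            self.pending.append(identifier)

    def run(self, fetch) -> None:
        """Call fetch(identifier) for each queued item and record the outcome."""
        while self.pending:
            identifier = self.pending.pop(0)
            try:
                fetch(identifier)
                self.done[identifier] = "harvested"
            except Exception as exc:
                # Failures are recorded for a person to review, not retried blindly.
                self.done[identifier] = f"snagging: {exc}"
```

Recording failures per identifier, rather than stopping the run, lets one bad item wait for human attention while the rest of the batch continues.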
![Page 14: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/14.jpg)
Challenges – M&Ms
• Multi-volume works
• No metadata to support their union
• Have to construct them manually, but process can be simplified
• Time consuming, still to be fully automated
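One way the manual construction might be simplified is sketched below. It assumes a curator keeps a small table linking each digitised item to its parent work and volume number, since no existing metadata carries that link. The row layout and the function name `group_volumes` are hypothetical.

```python
from collections import defaultdict


def group_volumes(rows):
    """Group items into works, ordered by volume number.

    rows: iterable of (work_id, volume_number, item_identifier) tuples,
    e.g. read from a curator-maintained spreadsheet (an assumption).
    """
    works = defaultdict(list)
    for work_id, volume, item in rows:
        works[work_id].append((int(volume), item))
    # Sort each work's items by volume number and keep only the identifiers.
    return {work: [item for _, item in sorted(vols)] for work, vols in works.items()}
```

Usage: `group_volumes([("w1", "2", "itemB"), ("w1", "1", "itemA")])` groups both items under `w1` with `itemA` first.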
![Page 15: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/15.jpg)
Challenges – Working with partners
• Changes to Internet Archive website broke our harvesting
• For automated ftp to work, 3rd parties need to follow instructions
• Creation of JPEG2000 images/video
• Incorrect identifiers trip up processes
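A guard against malformed identifiers could look like the sketch below, which screens a batch before anything is queued. The pattern (a leading letter or digit, then letters, digits, `.`, `_` or `-`, up to 100 characters in total) is an assumption for illustration, not a rule documented by the Wellcome Library or a partner. A well-formed identifier can still point at the wrong item, so this catches only one class of error.

```python
import re

# Assumed identifier shape; adjust to whatever the partner actually specifies.
IDENTIFIER_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,99}")


def split_identifiers(candidates):
    """Separate well-formed identifiers from ones needing human review."""
    valid, rejected = [], []
    for raw in candidates:
        ident = raw.strip()  # stray whitespace is a common copy-paste error
        if IDENTIFIER_PATTERN.fullmatch(ident):
            valid.append(ident)
        else:
            rejected.append(raw)
    return valid, rejected
```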
![Page 16: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/16.jpg)
Opportunities
• Working with IT, flexibility of virtualised environment
• Working with Intranda, brings in vendor expertise
• Distributed system brings in feedback from many users
• Small team simplifies decision making
• Success leads to success
![Page 17: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/17.jpg)
Life cycle management
• Good place with regard to life cycle management
• Consistent processes based on common workflows
• Goobi outputs consistent & predictable
• Unified data set easier to manage in the future
![Page 18: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/18.jpg)
Has automation been successful?
• Yes, with a "but"
• Automation can be complex, easy to make mistakes
• Automation requires metadata to be available
• Automated processes still require a human minder
![Page 19: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/19.jpg)
The scale of things
![Page 20: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/20.jpg)
![Page 21: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/21.jpg)
Lessons learned
• Complexity vs simplicity
• Iterative approaches work but are time consuming
• Vendor support/input crucial when starting from scratch
• Process design essential
![Page 22: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/22.jpg)
Be bold. Sometimes it’s the way we work that has to change
![Page 23: Digitisation at Scale: Automating the mass acquisition of digitised content](https://reader035.vdocuments.net/reader035/viewer/2022081521/58ecd6fd1a28ab727e8b45f1/html5/thumbnails/23.jpg)
Thank you
Questions now, questions later…?
Dave Thompson, Digital Curator, Wellcome Library
[email protected] @d_n_t
http://wellcomelibrary.org/