mechanical curator - technical notes

"Mechanical Curator"

(The technical story)

It began with dogfood...

• "Given access to a filesystem of media with an easily learned layout convention, can a researcher use their own tools?"

• So we contrived a research question:

"Can we find the faces in the 19th C scanned book collection?"

Outcome:

• Majority of tools and libraries expect local filesystem or in-memory access; no network/API knowledge needed by researcher.

• While lookup by layout is awkward, it is a pragmatic approach when distributing content by sneakernet. It could be paired with a lightweight online search engine and a documentation wiki for best practices.

'Project' success?

• Computer Vision algorithms are predominantly based on photographic input. Room for improvement.

• Catch-22 with respect to training sets.

• But... applying Haar cascade profiles, based on a photo training set, had some reasonable success!

19C depictions of faces

• Likelihood of detection:
– Female faces > male faces

• Why women?
– Drawn more symmetrically; male faces were more likely to be exaggerated.
– Depiction is typically 'clean' and posed.
– Fashion: beards, spectacles and hats, very different to the training sets.

An Interesting By-product emerged

• The ALTO XML, created by Microsoft as part of the digitisation process, was found to have 'GraphicalIllustration' elements.
– Polygonal boundaries for areas where it detected contiguous content but where OCR didn't work.
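A minimal sketch of how such elements could be pulled out of an ALTO file. The sample XML is hypothetical, and the sketch assumes each element carries the rectangular HPOS/VPOS/WIDTH/HEIGHT attributes that ALTO blocks commonly have; true polygon outlines are not handled here. The element name is a parameter because the exact tag may differ between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment; real files carry namespaces and far more detail.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="T1" HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="400"/>
        <GraphicalIllustration ID="G1" HPOS="250" VPOS="600" WIDTH="1500" HEIGHT="1200"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def illustration_regions(alto_xml, element_name="GraphicalIllustration"):
    """Return (id, x, y, width, height) for each matching element."""
    root = ET.fromstring(alto_xml)
    regions = []
    for elem in root.iter():
        if local_name(elem.tag) != element_name:
            continue
        # Coordinates may be written as decimals, so go via float.
        box = [int(float(elem.get(key, 0)))
               for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        regions.append((elem.get("ID"), *box))
    return regions


print(illustration_regions(SAMPLE_ALTO))
```

Each tuple is enough to crop the matching region out of the page image.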

A map to all* the images?

* Unlikely to be comprehensive

The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations

Microsoft Books

• Context:
– 47k 'works' digitised, 68k volumes
– 15.3TB images, 1.3TB ALTO XML
– 22+ million JPEG 2000 images, 150-200 DPI (unconfirmed), a zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on travel, geography, ethnology, (English) literature and plenty of 'misc'

Accessible?

• In theory, the books were accessible online.

• In practice, it was a real challenge to find anything viewable.

Image extraction process

• Worker-based, using a message queue to coordinate.

• Thread-unsafe (due to zips), so limited to one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get a RAM upgrade)
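The worker pattern above might look roughly like this. It is a sketch only: an in-process `queue.Queue` stands in for the Redis message queue the project used, and the function names are made up for illustration. Each job is a whole zip, so no two workers ever open the same archive.

```python
import queue
import threading
import zipfile


def extract_worker(jobs, results):
    """Take one zip path at a time; no other worker touches that zip."""
    while True:
        zip_path = jobs.get()
        if zip_path is None:            # sentinel: no more work
            break
        try:
            with zipfile.ZipFile(zip_path) as store:
                names = store.namelist()   # real work would extract regions here
            results.put((zip_path, len(names), None))
        except Exception as exc:           # record the failure, keep the worker alive
            results.put((zip_path, 0, repr(exc)))


def run_workers(zip_paths, worker_count=2):
    """Queue one job per zip, run the workers, return results sorted by path."""
    jobs, results = queue.Queue(), queue.Queue()
    for path in zip_paths:
        jobs.put(path)
    for _ in range(worker_count):
        jobs.put(None)                     # one sentinel per worker
    threads = [threading.Thread(target=extract_worker, args=(jobs, results))
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected, key=lambda item: item[0])
```

Recording failures as results, rather than letting a worker die, makes it possible to re-queue the affected zips later.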

Tech used:

• VirtualBox

• Redis (message queue, semaphore, metadata cache)

• Python
– OpenCV was the main library used:
  • Opens JPEG 2000 with colour profiles
  • Quick to work with image regions
  • Also saved each region as JPG (92%) for reuse

Filter first!

• Only ALTO files with an Illustration element are of concern.

• Grep quickly identified the 1 million XML files of interest (only 4-5% of total)
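For illustration, a grep-style pre-filter could be written in Python as below. This is a sketch, not the command that was actually run: the marker bytes, function names and `.xml` extension are assumptions. It does a plain substring test without parsing any XML, which is what keeps it fast.

```python
import os


def has_illustration(xml_path, marker=b"Illustration", chunk_size=1 << 20):
    """Cheap substring test in the spirit of grep: no XML parsing."""
    overlap = len(marker) - 1
    tail = b""
    with open(xml_path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            # Keep the last few bytes so a marker split across chunks is still seen.
            tail = window[-overlap:] if overlap else b""


def files_of_interest(root_dir, marker=b"Illustration"):
    """Walk a directory tree and return the XML files that mention the marker."""
    hits = []
    for folder, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.lower().endswith(".xml"):
                path = os.path.join(folder, name)
                if has_illustration(path, marker):
                    hits.append(path)
    return sorted(hits)
```

Only the files returned here need to go on to the slower parsing and image extraction steps.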

Resilience

• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disc? Is it non-zero in length?
– Did IT services hard reboot your desktop machine hosting the VMs you use in a given night?
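The "exists and is non-zero" check is simple to sketch. The function name is hypothetical; the idea is to compare the list of outputs a job should have produced against what is actually on disc, and re-queue anything that falls short.

```python
import os


def missing_or_empty(expected_paths):
    """Return the paths to re-queue: absent, or present but zero-length."""
    failed = []
    for path in expected_paths:
        try:
            if os.path.getsize(path) == 0:
                failed.append(path)
        except OSError:                # file does not exist or cannot be read
            failed.append(path)
    return failed
```

Run after every batch, this catches silent failures and interrupted jobs alike, because it inspects the result rather than trusting the process's exit status.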

Overview:

• Started with one desktop VM, and a connection to a local NAS

• Ended up using multiple VMs on Azure as well, after piping content to their store.
– Redis replicated natively, with an SSH tunnel to the write node

Identifiers...

• Little help available from overstretched IT architecture team.

• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
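As an illustration of that naming scheme, a filename could be assembled as below. The function name, the way the human-readable text is sanitised and the length limit are all assumptions for this sketch; the original padding and sanitising rules are not given here.

```python
import re


def image_filename(sysnum, volume, page, image_index, title, max_title=40):
    """Build SYSNUM_VOL_PG_IMGIDX_humantxt.jpg from its parts."""
    # Assumption: the human-readable part keeps letters and digits only,
    # so it can never contain the '_' separator or path characters.
    human = re.sub(r"[^A-Za-z0-9]+", "", title)[:max_title]
    return f"{sysnum}_{volume}_{page}_{image_index}_{human}.jpg"


print(image_filename("000123", "01", "000045", 2, "A Tour, through Wales!"))
# prints 000123_01_000045_2_ATourthroughWales.jpg
```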

We have images!

• 580GB of JPGs

• From dogfooding, a hybrid approach seemed necessary:
– Online, shareable, linkable, easy-to-find presence, with a unique ID per image.
– Easy mapping between local image and online image.

Images already available

• ... in theory.

• We needed something else in the short-term.

Options

• Wikimedia Commons: we know about the books, but have no idea about the actual content! WC wouldn't be able to handle 1 million images in one go.

• Er... Flickr?

Upload by worker

• Again, similar structure: job was simply a filepath (metadata deducible)

• Ran approximately 16-18 workers for 9 days to upload images.

• High 90s upload success rate (time of day dependent)
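To show why a bare filepath is enough for an upload job, here is a sketch that recovers the metadata from it. It assumes the SYSNUM_VOL_PG_IMGIDX_humantxt.jpg naming and that the parent directory is the publication year; the example path and the dictionary keys are hypothetical.

```python
from pathlib import PurePosixPath


def metadata_from_path(filepath):
    """Deduce image metadata from '<year>/SYSNUM_VOL_PG_IMGIDX_humantxt.jpg'."""
    path = PurePosixPath(filepath)
    # Split on the first four underscores only, so the human-readable
    # part may itself contain underscores. Raises ValueError if parts are missing.
    sysnum, volume, page, image_index, human_text = path.stem.split("_", 4)
    return {
        "year": path.parent.name,      # assumption: directory name is the year
        "sysnum": sysnum,
        "volume": volume,
        "page": page,
        "image_index": image_index,
        "human_text": human_text,
    }


print(metadata_from_path("images/1857/000123_01_000045_2_A_Tour.jpg"))
```

Because everything is derivable from the path, a failed upload can be retried by pushing the same string back onto the queue.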

Outcome

• Launched 13 December on Flickr Commons

• Spike: 55 million image views in 5 days

• By March 2014, 70k+ tags added by community - map, portrait, cover, childrensbook, and so on.

Keeping track

• Many bad/misleading API calls

• (people.photos.)recentlyUpdated seems to mostly work

Current scheme

• Every morning, call recentlyUpdated for list of images that had some change

• Re-scan images and deduce changes in tags, comments, views and favourites.
– (Same pattern, rescan jobs taken by get_activity workers. Running 4 is enough outside of spike times)
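Deducing changes from a re-scan amounts to comparing two snapshots of the same image. The sketch below assumes each snapshot is a plain dictionary with a tag list and three counters; that shape and the key names are illustrative, not the actual API response format.

```python
def deduce_changes(previous, current):
    """Compare two snapshots of one image and report only what changed."""
    changes = {}
    old_tags = set(previous.get("tags", ()))
    new_tags = set(current.get("tags", ()))
    if new_tags - old_tags:
        changes["tags_added"] = sorted(new_tags - old_tags)
    if old_tags - new_tags:
        changes["tags_removed"] = sorted(old_tags - new_tags)
    for counter in ("comments", "views", "favourites"):
        delta = current.get(counter, 0) - previous.get(counter, 0)
        if delta:
            changes[counter + "_delta"] = delta
    return changes


before = {"tags": ["map"], "comments": 1, "views": 10, "favourites": 0}
after = {"tags": ["map", "portrait"], "comments": 1, "views": 25, "favourites": 2}
print(deduce_changes(before, after))
# prints {'tags_added': ['portrait'], 'views_delta': 15, 'favourites_delta': 2}
```

An empty result means the image can be skipped, which keeps the daily run cheap outside of spike times.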

Caching

• Redis sets:
– PeopleID links to set of FlickrID+tagadded
– FlickrID links to set of user tags
– Sorted sets for 'high score' lists: contributors, favourites, tags
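The cache layout can be modelled in plain Python to make the structure concrete. This is an in-memory stand-in, not the project's code: dictionaries of sets play the role of Redis sets, and score dictionaries play the role of sorted sets. The class and method names are hypothetical; the comments name the Redis commands that would do the equivalent work.

```python
from collections import defaultdict


class TagCache:
    """In-memory model of the set and sorted-set layout."""

    def __init__(self):
        self.person_to_additions = defaultdict(set)   # PeopleID -> {FlickrID+tag}
        self.photo_to_tags = defaultdict(set)         # FlickrID -> {tag}
        self.scores = defaultdict(lambda: defaultdict(int))  # board -> member -> score

    def record_tag(self, person_id, flickr_id, tag):
        """Record one tag addition; return False if it was already counted."""
        addition = f"{flickr_id}+{tag}"
        if addition in self.person_to_additions[person_id]:
            return False
        self.person_to_additions[person_id].add(addition)   # Redis: SADD
        self.photo_to_tags[flickr_id].add(tag)              # Redis: SADD
        self.scores["contributors"][person_id] += 1         # Redis: ZINCRBY
        self.scores["tags"][tag] += 1                       # Redis: ZINCRBY
        return True

    def top(self, board, count=10):
        """Highest scores first; ties broken alphabetically in this sketch."""
        ranked = sorted(self.scores[board].items(),
                        key=lambda item: (-item[1], item[0]))
        return ranked[:count]
```

Checking the set before incrementing is what makes a daily re-scan safe to repeat: seeing the same tag twice does not inflate anyone's score.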

Summary

• Workers to spin up when required

• Variety of workers, variety of queues

• Never trust a worker or process

• Never trust an API

• Sample where you can't test.
