mechanical curator - technical notes

"Mechanical Curator" (The technical story)

Upload: benosteen

Post on 06-May-2015



TRANSCRIPT

Page 1: Mechanical curator - Technical notes

"Mechanical Curator"

(The technical story)

Page 3: Mechanical curator - Technical notes

It began with dogfood...

• "Given access to a filesystem of media with an easily learned layout convention, can a researcher use their own tools?"

• So we contrived a research question:

Page 4: Mechanical curator - Technical notes

"Can we find the faces in the 19th C scanned book collection?"

Page 5: Mechanical curator - Technical notes
Page 6: Mechanical curator - Technical notes

Outcome:

• Majority of tools and libraries expect local filesystem or in-memory access; no network/API knowledge needed by researcher.

• While lookup by layout is awkward, it is a pragmatic approach when distributing content by sneakernet. It could be paired with a lightweight online search engine and a documentation wiki for best practices.

Page 8: Mechanical curator - Technical notes

'Project' success?

• Computer Vision algorithms are predominantly based on photographic input. Room for improvement.

• Catch-22 with respect to training sets.

• But... applying Haar cascade profiles, based on a photo training set, had some reasonable success!
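
As an illustration of the Haar cascade approach mentioned above, here is a minimal sketch using OpenCV's Python bindings. It assumes the `opencv-python` package, which bundles the stock frontal-face cascade under `cv2.data.haarcascades`. The function name and the `scaleFactor` / `minNeighbors` values are illustrative choices, not the settings used in the project.

```python
def detect_faces(gray_image, scale_factor=1.1, min_neighbors=5):
    """Return face bounding boxes as (x, y, w, h) tuples.

    gray_image is a 2-D uint8 NumPy array, e.g. a greyscale page scan.
    """
    # Imported here so this module still loads on machines without OpenCV.
    import cv2

    # Stock cascade trained on photographs (assumed to ship with opencv-python).
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    cascade = cv2.CascadeClassifier(cascade_path)
    if cascade.empty():
        raise RuntimeError("could not load cascade: " + cascade_path)

    found = cascade.detectMultiScale(
        gray_image, scaleFactor=scale_factor, minNeighbors=min_neighbors
    )
    # Convert NumPy rows into plain tuples of ints.
    return [tuple(int(value) for value in box) for box in found]
```

With a page loaded via `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`, `detect_faces(page)` returns one box per candidate face. Raising `min_neighbors` trades missed faces for fewer false positives.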

Page 10: Mechanical curator - Technical notes

19C depictions of faces

• Likelihood of detection: female faces > male faces

• Why women?
  – Drawn more symmetrically; male faces were more likely to be exaggerated.
  – Depiction is typically 'clean' and posed.
  – Fashion: beards, spectacles and hats are very different to the training sets.

Page 12: Mechanical curator - Technical notes

An Interesting By-product emerged

• The ALTO XML, created by Microsoft as part of the digitisation process, was found to have 'GraphicalIllustration' elements.
  – These give polygonal boundaries for areas where it detected contiguous content but where OCR didn't work.
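
To make the idea concrete, here is a rough sketch of pulling illustration regions out of an ALTO page with Python's standard library. It reads only the rectangular `HPOS` / `VPOS` / `WIDTH` / `HEIGHT` block attributes, not any polygon outline. The sample XML is made up, and matching on both `Illustration` (the ALTO schema's block name) and `GraphicalIllustration` (the name quoted above) is an assumption about what the files contain.

```python
import xml.etree.ElementTree as ET

# Element names to look for (assumption: either may appear in the files).
ILLUSTRATION_TAGS = {"Illustration", "GraphicalIllustration"}


def local_name(tag):
    """Strip a namespace prefix such as '{http://...}Illustration'."""
    return tag.rsplit("}", 1)[-1]


def find_illustrations(alto_xml):
    """Return one dict per illustration block with its bounding box."""
    root = ET.fromstring(alto_xml)
    regions = []
    for element in root.iter():
        if local_name(element.tag) not in ILLUSTRATION_TAGS:
            continue
        regions.append({
            "id": element.get("ID"),
            # Units depend on the MeasurementUnit declared in the file.
            "x": int(float(element.get("HPOS", 0))),
            "y": int(float(element.get("VPOS", 0))),
            "width": int(float(element.get("WIDTH", 0))),
            "height": int(float(element.get("HEIGHT", 0))),
        })
    return regions


# Made-up sample page, not taken from the collection.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="T1" HPOS="10" VPOS="10" WIDTH="500" HEIGHT="80"/>
    <Illustration ID="I1" HPOS="120" VPOS="340" WIDTH="800" HEIGHT="600"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(find_illustrations(SAMPLE))
```

Each returned box can then be used to crop the matching region out of the page image.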

Page 13: Mechanical curator - Technical notes

A map to all* the images?

* Unlikely to be comprehensive

Page 14: Mechanical curator - Technical notes

A map to all* the images?

The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations

Page 15: Mechanical curator - Technical notes
Page 16: Mechanical curator - Technical notes
Page 17: Mechanical curator - Technical notes

Microsoft Books

• Context:
  – 47k 'works' digitised, 68k volumes
  – 15.3 TB images, 1.3 TB ALTO XML
  – circa 22+ million JPEG 2000 images, 150-200 DPI (unconfirmed), a zipfile ('store') per volume
  – 360 pages per volume on average
  – No explicit subjects in metadata, but heavy on travel, geography, ethnology, (English) literature and plenty of 'misc'

Page 18: Mechanical curator - Technical notes

Accessible?

• In theory, the books were accessible online.

• In practice, it was a real challenge to find anything viewable.

Page 19: Mechanical curator - Technical notes

Image extraction process

• Worker-based, using a message queue to coordinate.

• Thread-unsafe (due to zips) so limited to one worker per zip.
  – Local network storage was nearly full
  – Limited by hardware too (4 months to get RAM upgrade)
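
The "one worker per zip" rule can be sketched by making the whole zip the unit of work, so that no two workers ever open the same archive. This is a minimal stand-in: it uses an in-process `queue.Queue` and threads instead of the Redis queue and separate worker processes described here, and it only lists each archive's members rather than extracting images. The function names are hypothetical.

```python
import queue
import threading
import zipfile


def extract_worker(jobs, results, results_lock):
    """Take zip paths off the queue until it is empty.

    A job is a whole zip file, so each archive is read by exactly one worker.
    """
    while True:
        try:
            zip_path = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            with zipfile.ZipFile(zip_path) as store:
                outcome = store.namelist()
        except (OSError, zipfile.BadZipFile) as error:
            # Record the failure instead of letting the worker die.
            outcome = error
        with results_lock:
            results[zip_path] = outcome
        jobs.task_done()


def run_workers(zip_paths, worker_count=4):
    """Process every zip with a small pool; return {path: names or error}."""
    jobs = queue.Queue()
    for path in zip_paths:
        jobs.put(path)
    results, results_lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=extract_worker, args=(jobs, results, results_lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Storing the exception as the outcome keeps a corrupt archive visible in the results, where it can be inspected or requeued later.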

Page 20: Mechanical curator - Technical notes

Tech used:

• VirtualBox
• Redis (message queue, semaphore, metadata cache)
• Python
  – OpenCV was the main library used:
    • Opens JPEG 2000 with colour profiles
    • Quick to work with image regions
    • Also saved region as JPG (92% quality) for reuse

Page 21: Mechanical curator - Technical notes

Filter first!

• Only ALTO files containing an Illustration element are of concern.

• Grep quickly picked out the 1 million XML files of interest (only 4-5% of total)
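
The same cheap pre-filter can be sketched in Python for anyone without grep to hand. It does a plain byte-substring test, much like `grep -l`, and parses no XML. The function name, the search string and the `*.xml` pattern are assumptions about the layout.

```python
from pathlib import Path


def files_mentioning(root_dir, needle=b"Illustration", pattern="*.xml"):
    """Yield paths under root_dir whose raw bytes contain needle."""
    for path in sorted(Path(root_dir).rglob(pattern)):
        # Reads the whole file; fine for page-sized ALTO documents.
        if needle in path.read_bytes():
            yield path
```

Only the files that pass this test need to go on to the slower XML parsing and image extraction steps.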

Page 25: Mechanical curator - Technical notes

Resilience

• Never trust a process
  – Did it fail?
  – Did it fail silently?
  – Does the expected JPG exist on disc? Is it non-zero in length?
  – Did IT services hard reboot the desktop machine hosting your VMs during the night?
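
A check along these lines covers the "exists and is non-zero" questions. This is a minimal sketch, and the function name is hypothetical.

```python
from pathlib import Path


def missing_outputs(expected_paths):
    """Return the expected files that are absent or zero bytes long."""
    failed = []
    for raw_path in expected_paths:
        path = Path(raw_path)
        if not path.is_file() or path.stat().st_size == 0:
            failed.append(path)
    return failed
```

Anything returned can be put back on the queue, which also catches jobs lost to a silent failure or an overnight reboot.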

Page 26: Mechanical curator - Technical notes

Overview:

• Started with one desktop VM, and a connection to a local NAS

• Ended having used multiple VMs on Azure as well, after piping content to their store.
  – Redis replicated natively, with an SSH tunnel to the write node

Page 27: Mechanical curator - Technical notes

Identifiers...

• Little help available from overstretched IT architecture team.

• Naive filename syntax to begin with:
  – SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
  – Stored by publication year.
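
A small sketch of building and parsing names in the SYSNUM_VOL_PG_IMGIDX_humantxt.jpg pattern follows. The field names and example values are made up, and all fields are kept as strings because the padding conventions are not stated here. The human-readable part is assumed to be the only field that may itself contain underscores.

```python
from typing import NamedTuple


class ImageName(NamedTuple):
    sysnum: str
    volume: str
    page: str
    image_index: str
    human_text: str


def build_name(name):
    """Join the five fields into SYSNUM_VOL_PG_IMGIDX_humantxt.jpg."""
    return "_".join(name) + ".jpg"


def parse_name(filename):
    """Split a filename back into its five fields."""
    if not filename.lower().endswith(".jpg"):
        raise ValueError("expected a .jpg filename: " + filename)
    # Split at most 4 times so underscores in the human text survive.
    parts = filename[:-4].split("_", 4)
    if len(parts) != 5:
        raise ValueError("expected 5 underscore-separated fields: " + filename)
    return ImageName(*parts)
```

Because the metadata is recoverable from the name alone, a job can be passed around as just a filepath.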

Page 28: Mechanical curator - Technical notes

We have images!

• 580 GB of JPGs
• From dogfooding, a hybrid approach seemed necessary:
  – Online, sharable, linkable, easy-to-find presence, with a unique ID per image.
  – Easy mapping between local image and online image.

Page 29: Mechanical curator - Technical notes

Images already available

• ... in theory.
• We needed something else in the short term.

Page 30: Mechanical curator - Technical notes

Options

• Wikimedia Commons: we know about the books, but have no idea about the actual content! Wikimedia Commons wouldn't be able to handle 1 million images in one go.

• Er... Flickr?

Page 31: Mechanical curator - Technical notes

Upload by worker

• Again, similar structure: the job was simply a filepath (metadata deducible)

• Ran approximately 16-18 workers for 9 days to upload images.

• High 90s upload success rate (time of day dependent)
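
With a success rate below 100%, each upload job needs some way to report or retry a failure. The sketch below shows one possible shape for that. Whether the original workers retried in place or simply requeued is not stated here, and `upload` is a hypothetical callable standing in for whatever client performs the real upload.

```python
import time


def upload_with_retry(filepath, upload, attempts=3, delay_seconds=0.0):
    """Call upload(filepath) until it succeeds or attempts run out.

    Returns (True, result) on success or (False, last_error) on failure.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, upload(filepath)
        except Exception as error:  # a real worker would catch narrower errors
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    return False, last_error
```

Returning the outcome instead of raising lets the caller decide whether to requeue the filepath or log it for later inspection.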

Page 32: Mechanical curator - Technical notes

Outcome

• Launched 13 December 2013 on Flickr Commons

• Spike: 55 million image views in 5 days

• By March 2014, 70k+ tags added by community - map, portrait, cover, childrensbook, and so on.

Page 33: Mechanical curator - Technical notes
Page 34: Mechanical curator - Technical notes

Keeping track

• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to mostly work

Page 35: Mechanical curator - Technical notes

Current scheme

• Every morning, call recentlyUpdated for list of images that had some change

• Re-scan images and deduce changes in tags, comments, views and favourites.
  – (Same pattern: rescan jobs are taken by get_activity workers. Running 4 is enough outside of spike times)
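
Deducing what changed comes down to comparing the stored snapshot of an image with the freshly scanned one. Here is a minimal sketch, assuming each snapshot is a dict with a `tags` collection and integer `comments`, `views` and `favourites` counts. That shape is an illustration, not the Flickr API's response format.

```python
def diff_activity(previous, current):
    """Compare two snapshots of one image and report what changed."""
    changes = {}
    old_tags = set(previous.get("tags", ()))
    new_tags = set(current.get("tags", ()))
    if new_tags - old_tags:
        changes["tags_added"] = sorted(new_tags - old_tags)
    if old_tags - new_tags:
        changes["tags_removed"] = sorted(old_tags - new_tags)
    for counter in ("comments", "views", "favourites"):
        delta = current.get(counter, 0) - previous.get(counter, 0)
        if delta:
            changes[counter + "_delta"] = delta
    return changes
```

An empty result means the image was reported as updated but nothing tracked here actually moved.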

Page 36: Mechanical curator - Technical notes

Caching

• Redis sets:
  – PeopleID links to set of FlickrID+tagadded
  – FlickrID links to set of user tags
  – Sorted sets for 'high score' lists: contributors, favourites, tags
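
The layout above can be mimicked in plain Python to show how the pieces fit together. This is an in-memory stand-in, not Redis itself. The comments name the Redis commands each step corresponds to, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class TagCache:
    """In-memory model of the sets and sorted sets described above."""

    def __init__(self):
        self.tags_by_person = defaultdict(set)  # one Redis set per PeopleID
        self.tags_by_image = defaultdict(set)   # one Redis set per FlickrID
        self.contributor_scores = Counter()     # one Redis sorted set

    def record_tag(self, person_id, flickr_id, tag):
        """Store one tagging event; return False if already recorded."""
        entry = "%s+%s" % (flickr_id, tag)
        if entry in self.tags_by_person[person_id]:
            return False
        self.tags_by_person[person_id].add(entry)  # SADD
        self.tags_by_image[flickr_id].add(tag)     # SADD
        self.contributor_scores[person_id] += 1    # ZINCRBY
        return True

    def top_contributors(self, count=10):
        """Highest scorers first, like ZREVRANGE ... WITHSCORES."""
        return self.contributor_scores.most_common(count)
```

Checking set membership before incrementing the score is what keeps a daily rescan of the same image from counting a tag twice.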

Page 37: Mechanical curator - Technical notes

Summary

• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.