BHL Technologies: Review for BHL-Australia
DESCRIPTION
A review of technologies in use within the Biodiversity Heritage Library, as presented to BHL-Australia partners and the Atlas of Living Australia.
TRANSCRIPT
TECHNOLOGY
Chris Freeland, Technical Director
Biodiversity Heritage Library: http://biodiversitylibrary.org
Topics Covered
• Development History
• Usage
• Scanning & Content Acquisition
• Technologies
• Data Mining
• Services & APIs
• CiteBank
• Global BHL
http://www.biodiversitylibrary.org/item/38659
Tech History
• Preliminary work: MOBOT’s Botanicus, http://www.botanicus.org
  – Funded by Keck Foundation & IMLS
  – Working demonstration of how nomenclators/databases (like Tropicos) can link into digitized scientific literature
• Codebase reused for BHL, then changed to fit requirements for EOL
Usage
Referrers: 2008 - 2009
Referrers: 2010
Jan 1 – Mar 15, 2010
SCANNING & CONTENT ACQUISITION
Workflow
• Selection
• Preparation
• Digitization
• Post Production
• (Re)publication
• Conservation
Complexities of distributed, mass scanning
from NYBG
from Smithsonian
BHL ScanList
http://bhl.nhm-wien.ac.at/scanlist/index.php
http://bhl.nhm-wien.ac.at/scanlist/index.php/Bibs/view/1018
Scanning = human work
Scan & Store: Internet Archive
Scanning on Scribes
Storage in Petaboxes
Scanning Derivatives
Master: XML, JP2
Derivatives: PDF, JPG, TXT, DjVu
[Diagram labels: OCR, XML, JP2]
Ingest from other IA Partners
• Used a mixture of subject analysis & other bibliographic metadata to identify content for inclusion in BHL
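A subject-based selection step like the one described above could look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the subject terms, the `subjects` and `identifier` field names, and the sample records are hypothetical, since the slide does not say which terms or metadata fields were actually used.

```python
# Hypothetical subject terms; the slide does not list the terms BHL used.
SUBJECT_TERMS = {"botany", "zoology", "natural history"}


def is_candidate(record, terms=SUBJECT_TERMS):
    """Return True if any subject heading of the record contains a target term."""
    subjects = [s.lower() for s in record.get("subjects", [])]
    return any(term in subject for subject in subjects for term in terms)


# Made-up records standing in for Internet Archive bibliographic metadata.
records = [
    {"identifier": "example-item-1", "subjects": ["Botany -- Australia"]},
    {"identifier": "example-item-2", "subjects": ["Railroads -- History"]},
]
selected = [r["identifier"] for r in records if is_candidate(r)]
print(selected)
```

A keyword match like this is only a first pass; the "other bibliographic metadata" mentioned above would be needed to catch relevant items with missing or unusual subject headings.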
BHL TECHNOLOGIES
Distributed (Somewhat)
Internet Archive: Digitized content / files
MOBOT: Database & web application
MBL: Redundant cluster
BHL Development Team
Name Finding in action
with Taxonomic Intelligence…
http://biodiversitylibrary.org/page/10165550
• Image from scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• SOAP response
http://biodiversitylibrary.org/page/10165550
http://biodiversitylibrary.org/name/Petalostigma_banksii
http://eol.org/pages/1153286
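The name-finding flow above (OCR text in, names out, names linked to a BHL name page) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how TaxonFinder works: the regular expression is a naive stand-in that treats any capitalised word followed by a lowercase word as a candidate binomial, the OCR text is made up, and the NameBank verification step is left out because its interface is not described here. Only the `/name/` link pattern is taken from the example URL above.

```python
import re

# Naive stand-in for TaxonFinder: a capitalised genus followed by a lowercase
# epithet. The real tool is far more sophisticated; this pattern is an
# assumption made only to keep the sketch self-contained.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidate_names(ocr_text):
    """Return unique candidate binomials found in OCR text, in order of appearance."""
    seen = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in seen:
            seen.append(name)
    return seen


def bhl_name_url(name):
    """Build a BHL name page link following the URL pattern shown on the slide."""
    return "http://biodiversitylibrary.org/name/" + name.replace(" ", "_")


# Hypothetical OCR output for a page; not taken from the actual scan.
ocr_text = "Petalostigma banksii is described here. Petalostigma banksii occurs inland."
names = find_candidate_names(ocr_text)
links = [bhl_name_url(n) for n in names]
print(links)
```

A pattern this loose also matches ordinary phrases such as "The species", which is why verifying candidates against a name authority matters in the real pipeline.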
Name finding statistics
• 30 million pages scanned
• 70 million name strings found
• 60 million names verified with a NameBankID
• 1.5 million unique names with a NameBankID
• 3.5 million unique names *without* a NameBankID
  – This is where the interesting data live!!!
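To put the figures above in proportion, here is a small Python calculation. It assumes the two unique-name counts are disjoint and together make up all unique names found, which the slide implies but does not state.

```python
# Figures from the slide, in millions.
name_strings_found = 70
names_verified = 60
unique_with_id = 1.5
unique_without_id = 3.5

# Share of found name strings that were verified with a NameBankID.
verified_share = names_verified / name_strings_found

# Share of unique names lacking a NameBankID, assuming the two unique counts
# are disjoint and together cover all unique names (an assumption).
unique_total = unique_with_id + unique_without_id
unverified_unique_share = unique_without_id / unique_total

print(f"verified strings: {verified_share:.1%}")
print(f"unique names without an ID: {unverified_unique_share:.1%}")
```

Under that assumption, roughly 86% of name strings are verified, yet about 70% of the distinct names have no NameBankID.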
Services & APIs
• OpenURL
  – Facilitate links to citations: protologues, articles, references
    • Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
  – Useful to Nomenclators, Reference Systems
    • IPNI
    • Tropicos
• Names Service
  – Return all occurrences of a name throughout BHL digitized corpus
    • Documentation: http://bit.ly/2e6sg9
  – Working out a strategy for obscure species
  – Algorithm improvements to detect nomenclatural & taxonomic acts
• New API
http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
http://www.tropicos.org/Name/1200408
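The OpenURL example link above can be assembled programmatically. The following is a minimal Python sketch, assuming only the query parameters visible in that link (`pid`, `volume`, `issue`, `spage`, `date`); the helper name `build_bhl_openurl` is hypothetical, and the OpenURL documentation page remains the reference for the full parameter set.

```python
from urllib.parse import urlencode

# Base endpoint taken from the example link on the slide.
BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"


def build_bhl_openurl(title_id, volume="", issue="", start_page="", year=""):
    """Build a BHL OpenURL query from the parameters seen in the slide's example.

    Only pid, volume, issue, spage and date are used here because those are
    the ones visible in the example link; this helper is a hypothetical
    illustration, not part of any BHL client library.
    """
    params = {
        "pid": f"title:{title_id}",
        "volume": str(volume),
        "issue": str(issue),
        "spage": str(start_page),
        "date": str(year),
    }
    # safe=":" keeps the colon in "title:3934" unescaped, matching the slide.
    return f"{BHL_OPENURL}?{urlencode(params, safe=':')}"


url = build_bhl_openurl(3934, volume=14, start_page=301, year=1879)
print(url)
```

A nomenclator such as Tropicos can store the citation fields it already has (title, volume, start page, year) and generate a link like this on demand instead of storing page-level URLs.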
Services: OpenURL Disambiguation
Looking for:
BHL returns:
Services: OpenURL Results
But where are the articles??
• BHL scans cover to cover for monographs & serials
• Have tested automated markup and article boundary extraction techniques
  – Variety of typefaces & printing techniques make a wholly automated solution close to impossible
• So, when in need, crowdsource…
PDF Generation Stats
No, really, where are the articles?
http://www.citebank.org
http://citebank.org/search
http://citebank.org/node/47423
CiteBank boundaries
[Diagram: Scanned Books; Pageturning UI, PDF, OCR, eBook/Kindle; stored *somewhere* & retrievable via HTTP URI; Citation (repeated); Bibliography; CiteBank]
TOWARDS A GLOBAL BHL
Opportunities
• New technologies
  – BHL-Europe: Scan List
• New use cases & user communities
  – BHL-Europe: Cultural history
• New initiatives
  – Data mining, markup, text correction
• Redundancy, localization
• CONTENT!!
BHL is…
• A unique software tool
  – Built to serve taxonomists’ & other scientists’ research
  – Enhanced by 250+ years of accumulated knowledge
• Complementary to physical libraries
• A shared, global resource
• An unparalleled opportunity for collaboration
Thanks!
Chris Freeland
Technical Director, BHL
Director, Center for Biodiversity Informatics, Missouri Botanical Garden
[email protected]
http://twitter.com/chrisfreeland