Lichens, Bryophytes and Climate Change
Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin
Goals and Scope
16 digitization centers
> 60 non-governmental US herbaria (95%)
Mexico, US, Canada
~ 2.3 million specimens
90% of all specimens
900,000 lichens
1.4 million bryophytes
Project Information
http://lbcc.limnology.wisc.edu/
Digitization Workflow
National Portals
Lichen Consortium: http://lichenportal.org
   Started in 2009
   24 Collections
   ~ 797,916 Records
Bryophyte Consortium: http://bryophyteportal/
   Started in 2010
   16 Collections
   1,059,063 Records
Imaging Stage
Capture Image: barcode in file name
Create Skeleton File: barcode, species name, exsiccati, etc.
Upload to FTP server
Image processing: extract barcode, create web versions, map to portal DBs
Duplicate Harvesting
Existing Herbarium Database
Automated Processing: OCR / NLP / Georeferencing (augmented with raw OCR, parsed fields, coordinates, etc.)
Existing Record: simply link image
Upload to FTP server
Image URLs
Manage Specimen Data in Portal
Manage / Review Records in Portal
Symbiota Editor: review, edit, keystroke, and finalize
Create New Record: barcode, image, skeletal data
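The "barcode in file name" convention from the imaging stage can be sketched as follows. The slides do not give the actual naming scheme, so the pattern (a collection code in letters followed by digits) and the function name are assumptions for illustration only.

```python
import re
from pathlib import PurePath
from typing import Optional

# Assumed pattern: collection code (letters) followed by digits, at the
# start of the file name. The real LBCC naming scheme is not given.
BARCODE_RE = re.compile(r"^([A-Za-z]+\d+)")


def barcode_from_filename(path: str) -> Optional[str]:
    """Return the barcode prefix of an image file name, or None if absent."""
    stem = PurePath(path).stem  # file name without directory or extension
    match = BARCODE_RE.match(stem)
    return match.group(1).upper() if match else None
```

A file without a recognizable barcode returns None, so it can be set aside for manual handling rather than mapped to the wrong record.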
LBCC: Workflow Overview
Image all specimens / specimen labels
Collect and load skeletal data
   Barcode, scientific name, country, state
Upload to portal
   Record exists => link image to existing record
   Record absent => create empty “unprocessed” record
Automated OCR of label
   Block of raw text => database
Automated NLP (field parsing)
Review data
   Keystroke full record
   Collector name & number => look for dups
   Reparse full record => learnable parsers
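The upload rule in the overview (link the image when a record exists, otherwise create an empty "unprocessed" record) might look like this minimal sketch. The Record class, its field names, and the in-memory dictionary standing in for the portal database are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for a portal specimen record."""
    barcode: str
    status: str = "unprocessed"
    image_urls: List[str] = field(default_factory=list)


def attach_image(records: Dict[str, Record], barcode: str, url: str) -> Record:
    """Link an image to the record for `barcode`, creating the record if absent."""
    record = records.get(barcode)
    if record is None:
        # Record absent: create an empty "unprocessed" record first.
        record = Record(barcode=barcode)
        records[barcode] = record
    # In both cases the image URL ends up linked to the record.
    record.image_urls.append(url)
    return record
```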
Optical Character Recognition
Tesseract V3
Dual cycle
   Automatic
   Manual review
Expected hurdles
   Handwritten labels
   Old fonts
   Faded labels
   Form labels
Adjustable image variables
¢_].L.|»‘¢ .'».f.'._..‘~,(.Jfin-x‘*\'a:"511z:1 wf .~\:'i/.onli State UniversityP.’~.r"~2= ,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12�‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESXZ»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“» »4 xx, ,"""‘“â€T"’ <1;-.rs f3'a,1.z>.t;;a¢f~rus ’�V4 J 'if . r°'° M '1?nies ivain.) Sav.neutal Station - " '1 ~»r';;4-\P ` 1.T11 ./P.. ,J ..-.ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE_ ,. W5. (> f- , -:‘; i f>i_T ~~ . A 1:». v\ .-v »~. 4. a xvala 8/27/73
PLANTS OF NEW r~1ExIco
Herbarium of Arizona State University
Parmelia ulophyllodes (Vain.) Sav.
COUNTY “°â€â€œâ€œ �Joranada Experimental Station -
New Mexico State University"“““' on Juniperus
ELEV. ‘ 4400
EEILLEETUR DATEDU T. H. Nash #7914 8/27/73
T. H. N.
Auto-Processing: OCR
1. Iterate through new “unprocessed” images
   1. 81439 bryophyte images
   2. 147122 lichen images
2. OCR via Tesseract (version 3)
   a) Untreated image
   b) Treated image (contrast, brightness, etc.)
3. Store raw text linked to skeletal record
4. Progress to next step
   1. Low OCR return => hand processing
   2. “Unprocessed-OCR” => NLP
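The two OCR passes and the routing step can be sketched as below. The OCR engine and the image treatment are passed in as callables, so the sketch does not depend on a particular Tesseract binding (a wrapper around something like pytesseract.image_to_string could be supplied). The word-count measure of "OCR return" and the min_words threshold are assumptions; the slides do not say how a low return is detected.

```python
import re
from typing import Any, Callable, Tuple

# Assumed yield measure: alphabetic tokens of three or more letters.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def ocr_return(text: str) -> int:
    """Crude estimate of how much usable text an OCR pass produced."""
    return len(WORD_RE.findall(text))


def dual_pass_ocr(
    image: Any,
    ocr: Callable[[Any], str],
    treat: Callable[[Any], Any],
    min_words: int = 5,
) -> Tuple[str, str]:
    """OCR the untreated and treated image; keep the better text and route it."""
    raw_text = ocr(image)             # pass a) untreated image
    treated_text = ocr(treat(image))  # pass b) contrast, brightness, etc.
    best = max(raw_text, treated_text, key=ocr_return)
    if ocr_return(best) >= min_words:
        status = "unprocessed-OCR"    # goes on to NLP
    else:
        status = "hand processing"    # low OCR return
    return best, status
```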
Auto-Processing: NLP
1. Iterate through raw OCR text blocks
   a) 147122 lichen OCR blocks
   b) 81439 bryophyte OCR blocks
2. Collector, number, and date
   a) Attempt duplicate harvesting
3. Field-by-field parsing
4. Full-parsing
5. Parsing based on NLP profiles
   1. E.g. targeted label formats
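As an illustration of step 2, the sketch below pulls collector last name, number, and date out of a raw OCR block. The pattern is modeled only on the sample label text "T. H. Nash #7914 8/27/73" and is an assumption; real labels vary widely and the project's parsers are not described in the slides.

```python
import re
from typing import Dict, Optional

# Assumed layout: initials, last name, "#" + number, then a numeric date.
COLLECTOR_RE = re.compile(
    r"(?:[A-Z]\.\s*)+"                        # initials, e.g. "T. H. "
    r"(?P<last>[A-Z][a-z]+)\s*"               # last name
    r"#\s*(?P<number>\d+)\s+"                 # collector number
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})"      # date such as 8/27/73
)


def parse_collector(raw_ocr: str) -> Optional[Dict[str, str]]:
    """Return last name, number, and date if the assumed pattern is found."""
    match = COLLECTOR_RE.search(raw_ocr)
    if match is None:
        return None
    return {
        "last_name": match.group("last"),
        "number": match.group("number"),
        "date": match.group("date"),
    }
```

The three extracted values are the keys the duplicate-harvesting step would need for its lookup.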
NLP: Duplicate Harvesting
1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. Compare return field-by-field
4. Compare fields with raw OCR
5. Populate fields that have high similarity indexes
6. Processing status: “pending review”
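Steps 4 and 5 can be illustrated with a simple similarity index. This sketch uses Python's difflib.SequenceMatcher over a sliding window of the OCR text; the slides do not name the similarity measure actually used, and the 0.8 threshold and function names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict


def field_similarity(value: str, raw_ocr: str) -> float:
    """Best match ratio of `value` against same-length windows of the OCR text."""
    value, raw = value.lower(), raw_ocr.lower()
    if not value:
        return 0.0
    n = len(value)
    # If the OCR text is shorter than the value, compare against all of it.
    starts = range(max(1, len(raw) - n + 1))
    return max(SequenceMatcher(None, value, raw[i:i + n]).ratio() for i in starts)


def populate_fields(
    duplicate: Dict[str, str], raw_ocr: str, threshold: float = 0.8
) -> Dict[str, str]:
    """Keep only the duplicate's fields that are well supported by the raw OCR."""
    return {
        name: value
        for name, value in duplicate.items()
        if field_similarity(value, raw_ocr) >= threshold
    }
```

Fields below the threshold are left empty, which fits the final "pending review" status: a person still confirms the record.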
NLP: Targeted Parsing Profiles
1. Premise: Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Need to exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
4. Test for similarity to target label format
5. Targeted parsing algorithms
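Steps 2 and 3 can be sketched as a filter that looks for "Nash" in the raw OCR and rejects the three excluded contexts. The cue patterns below (a "det." prefix, a preceding "with", "and" or "&", and a leading genus-plus-epithet pair) are rough assumptions for illustration, not the project's actual rules, and they will misjudge some labels.

```python
import re

NAME_RE = re.compile(r"\bNash\b")
# Cue a) determiner: "Det.", "det. by", "Determined by" somewhere before the name.
DET_RE = re.compile(r"\bdet(?:ermined)?\b\.?(?:\s+by)?", re.IGNORECASE)
# Cue c) associated collector: "with", "and" or "&" directly before the name.
ASSOC_RE = re.compile(r"(?:\bwith\b|\band\b|&)\s*(?:[A-Z]\.\s*)*$", re.IGNORECASE)
# Cue b) author of scientific name: line opens with "Genus epithet".
BINOMIAL_RE = re.compile(r"^\s*[A-Z][a-z]+\s+[a-z]{3,}\b")


def is_nash_collector_label(raw_ocr: str) -> bool:
    """True if some line names Nash outside the three excluded contexts."""
    for line in raw_ocr.splitlines():
        for match in NAME_RE.finditer(line):
            before = line[:match.start()]
            if DET_RE.search(before):
                continue  # determined by Nash
            if ASSOC_RE.search(before):
                continue  # associated collector
            if BINOMIAL_RE.match(before):
                continue  # author of a scientific name
            return True
    return False
```

Labels that pass this filter would still go through the similarity test in step 4 before a targeted parser is applied.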
Label Review