BHL: Big Data, Big Challenges
Presented at EOL Semantic Reasoning Workshop, 6-7 Sep 2012, Washington DC.
![Page 1: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/1.jpg)
BHL: Big Data, Big Challenges
Chris Freeland
Founding Technical Director, BHL
Sr. Director, University Academic Computing, Washington University
@chrisfreeland
![Page 2: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/2.jpg)
> 100,000 books, > 39 million pages, > 70 TB data
![Page 3: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/3.jpg)
BHL Architecture: Window Seat Ed.
[Architecture diagram; labels only]
• Layers: Storage, Logic, Access
• Internet Archive: Images (JP2), PDF, Coordinate-based OCR, XML metadata
• BHL DB
• Firewall
• Data Transform Utilities, Geocoding, Name Finding
• APIs, UI, Data Exports
![Page 4: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/4.jpg)
BHL Content: Structured Data
• Metadata from Library catalogues at title/volume level
<titleInfo>
  <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas...</title>
</titleInfo>
<titleInfo type="abbreviated">
  <title>Sp. Pl.</title>
</titleInfo>
<name type="personal">
  <namePart>Linné, Carl von,</namePart>
  <namePart type="date">1707-1778</namePart>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">book</genre>
<originInfo>
  <place>
    <placeTerm type="text">Holmiae :</placeTerm>
  </place>
  <publisher>Impensis Laurentii Salvii,</publisher>
  <dateIssued>1753.</dateIssued>
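A minimal sketch of reading such a record in Python. It assumes the fragment is wrapped in a `<mods>` root with `<originInfo>` closed and no XML namespace, which the slide does not show; the title is shortened here, and `summarize` and `clean` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumption: the slide's fragment wrapped in a root element, with
# <originInfo> closed and no namespace. The title is shortened here.
RECORD = """<mods>
  <titleInfo><title>Caroli Linnaei ... Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <originInfo>
    <place><placeTerm type="text">Holmiae :</placeTerm></place>
    <publisher>Impensis Laurentii Salvii,</publisher>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>"""


def clean(value):
    """Strip the trailing cataloguing punctuation (' :', ',', '.')."""
    return value.strip().rstrip(" .,:;") if value else value


def summarize(xml_text):
    """Pull title-level fields out of a MODS-like record."""
    root = ET.fromstring(xml_text)
    full_title = None
    for title_info in root.findall("titleInfo"):
        if title_info.get("type") is None:  # the untyped entry is the full title
            full_title = title_info.findtext("title")
    return {
        "title": full_title,
        "abbreviated_title": root.findtext("titleInfo[@type='abbreviated']/title"),
        "author": clean(root.findtext("name/namePart")),
        "author_dates": root.findtext("name/namePart[@type='date']"),
        "place": clean(root.findtext("originInfo/place/placeTerm")),
        "publisher": clean(root.findtext("originInfo/publisher")),
        "date_issued": clean(root.findtext("originInfo/dateIssued")),
    }


summary = summarize(RECORD)
for field, value in summary.items():
    print(f"{field}: {value}")
```

The punctuation stripping reflects the catalogue conventions visible in the record ("Holmiae :", "1753."); records exported with a namespace would need namespace-aware paths.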
![Page 5: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/5.jpg)
BHL Content: Unstructured Data
More than 39 million pages!
...of uncorrected OCR
![Page 6: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/6.jpg)
Abbild ungen und Beschreibungen der
Fische Syriens, nebst
einer neuen Classification und Characteristik sämmtlicher Gattungen
der i
JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in
Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
STUTTGART. E. Schweizerbart' sehe Verlagshandlung,
1843.
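One way to put a number on OCR noise like this is a character error rate against a corrected transcription. The sketch below is an illustration only, not a BHL tool: the corrected first line is an assumption based on the evident title wording, and `levenshtein` and `char_error_rate` are hypothetical helper names.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr, truth):
    """Edit distance normalised by the length of the corrected text."""
    return levenshtein(ocr, truth) / len(truth) if truth else 0.0


ocr_line = "Abbild ungen und Beschreibungen der"
# Assumption: hand-corrected reading of the same line.
true_line = "Abbildungen und Beschreibungen der"

distance = levenshtein(ocr_line, true_line)
cer = char_error_rate(ocr_line, true_line)
print(distance, round(cer, 3))
```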
![Page 7: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/7.jpg)
*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �
', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �
r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
![Page 8: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/8.jpg)
How to connect/consume
http://biodivlib.wikispaces.com/Developer+Tools+and+API
• BHL DB: APIs, UI, Data Exports
• Internet Archive: Images (JP2), PDF, Coordinate-based OCR, XML metadata
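A hedged sketch of how a consumer might address the Internet Archive files listed on this slide. The download URL pattern and the file suffixes are assumptions about Internet Archive naming and should be checked against an item's actual file listing; the identifier is a made-up placeholder, and no network request is made.

```python
from urllib.parse import quote

# Assumption: Internet Archive's public download endpoint.
IA_DOWNLOAD = "https://archive.org/download"

# Assumption: typical derivative suffixes for a scanned book.
SUFFIXES = {
    "images_jp2": "_jp2.zip",      # page images (JP2)
    "pdf": ".pdf",                 # PDF
    "ocr_coordinates": "_djvu.xml",  # coordinate-based OCR
    "metadata": "_meta.xml",       # XML metadata
}


def derivative_urls(identifier):
    """Build candidate download URLs for one scanned item."""
    safe = quote(identifier, safe="")
    return {kind: f"{IA_DOWNLOAD}/{safe}/{safe}{suffix}"
            for kind, suffix in SUFFIXES.items()}


# Hypothetical identifier, for illustration only.
urls = derivative_urls("exampleitem00auth")
for kind, url in urls.items():
    print(f"{kind}: {url}")
```

BHL's own APIs and data exports are documented at the Developer Tools and API page linked on the slide.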
![Page 9: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/9.jpg)
BHL Data Challenge: Name Finding
pre-2007
![Page 10: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/10.jpg)
BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
– Improved algorithm, better precision & recall
– More data!
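To illustrate the gap between candidate name strings and verified names, and what precision and recall measure, here is a toy sketch. It is not the TaxonFinder or Global Names algorithm: the regular expression, the three-name lexicon, and the sample sentence are all assumptions made up for this example.

```python
import re

# Toy pattern: a capitalised word followed by a lower-case word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical lexicon standing in for an index of verified names.
KNOWN_NAMES = {"Quercus alba", "Homo sapiens", "Salmo trutta"}


def find_candidates(text):
    """Return every string that merely looks like a binomial."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)]


def verify(candidates, lexicon=KNOWN_NAMES):
    """Keep only the unique candidates present in the lexicon."""
    return sorted({c for c in candidates if c in lexicon})


def precision_recall(found, expected):
    """Precision: share of found names that are right.
    Recall: share of expected names that were found."""
    found, expected = set(found), set(expected)
    hits = len(found & expected)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


page = ("The oak Quercus alba grows near streams where Salmo trutta spawns. "
        "Near Vienna we found none.")

candidates = find_candidates(page)
verified = verify(candidates)
precision, recall = precision_recall(candidates, {"Quercus alba", "Salmo trutta"})
print(candidates)
print(verified)
print(round(precision, 2), recall)
```

Here the pattern alone also picks up "The oak", so the raw candidates have lower precision than the verified list, which mirrors why candidate strings far outnumber verified names.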
![Page 11: BHL: Big Data, Big Challenges](https://reader033.vdocuments.net/reader033/viewer/2022061119/546b47c5af795919088b6e1a/html5/thumbnails/11.jpg)
New Data Challenges
http://biodivlib.wikispaces.com/BHL+and+Gaming
• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
– http://biodivlib.wikispaces.com/Art+of+Life
– Currently funded by NEH
^Challenges framed as games