bitcurator nlp · bitcurator nlp overview andrew w. mellon foundation funded project (oct 2016 –...
TRANSCRIPT
![Page 1: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/1.jpg)
BitCurator NLP
Mining Collections for NEs, Relationships, and Topics to Enrich Access
nlp4arc – February 3, 2017
Kam Woods
Research Scientist / BitCurator NLP Technical Lead
University of North Carolina at Chapel Hill
School of Information and Library Science
![Page 2: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/2.jpg)
BitCurator NLP Overview
Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018)
“The BitCurator NLP project will produce software allowing institutions to extract, analyze, and produce reports about relevant features found in open text within digital materials held in collections. The software will rely on existing NLP libraries to identify and report on those items likely to be relevant to ongoing preservation, information organization, and access activities, including entities (e.g. persons, places, and organizations), potential relationships among entities (for example, by describing those entities that appear together within documents or set of documents), and topic models to provide insight into how concepts are naturally clustered within the documents.”
![Page 3: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/3.jpg)
It often starts the same way…
Source: “Digital Forensics and creation of a narrative.” Da Blog: ULCC Digital Archives Blog. http://dablog.ulcc.ac.uk/2011/07/04/forensics/
![Page 4: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/4.jpg)
Core Approach
Assume (simulate or replicate) a wide range of archival collections
• Raw and forensically packaged disk images
• Heterogeneous collections of files (many file types, limited metadata)
• Use established corpora such as GovDocs1
First steps… extracting text from several dozen extremely common formats (disregard the long tail to begin with)
• No single tool appropriate for this task – use existing wrappers around mature tools
https://textract.readthedocs.io/en/stable/
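The extraction step might be sketched as follows. This is a minimal illustration, assuming textract and its system dependencies are installed; `iter_files` and `extract_text` are hypothetical helper names, not part of the BitCurator NLP tools.

```python
import os
from pathlib import Path


def iter_files(root):
    """Yield every regular file under root, in a stable sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the walk order deterministic
        for name in sorted(filenames):
            yield Path(dirpath) / name


def extract_text(path, extractor=None):
    """Return the text of one file, or None if extraction fails.

    By default this defers to textract.process(), which chooses a parser
    from the file extension and returns bytes. The extractor argument is
    a hypothetical hook for swapping in another tool.
    """
    if extractor is None:
        import textract  # third-party; imported lazily so the helpers load without it
        extractor = textract.process
    try:
        raw = extractor(str(path))
    except Exception:
        # Unsupported or damaged files are expected in mixed collections.
        return None
    return raw.decode("utf-8", errors="replace")
```

Calling `extract_text(p)` for each `p` in `iter_files("indir")` would yield one string (or `None`) per file; skipping failures rather than aborting matches the "disregard the long tail" approach described above.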
![Page 5: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/5.jpg)
Core Approach
Advantages of using a corpus like GovDocs1:
• In many cases, these documents are actual records (publicly available on the web)
• Tests can be easily replicated, assessed by partners
• Partners often won’t (or can’t) give us collection data. Provides additional options for sharing.
Disadvantages:
• Excludes many legacy file types
![Page 6: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/6.jpg)
Core Approach
Use spaCy.io for entity recognition, topic modeling, other tasks…
• Why spaCy?
  • Geared towards product development more than research (e.g. NLTK, openNLP)
  • High-performance (multi-threaded, runs in 64-bit Python stack)
  • Relatively simple API
  • Good pre-trained models for entity and item recognition
  • Integrates easily with machine learning platforms (e.g. TensorFlow, Keras, Scikit-Learn, Gensim)
• Strive for simple stacks
  • In this instance, Python + PIP + textract + spaCy, deploy on any platform
  • Provide flexible APIs but simplify basic use cases: “Text goes in, entity span comes out”
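The "text goes in, entity span comes out" use case could look roughly like this. It is a sketch against the current spaCy API, which may differ from the version used in the project; the model name `en_core_web_sm` and both helper names are assumptions.

```python
def entity_spans(doc):
    """Return (start_char, end_char, label, text) for each entity in a spaCy Doc."""
    return [
        (ent.start_char, ent.end_char, ent.label_, ent.text)
        for ent in doc.ents
    ]


def load_pipeline(model_name="en_core_web_sm"):
    """Load a pre-trained spaCy pipeline (the model name is an assumption)."""
    import spacy  # third-party; imported lazily so entity_spans loads without it
    return spacy.load(model_name)
```

With spaCy installed and the model downloaded (`python -m spacy download en_core_web_sm`), `entity_spans(load_pipeline()("some extracted text"))` returns a list of character-offset spans that other tools can consume without depending on spaCy objects.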
![Page 7: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/7.jpg)
Core Approach
Why do it locally at all? (Why not GCloud Language API?)
• Pricing structure is modest but could be prohibitive for institutions working with large collections
• All results in JSON
• Many institutions restricted from running collections through this kind of workflow
https://cloud.google.com/natural-language/
![Page 8: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/8.jpg)
Generating entity views for the web
Original text (shown here – clip of PDF)
Web rendering (autogen’d HTML + CSS)
<div class="entities"><mark data-entity="org">The Vilem Flusser Archive</mark> owns a personal computer associated<br>
with the production of a software titled “Flusser Hypertext”.<br>
This computer contains a rare working copy of<br>
the software which is dependent on the obsolete authoring<br>
system <mark data-entity="product">HyperCard</mark>. The disk image has been acquired6 from<br>
an <mark data-entity="org">Apple</mark> <mark data-entity="product">Mac</mark> Performa 630 containing a 270 <mark data-entity="org">MB</mark> IDE disk. The goal was to enable web-based access to the Flusser Hypertext through the archive’s website.</div>
.entities { line-height: 2; }
[data-entity] { padding: 0.25em 0.35em; margin: 0px 0.25em; line-height: 1; display: inline-block; border-radius: 0.25em; border: 1px solid; }
[data-entity]::after { box-sizing: border-box; content: attr(data-entity); font-size: 0.6em; line-height: 1; padding: 0.35em; border-radius: 0.35em; text-transform: uppercase; display: inline-block; vertical-align: middle; margin: 0px 0px 0.1rem 0.5rem; }
[data-entity][data-entity="person"] { background: rgba(166, 226, 45, 0.2); border-color: rgb(166, 226, 45); }
Text extraction: textract
Entity ident: spaCy
Web display: displaCy API
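Markup of this shape can be generated from character-offset entity spans with a few lines of standard-library Python. The sketch below is a stand-in for illustration, not the displaCy implementation; `render_entities` is a hypothetical name, and spans are assumed not to overlap.

```python
from html import escape


def render_entities(text, spans):
    """Wrap each (start, end, label) span of text in a <mark data-entity> tag.

    Spans are character offsets into text and are assumed not to overlap.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Plain text between the previous entity and this one.
        parts.append(escape(text[cursor:start]))
        parts.append(
            '<mark data-entity="{}">{}</mark>'.format(
                escape(label.lower()), escape(text[start:end])
            )
        )
        cursor = end
    # Trailing text after the last entity.
    parts.append(escape(text[cursor:]))
    return '<div class="entities">{}</div>'.format("".join(parts))
```

Lower-casing the label matches the `data-entity="org"` attribute values shown above, so the same CSS selectors apply to the generated output.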
![Page 9: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/9.jpg)
![Page 10: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/10.jpg)
Not always as clean as we’d like…
![Page 11: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/11.jpg)
Not always as clean as we’d like…
Hmmmmm……
![Page 12: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/12.jpg)
| Entity type | Description |
| --- | --- |
| PERSON | People |
| NORP | Nationalities, religious, and political groups. |
| FACILITY | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, and institutions. |
| GPE | Countries, cities, and states. |
| LOC | Locations other than GPE (e.g. mountain ranges, bodies of water) |
| PRODUCT | Objects other than services (e.g. devices, foods) |
| EVENT | Historical events (e.g. cultural, weather, conflicts) |
| WORK_OF_ART | Titles of works of art |
| LANGUAGE | Named languages |

| Additional feature types | Description |
| --- | --- |
| DATE | Dates or periods (absolute / relative) |
| TIME | Time periods less than a day |
| PERCENT | Percentages (also marked by ‘%’) |
| MONEY | Monetary values, including by unit |
| QUANTITY | Weight, distance, other measurements |
| ORDINAL | E.g. ‘first’, ‘second’ |
| CARDINAL | Numeral identifiers other than those typed above |
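For access-oriented reports it may help to separate the named entity types from the numeric feature types. The following is a sketch under that assumption: the label sets are copied from the tables above (other spaCy versions may use different label names), and `split_by_kind` is a hypothetical helper.

```python
# Label names copied from the tables above; treat them as version-specific.
NAMED_ENTITY_TYPES = {
    "PERSON", "NORP", "FACILITY", "ORG", "GPE",
    "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LANGUAGE",
}
ADDITIONAL_FEATURE_TYPES = {
    "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
}


def split_by_kind(entities):
    """Partition (text, label) pairs into named, additional, and unknown lists."""
    named, additional, unknown = [], [], []
    for text, label in entities:
        if label in NAMED_ENTITY_TYPES:
            named.append((text, label))
        elif label in ADDITIONAL_FEATURE_TYPES:
            additional.append((text, label))
        else:
            # Labels not listed in either table are kept, not silently dropped.
            unknown.append((text, label))
    return named, additional, unknown
```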
![Page 13: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/13.jpg)
We expect this code to be deployed in real-world institutions – performance is a consideration.
Baseline test on a circa-2014 Core i7 ThinkPad:
• 1336 files (approx. 1 GB)
• Text extraction via textacy -> entity extraction via spaCy
• 52 minutes (including OCR of image formats)
real 51m55.043s
user 46m25.511s
sys 1m37.768s
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy$ ls indir | wc
27 27 308
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy$ ls indir
000000.swf  000004.doc  000007.doc  gif_files  new_infile.pdf  wp_files
000001.doc  000004.doc.span  000008.ppt  html_files  pdffiles000  xls_files
000002.doc  000005.doc  000009.pdf  infile.txt  ppt_files
000002.doc.span  000005.doc.span  csv_files  infile.txt.span  ps_files
000003.doc  000006.doc  dir1  jpg_files  txtfiles000
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy$ cd indir
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls gif_files | wc
46 46 621
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls html_files | wc
362 362 5249
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls wp_files | wc
2 2 20
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls pdffiles000 | wc
200 200 2200
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls xls_files | wc
124 124 1674
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls ppt_files | wc
88 88 968
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls ps_files | wc
30 30 370
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls txtfiles000 | wc
283 283 3758
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls jpg_files | wc
178 178 2403
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls csv_files | wc
21 21 281
(venv) sunitha@sm-T440s:~/BC/NLP/displaCy/indir$ ls dir1 | wc
2 2 14
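The figures in this log can be cross-checked with a little arithmetic. The sketch below only restates numbers shown above (the per-directory counts and the `time` output); the variable and function names are illustrative.

```python
# Per-directory file counts copied from the `ls <dir> | wc` output above.
FILE_COUNTS = {
    "gif_files": 46, "html_files": 362, "wp_files": 2, "pdffiles000": 200,
    "xls_files": 124, "ppt_files": 88, "ps_files": 30, "txtfiles000": 283,
    "jpg_files": 178, "csv_files": 21, "dir1": 2,
}


def parse_time(value):
    """Convert a `time`-style duration such as '51m55.043s' to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


total_files = sum(FILE_COUNTS.values())          # 1336, matching the slide
wall_seconds = parse_time("51m55.043s")          # "real" time
cpu_seconds = parse_time("46m25.511s") + parse_time("1m37.768s")  # user + sys

seconds_per_file = wall_seconds / total_files    # about 2.33 s per file
files_per_minute = total_files / (wall_seconds / 60)
cpu_fraction = cpu_seconds / wall_seconds        # share of wall time spent on CPU
```

The subdirectory counts sum to the 1336 files quoted on the slide, and the run averages a little over two seconds per file, OCR included.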
![Page 14: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/14.jpg)
For a sample set of several hundred files from the GovDocs corpus, in clockwise order from top left:
• Entity types
• Persons
• Organizations
• Geopolitical entities
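The counts behind summaries like these could be tallied as follows. This is a minimal sketch using the standard library, assuming entities have already been collected as (text, label) pairs; `tally` is a hypothetical helper, not the project's code.

```python
from collections import Counter


def tally(entities, top_n=10):
    """Count entities per type and find the most frequent strings per type.

    entities is an iterable of (text, label) pairs.
    Returns (type_counts, top), where top maps each label to a list of
    (text, count) pairs in descending order of frequency.
    """
    entities = list(entities)
    type_counts = Counter(label for _, label in entities)
    by_type = {}
    for text, label in entities:
        by_type.setdefault(label, Counter())[text.strip()] += 1
    top = {label: counts.most_common(top_n) for label, counts in by_type.items()}
    return type_counts, top
```

For example, `tally(pairs)[1].get("PERSON", [])` would give the most frequent person strings, and the same lookup with `"ORG"` or `"GPE"` covers organizations and geopolitical entities.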
![Page 15: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/15.jpg)
![Page 16: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/16.jpg)
For a sample set of several hundred files from the GovDocs corpus, in clockwise order from top left:
• Entity types
• Persons
• Organizations
• Geopolitical entities
Defaults may be noisy / inaccurate
![Page 17: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/17.jpg)
Development and Infrastructure Notes
• BitCurator team keeps in-development software on GitHub
  • https://bitcurator.github.io
  • https://github.com/bitcurator/bitcurator-nlp-tools
• Development and project documentation posted to wiki
  • https://wiki.bitcurator.net/
• In-house development servers:
  • azalea.ils.unc.edu (large)
  • dogwood.ils.unc.edu (small)
• We often have publicly-available deployments of the tools available on at least one machine…
![Page 18: BitCurator NLP · BitCurator NLP Overview Andrew W. Mellon Foundation funded project (Oct 2016 – Oct 2018) “The BitCurator NLP project will produce software allowing institutions](https://reader033.vdocuments.net/reader033/viewer/2022060413/5f115d6fd192501e512e0766/html5/thumbnails/18.jpg)
Questions?
https://wiki.bitcurator.net/ https://bitcurator.github.io/