a global commons for scientific data: molecules and wikidata
A global Commons for scientific Data
Peter Murray-Rust,
Dept of Chemistry, University of Cambridge and ContentMine
At Molecular Engineering, Cavendish, Cambridge, UK, 2016-11-07
contentmine.org is supported by a grant to PMR as a
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
Some topics
• Content in scientific publications
• Extracting data from text and tables
• Dictionaries
• Extracting data from images
• Extracting data from computational logs and theses
• Wikidata
Everything is open (CC-BY). Please steal and re-use
Wikidata demos
• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md
• https://tools.wmflabs.org/wikishootme/#lat=52.204082366142&lng=0.11190176010131837&zoom=16&layers=wikidata_image,wikidata_no_image&sparql_filter=%3Fq%20wdt%3AP1435%20wd%3AQ15700834 (Listed buildings)
Output of scholarly publishing
586,364 Crossref DOIs per month (201507) [1]
2.5 million (papers + supplemental data) per year [citation needed]*
Each 3 mm thick: 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
1 year’s scholarly output!
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
Most Publishers destroy structured information (LaTeX, Word) into PDF …
• Characters (NOT words or higher structure): “WORD” is simply 4 characters, no space chars
• Paths (NOT circles, squares …): “Vectors”
… APIs then destroy it further into Pixels (e.g. PNG or JPG )
Content Mine will read 10,000 PNGs a day and try to recover the science.
Chemical Text
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
ChemDataExtractor
• http://chemdataextractor.org/docs/intro
• http://chemdataextractor.org/demo
Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00207
ChemDataExtractor
Search for “Zika” in EuropePMC and Wikidata
• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md#contentmine-demos (list of demos)
• https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html (datatables extracted - disease, gene, species, etc.)
• Lars Willighagen (NL) and Tom Arrow. visualisation of single facts and groups from Corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136
• https://contentmine-demo.herokuapp.com/cooccurrences Co-occurrence of diseases (suggest selecting 25 and disease).
Dictionaries!
[Diagram: MINING with sections and dictionaries. Paper sections (abstract, methods, references, captioned figures such as Fig. 1, HTML tables, image captions, table captions) are searched with dictionaries (Dict A, Dict B).]
[W3C Annotation / https://hypothes.is/ ]
Disease Dictionary (ICD-10)
<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="1q21.1 deletion syndrome"/>
  <entry term="1q21.1 duplication syndrome"/>
  <entry term="3-methylglutaconic aciduria"/>
  <entry term="3mc syndrome"/>
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>
</dictionary>
SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136): abnormal condition that affects the body of an organism
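One way the query above could feed a dictionary like the disease example is sketched here. This is only an illustration, not the ContentMine implementation: the function names, the User-Agent string and the choice to lower-case and sort the terms are assumptions. `fetch_labels` needs network access and is not called in the sketch itself.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public Wikidata SPARQL endpoint
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the slide, unchanged
DISEASE_QUERY = """
SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""


def fetch_labels(query, endpoint=WDQS_ENDPOINT):
    """Run a SPARQL query (needs network access) and return the ?thingLabel values."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; the endpoint asks clients to identify themselves
    request = urllib.request.Request(url, headers={"User-Agent": "dictionary-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["thingLabel"]["value"] for row in data["results"]["bindings"]]


def build_dictionary(title, labels):
    """Build a <dictionary> element with one <entry term="..."/> per distinct label.

    Lower-casing and sorting are assumptions made to mirror the slide's example.
    """
    root = ET.Element("dictionary", {"title": title})
    for term in sorted({label.lower() for label in labels}):
        ET.SubElement(root, "entry", {"term": term})
    return root


# Offline usage with two labels taken from the slide's dictionary
sample = build_dictionary("disease", ["Cortical blindness", "Corpus luteum cyst"])
print(ET.tostring(sample, encoding="unicode"))
```

Replacing the sample list with `fetch_labels(DISEASE_QUERY)` would build the dictionary from live Wikidata results.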
Wikidata ontology for disease
[Diagram: ContentMine pipeline]
• Sources: EuPMC, arXiv, CORE, HAL, (university repositories); ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• catalogue, query (getpapers), daily crawl (quickscrape), using community scrapers and queries
• Inputs: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma (normalizer, structurer, semantic tagger): text, data and figures become Semantic ScholarlyHTML (W3C community group) with sections (abstract, methods, references, captioned figures, HTML tables); 100,000 pages/day
• ami (search, lookup; CONTENT MINING) with community plugins and taggers: Chem, Phylo, Trials, Crystal, Plants
• Output: facts; visualization and analysis
Latest 20150908
Amanuens.is demo
These slides represent a snapshot of an interactive demo…
Subject: Flavour
What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone
WIKIDATA
Carvone in Wikidata
Also SPARQL endpoint
Search for carvone
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
Search “carvone”, output to carvone/, fmt XML, limit 100 hits
• cmine carvone
Normalize papers; search locally for species, sequences, diseases, drugs. Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
Send IUCN redlist plant annotations -> hypothes.is
Annotation (entity in context): prefix, surface, label, location, suffix
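As a rough illustration of what an "entity in context" annotation might contain, the sketch below searches a text for dictionary terms and records the five fields named on the slide. It is not the ami code: the function name, the 20-character context window and the whole-word, case-insensitive matching are all assumptions.

```python
import re


def find_entities(text, terms, label, context=20):
    """Return one annotation per dictionary hit, ordered by position in the text.

    Each annotation keeps the matched surface form, its character offset
    (location), and up to `context` characters before (prefix) and after (suffix).
    """
    annotations = []
    for term in terms:
        # Whole-word, case-insensitive match of the dictionary term
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start, end = match.span()
            annotations.append({
                "label": label,
                "surface": match.group(0),
                "location": start,
                "prefix": text[max(0, start - context):start],
                "suffix": text[end:end + context],
            })
    return sorted(annotations, key=lambda annotation: annotation["location"])


# Made-up sentence and a one-term "species" dictionary
sentence = "Carvone is found in Mentha spicata and in caraway."
for hit in find_entities(sentence, ["Mentha spicata"], label="species"):
    print(hit)
```

Keeping the prefix and suffix alongside the surface form lets an annotation be re-anchored in the text even if character offsets shift.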
[Screenshot: ARTICLES and FACETS (gene, disease, drug, phytochem, species, genus, words) for remote & local papers; columns: disease (ICD-10), phytochemicals, species, commonest words]
Annotation sent to hypothes.is
prefix, suffix, source, user, text, uri
maybe 100+ annotations per paper
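The fields listed for the hypothes.is annotation map naturally onto a W3C Web Annotation with a text-quote selector. The sketch below builds such a JSON body without sending it. The layout reflects my understanding of the hypothes.is annotation format and is not taken from cmhypy.py; the function name and the example URI are hypothetical.

```python
import json


def to_hypothesis_payload(uri, exact, prefix, suffix, text, tags=()):
    """Build the JSON body for one annotation anchored by a quoted text span."""
    return {
        "uri": uri,            # page being annotated
        "text": text,          # body of the annotation shown to readers
        "tags": list(tags),
        "target": [{
            "source": uri,
            "selector": [{
                "type": "TextQuoteSelector",
                "exact": exact,    # the surface form found in the paper
                "prefix": prefix,  # characters immediately before it
                "suffix": suffix,  # characters immediately after it
            }],
        }],
    }


# Hypothetical paper URI and a made-up hit
payload = to_hypothesis_payload(
    uri="https://example.org/paper.html",
    exact="Mentha spicata",
    prefix="Carvone is found in ",
    suffix=" and in caraway.",
    text="species",
    tags=["species"],
)
print(json.dumps(payload, indent=2))
```

Sending the payload would additionally need the user's API key, which corresponds to the `<key>` argument in the command-line example.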
Wikidata: monoclinic systems with mass < 200
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D
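The query is hard to read in its percent-encoded form. A small sketch that decodes the URL fragment back into SPARQL is given below; the function name is an assumption, and the URL is copied from the slide. Judging by the slide title, P556 with Q624543 presumably selects the monoclinic crystal system and P2067 the mass.

```python
from urllib.parse import unquote

# URL from the slide, split across lines only for readability
QUERY_URL = (
    "https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A"
    "%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A"
    "%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A"
    "%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A"
    "%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A"
    "%7D"
)


def sparql_from_url(url):
    """Return the SPARQL text stored after '#' in a query.wikidata.org link."""
    fragment = url.split("#", 1)[1]
    return unquote(fragment)


print(sparql_from_url(QUERY_URL))
```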
Image mining
PMR is collaborating with the European Bioinformatics Institute to liberate metabolic information from journals
Chemistry in Patents
Obfuscation?
Chemical Computer Vision
Raw mobile photo; problems: shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1)
Irregular edges
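Binarization can be illustrated with the simplest possible rule, a single global threshold. This is only a sketch and not the method used by the ContentMine image tools: the function name and the mean-value default are assumptions, and a photo with shadows would normally need an adaptive (local) threshold instead.

```python
def binarize(gray, threshold=None):
    """Map a 2-D list of grey levels (0-255) to 0/1 pixels.

    Pixels darker than the threshold (ink) become 1, lighter ones (paper) 0.
    When no threshold is given, the mean grey level of the image is used.
    """
    if threshold is None:
        pixels = [value for row in gray for value in row]
        threshold = sum(pixels) / len(pixels)
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Tiny made-up image: two dark pixels on a light background
image = [
    [250, 240, 10],
    [245, 20, 235],
]
print(binarize(image))
```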
[Plot: Ln bacterial load per fly (y axis) against days post-infection (x axis, 0 to 5)]
Bitmap Image and Tesseract OCR
http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014
Ross Mounce (Bath), Panton Fellow
• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:
Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
4300 images
Note Jaggy and broken pixels
NEW Bacteria must have a phylogenetic tree
[Tree figure labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate]
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
IJSEM phylotrees
• International Journal of Systematic and Evolutionary Microbiology
• All new microorganisms are expected to be published there
• Consistent (though primitive) approach to trees
“Root”
OCR (Tesseract)
Norma (imageanalysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
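To show why a Newick string counts as computable output, here is a minimal recursive parser that turns one into nested dictionaries. It is a sketch of the format only, not Norma's serializer: the function names and the dictionary layout are assumptions, and it handles just names and branch lengths (no quoted labels or comments). The example is a three-leaf subtree taken from the string above.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with 'name', 'length' and 'children'."""
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip "," between siblings
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional node name (empty for unnamed internal nodes)
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":"
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_names(node):
    """Collect the names of all leaves below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


subtree = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131;"
print(leaf_names(parse_newick(subtree)))
```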
Automatic Open Notebook of computations
Everything is posted to Github before being analyzed
Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [Identifier in Wikidata]
• Missing = not found with Wikidata API
20 commonest organisms (in > 30 papers) in trees from IJSEM*
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.
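A lookup of organism names against Wikidata could look roughly like the sketch below, which uses the MediaWiki `wbsearchentities` action. This is not the code behind the list above: the function names and User-Agent are assumptions, taking the first hit is a simplification, and the identifier in the canned response is a placeholder. `lookup` needs network access and is not called here.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(name, language="en"):
    """Build a wbsearchentities request URL for a name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def first_match_id(response):
    """Return the id of the first search hit, or None when nothing was found."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


def lookup(name):
    """Query the live API (needs network access) and return an id or None."""
    # Hypothetical User-Agent string
    request = urllib.request.Request(search_url(name), headers={"User-Agent": "lookup-sketch/0.1"})
    with urllib.request.urlopen(request) as reply:
        return first_match_id(json.load(reply))


# Offline demonstration with canned responses; "Q0" is a placeholder id
print(search_url("Escherichia coli"))
print(first_match_id({"search": [{"id": "Q0", "label": "placeholder"}]}))
print(first_match_id({"search": []}))
```

A name reported as missing by this kind of lookup may still exist in Wikidata under a synonym or a different spelling, so an empty result is only a first indication.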
Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),
((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex
Supertree for 924 species
Tree
Supertree created from 4300 papers
Plots
To be extracted:
• Symbol (x, y)
• Error bar (y+, y-)
• Line
Y axis
• Extent
Typical PDF with vectors - hyperlink
But we can now turn PDFs into Science
We can’t turn a hamburger into a cow
Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!! 2000+ points
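Once ticks and their numeric labels have been recognised, recovering data points is a matter of interpolating between two known ticks on each axis. The sketch below assumes linear axes; the function name and all pixel positions are made up for illustration.

```python
def axis_mapping(pixel_a, value_a, pixel_b, value_b):
    """Return a function converting a pixel coordinate to a data value.

    The mapping is fixed by two labelled ticks on a linear axis:
    (pixel_a, value_a) and (pixel_b, value_b).
    """
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical tick positions: x ticks "0" at 100 px and "5" at 600 px;
# y ticks "6.0" at 400 px and "11.5" at 50 px (pixel rows grow downwards)
to_x = axis_mapping(100, 0.0, 600, 5.0)
to_y = axis_mapping(400, 6.0, 50, 11.5)

# A plotted symbol found at pixel (350, 225)
print(to_x(350), to_y(225))
```

Because image rows count downwards, the y mapping has a negative scale; using two labelled ticks handles this without special cases.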
VECTOR PDF
Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
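The two processing steps named here, Gaussian smoothing and the second derivative, can be sketched in plain Python as below. This illustrates the general technique and is not the AMI implementation: the function names, the 3-sigma kernel radius, the clamped edges and the unit point spacing are assumptions. A peak in the smoothed spectrum shows up as a strongly negative second derivative.

```python
import math


def gaussian_kernel(sigma):
    """Normalised Gaussian weights covering roughly three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def smooth(values, sigma=1.0):
    """Convolve with a Gaussian; indices beyond the ends are clamped to the edge value."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(kernel[j + radius] * values[min(max(i + j, 0), last)]
            for j in range(-radius, radius + 1))
        for i in range(len(values))
    ]


def second_derivative(values):
    """Central finite difference with unit spacing; the two end points are set to 0."""
    inner = [values[i - 1] - 2 * values[i] + values[i + 1] for i in range(1, len(values) - 1)]
    return [0.0] + inner + [0.0]


# Made-up spectrum with a single peak at index 5
spectrum = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0, 0]
curvature = second_derivative(smooth(spectrum, sigma=1.0))
peak_index = min(range(len(curvature)), key=curvature.__getitem__)
print(peak_index)
```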
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
After AMI2 processing…
… AMI2 has detected a square
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
Precision + Recall for ImageAnalysis?
• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
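For reference, precision and recall for an extraction run can be computed against a hand-checked list as in the sketch below. The function name and the example names are made up; the figures quoted on the slide are not reproduced by this code.

```python
def precision_recall(extracted, expected):
    """Compare extracted items with a hand-checked reference set.

    precision = correct extracted items / all extracted items
    recall    = correct extracted items / all expected items
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: four names extracted, three names in the reference list
found = ["Bacillus subtilis", "Escherichia coli", "Formosa algae", "Bacilus subtlis"]
truth = ["Bacillus subtilis", "Escherichia coli", "Terrabacter tumescens"]
print(precision_recall(found, truth))
```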
Software Availability and collaboration
• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• collaboration with PDFBox, TabulaPDF, JailbreakingThePDF
• Extracted data CC 0
(2x digital music industry!)
Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer for anyone interested.
contentmine.org