A global commons for scientific data: molecules and Wikidata

A global Commons for scientific Data

Peter Murray-Rust,

Dept of Chemistry, University of Cambridge and ContentMine

At Molecular Engineering, Cavendish, Cambridge, UK, 2016-11-07

contentmine.org is supported by a grant to PMR as a

http://contentmine.org/


The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Some topics

• Content in scientific publications
• Extracting data from text and tables
• Dictionaries
• Extracting data from images
• Extracting data from computational logs and theses
• Wikidata

Everything is open (CC-BY). Please steal and re-use

Wikidata demos
• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md
• https://tools.wmflabs.org/wikishootme/#lat=52.204082366142&lng=0.11190176010131837&zoom=16&layers=wikidata_image,wikidata_no_image&sparql_filter=%3Fq%20wdt%3AP1435%20wd%3AQ15700834 (Listed buildings)





Output of scholarly publishing

586,364 Crossref DOIs per month (201507) [1]
2.5 million (papers + supplemental data) per year [citation needed]*
Each 3 mm thick: 4500 m high per year [2]

* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

1 year’s scholarly output!


What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these


Most Publishers destroy structured information (LaTeX, Word) into PDF …

• Characters (NOT words or higher structure): “WORD” is simply 4 characters, with no space chars
• Paths (NOT circles, squares …): “Vectors”

… APIs then destroy it further into Pixels (e.g. PNG or JPG )

ContentMine will read 10,000 PNGs a day and try to recover the science.
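Rebuilding words from a PDF that stores only positioned characters can be sketched as below. This is a minimal illustration, not the algorithm ContentMine actually uses: the glyph layout `(char, x, width)` and the `gap_factor` threshold are assumptions made for the example.

```python
def glyphs_to_words(glyphs, gap_factor=0.3):
    """Group positioned characters on one text line into words.

    glyphs: list of (char, x, width) tuples (hypothetical layout).
    A new word starts when the horizontal gap to the previous glyph
    exceeds gap_factor times the previous glyph's width.
    """
    words = []
    current = ""
    prev_end = None
    prev_width = None
    # Sort left to right, since a PDF may emit glyphs in any order
    for char, x, width in sorted(glyphs, key=lambda g: g[1]):
        if prev_end is not None and (x - prev_end) > gap_factor * prev_width:
            words.append(current)
            current = ""
        current += char
        prev_end = x + width
        prev_width = width
    if current:
        words.append(current)
    return words


# Six characters with no space characters between them, only positions
line = [("W", 0, 10), ("O", 10, 8), ("R", 18, 8), ("D", 26, 8),
        ("i", 40, 3), ("s", 43, 6)]
print(glyphs_to_words(line))
```

The same gap idea would extend to lines and paragraphs by also comparing y coordinates.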

Chemical Text

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis


Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

ChemDataExtractor

• http://chemdataextractor.org/docs/intro
• http://chemdataextractor.org/demo

Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00207


ChemDataExtractor

Search for “Zika” in EuropePMC and Wikidata

• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md#contentmine-demos (list of demos)

• https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html (datatables extracted - disease, gene, species, etc.)

• Lars Willighagen (NL) and Tom Arrow. visualisation of single facts and groups from Corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136

• https://contentmine-demo.herokuapp.com/cooccurrences Co-occurrence of diseases (suggest selecting 25 and disease).
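Co-occurrence counting of the kind shown in the demo can be sketched as below. This is an assumption about the general technique, not the demo's own code, and the sample papers and terms are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(papers):
    """Count how often each pair of terms appears in the same paper.

    papers: dict mapping a paper id to an iterable of terms found in it.
    Each pair is counted at most once per paper.
    """
    counts = Counter()
    for terms in papers.values():
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Invented sample data, for illustration only
papers = {
    "paper1": ["zika", "dengue", "microcephaly"],
    "paper2": ["zika", "dengue"],
    "paper3": ["zika", "microcephaly", "zika"],
}
for pair, n in cooccurrences(papers).most_common():
    print(pair, n)
```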


Dictionaries!

MINING with sections and dictionaries

[Diagram: paper sections (abstract, methods, references, captioned figures, Fig. 1, HTML tables) alongside Dict A, Dict B, ImageCaption and TableCaption]

[W3C Annotation / https://hypothes.is/ ]


Disease Dictionary (ICD-10)

<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="1q21.1 deletion syndrome"/>
  <entry term="1q21.1 duplication syndrome"/>
  <entry term="3-methylglutaconic aciduria"/>
  <entry term="3mc syndrome"/>
  <!-- ... -->
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>
  <!-- ... -->
</dictionary>
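Searching text with such a dictionary can be sketched as below. This is a minimal illustration, not the ami implementation: the function names, the case-insensitive whole-term matching and the 20-character context window are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Three entries taken from the dictionary excerpt on the slide
DICTIONARY_XML = """
<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>
</dictionary>
"""


def load_terms(xml_text):
    """Return the term attribute of every <entry> in a dictionary."""
    root = ET.fromstring(xml_text.strip())
    return [entry.get("term") for entry in root.iter("entry")]


def find_terms(text, terms, context=20):
    """Return (term, prefix, suffix) for each case-insensitive whole-term hit."""
    hits = []
    for term in terms:
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)",
                             re.IGNORECASE)
        for m in pattern.finditer(text):
            prefix = text[max(0, m.start() - context):m.start()]
            suffix = text[m.end():m.end() + context]
            hits.append((term, prefix, suffix))
    return hits


sample = "A patient with Cortical blindness was seen."
print(find_terms(sample, load_terms(DICTIONARY_XML)))
```

Keeping the prefix and suffix with each hit gives the entity in context, which is what an annotation needs.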

SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136) abnormal condition that affects the body of an organism
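Turning the result of such a query into a dictionary file can be sketched as below. The input follows the standard SPARQL JSON results layout; the function name, the lowercasing of terms and the two sample bindings are assumptions for illustration, and no request is made to the Wikidata endpoint.

```python
import json
import xml.etree.ElementTree as ET


def bindings_to_dictionary(results_json, title, var="thingLabel"):
    """Build a <dictionary> XML string from SPARQL JSON results."""
    data = json.loads(results_json)
    root = ET.Element("dictionary", {"title": title})
    # Lowercase and de-duplicate the labels (an assumption, matching
    # the lowercase terms shown in the dictionary excerpt)
    terms = sorted({b[var]["value"].lower()
                    for b in data["results"]["bindings"] if var in b})
    for term in terms:
        ET.SubElement(root, "entry", {"term": term})
    return ET.tostring(root, encoding="unicode")


# Hand-written sample in the SPARQL JSON results layout
SAMPLE = json.dumps({
    "head": {"vars": ["thingLabel"]},
    "results": {"bindings": [
        {"thingLabel": {"type": "literal", "value": "cortical blindness"}},
        {"thingLabel": {"type": "literal", "value": "Corpus luteum cyst"}},
    ]},
})
print(bindings_to_dictionary(SAMPLE, "disease"))
```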

Wikidata ontology for disease

[Diagram: the ContentMine pipeline, latest 20150908]
• Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• Tools: catalogue; getpapers (query); quickscrape (crawl, Daily Crawl); norma (Normalizer, Structurer, Semantic Tagger); ami (search, Lookup)
• norma output: Text, Data, Figures; sections such as abstract, methods, references, captioned figures, Fig. 1, HTML tables
• CONTENT MINING plugins: Chem, Phylo, Trials, Crystal, Plants
• COMMUNITY: scrapers, queries, taggers; Visualization and Analysis
• 100,000 pages/day; Semantic ScholarlyHTML (W3C community group); Facts

Amanuens.is demo

These slides represent a snapshot of an interactive demo…

Subject: Flavour

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone




WIKIDATA

Carvone in Wikidata
Also SPARQL endpoint

Search for carvone

Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, fmt XML, limit 100 hits
• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
  Send IUCN redlist plant annotations -> hypothes.is

Annotation (entity in context)

prefix, surface, label, location, suffix

[Screenshot: ARTICLES and FACETS (gene, disease, drug, phytochem, species, genus, words) for remote & local papers; columns for Disease (ICD-10), phytochemicals, species and commonest words]


Annotation sent to hypothes.is

prefix, suffix, source, user, text, uri

maybe 100+ annotations per paper
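An annotation carrying these fields might be assembled as below. This is a sketch under assumptions: the selector follows the W3C TextQuoteSelector (prefix, exact, suffix), the function name and example values are hypothetical, and whether hypothes.is accepts exactly this payload is not checked here. Nothing is sent over the network.

```python
import json


def make_annotation(uri, user, text, prefix, exact, suffix):
    """Build an annotation dict for one entity found in context."""
    return {
        "uri": uri,      # page being annotated
        "user": user,    # account posting the annotation
        "text": text,    # body shown to the reader
        "target": [{
            "source": uri,
            "selector": [{
                "type": "TextQuoteSelector",
                "prefix": prefix,
                "exact": exact,
                "suffix": suffix,
            }],
        }],
    }


# Hypothetical values, for illustration only
annotation = make_annotation(
    uri="https://example.org/paper1.html",
    user="petermr",
    text="species",
    prefix="essential oil of ",
    exact="Mentha spicata",
    suffix=" contains carvone",
)
print(json.dumps(annotation, indent=2))
```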

Wikidata: monoclinic systems with mass < 200

https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D
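The query hidden in that URL can be read by decoding its fragment. The sketch below only uses the Python standard library and makes no request to Wikidata; the variable names are ours.

```python
from urllib.parse import unquote, urlsplit

URL = (
    "https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass"
    "%0AWHERE%0A%7B%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A"
    "%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20"
    "FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20wikibase%3Alabel"
    "%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D"
)

# Everything after '#' is the percent-encoded SPARQL query
query = unquote(urlsplit(URL).fragment)
print(query)
```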

Image mining

PMR is collaborating with the European Bioinformatics Institute to liberate metabolic information from journals

Chemistry in Patents

Obfuscation?

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

Binarization (pixels = 0,1)

Irregular edges
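Binarization can be sketched as below. The slides do not say which thresholding method is used, so Otsu's method is an assumption here, and the tiny grey-level image is invented for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels with grey level <= t
    sum0 = 0   # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels):
    """Map dark pixels (ink) to 1 and light pixels (background) to 0."""
    t = otsu_threshold(pixels)
    return [[1 if v <= t else 0 for v in row] for row in pixels]


# Invented 2x3 grey image: three dark pixels, three light ones
image = [[10, 12, 200],
         [11, 210, 205]]
print(binarize(image))
```

A single global threshold like this tends to struggle with shadows and uneven contrast, which is one reason raw mobile photos are hard.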

[Figure: OCR of a plot. Y axis “Ln Bacterial load per fly” (ticks 6.0 to 11.5); X axis “Days post-infection” (0 to 5)]

Bitmap Image and Tesseract OCR

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Ross Mounce (Bath), Panton Fellow

• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One: https://docs.google.com/presentation/d/14WfNOpdRkb5QHsXELJdrOI8NHHYmDnJi7PYze7YmVtc/edit#slide=id.g1805fa150_057

Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)

4300 images

Note Jaggy and broken pixels

NEW Bacteria must have a phylogenetic tree

Labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning => Topology => Serialization => Newick





IJSEM phylotrees

• International Journal of Systematic and Evolutionary Microbiology

• All new microorganisms are expected to be published there

• Consistent (though primitive) approach to trees

“Root”

OCR (Tesseract)

Norma (imageanalysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
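Newick output is computable because it is easy to parse. The sketch below is a minimal recursive parser written for this example, not ContentMine code; it handles names and branch lengths only, and the sample string is a sub-tree of the one above.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume '('
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ','
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1                      # consume ')'
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at offset {pos}")
    return root


def leaves(tree):
    """Return (name, branch length) for every leaf, left to right."""
    if not tree["children"]:
        return [(tree["name"], tree["length"])]
    found = []
    for child in tree["children"]:
        found.extend(leaves(child))
    return found


subtree = ("(((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
           "Synergistes_jonesii:301):131,Thermotoga_maritime:357);")
for name, length in leaves(parse_newick(subtree)):
    print(name, length)
```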

Automatic Open Notebook of computations

Everything is posted to Github before being analyzed

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata]
• Missing = not found with Wikidata API

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.

Display your own tree
• Cut and paste…
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),

((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));

• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex


Supertree for 924 species

Tree

Supertree created from 4300 papers

To be extracted:
• Symbol (x,y)
• Error bar (y+,y-)
• Line
• Y axis: extent

Typical PDF with vectors - hyperlink

We can’t turn a hamburger into a cow

But we can now turn PDFs into Science

Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE

[Figure labels: UNITS, TICKS, QUANTITY, SCALE, TITLES, DATA!! 2000+ points]

VECTOR PDF

Dumb PDF => CSV => Semantic Spectrum

Automatic extraction: Smoothing Gaussian Filter, 2nd Derivative
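Gaussian smoothing and a second derivative of extracted points can be sketched as below. This is a plain-Python illustration, not the code behind the slide: the kernel radius of 3 sigma, the clamped edges and the central-difference formula are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Return normalised Gaussian weights out to about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(ys, sigma):
    """Convolve ys with a Gaussian, repeating the end values at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(ys)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)   # clamp index to the data
            acc += w * ys[j]
        out.append(acc)
    return out


def second_derivative(ys, dx=1.0):
    """Central-difference second derivative at the interior points."""
    return [(ys[i - 1] - 2 * ys[i] + ys[i + 1]) / (dx * dx)
            for i in range(1, len(ys) - 1)]


# y = x^2 sampled at x = 0..5 has a constant second derivative of 2
parabola = [x * x for x in range(6)]
print(second_derivative(parabola))
```

Smoothing first is the usual order, because differencing amplifies point-to-point noise.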

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

After AMI2 processing…..

… AMI2 has detected a square

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

http://dx.doi.org/10.3390/metabo2010100

https://bytebucket.org/petermr/xhtml2stm/wiki/animation.svg

Precision + Recall for ImageAnalysis?

• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
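Precision and recall for an extraction run can be computed as below. This is a generic sketch; the function name is ours and the sample names, including the two simulated OCR misreadings, are invented for illustration.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for extracted items against a gold set."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Invented example: two of four extracted names are correct
extracted = ["Bacillus subtilis", "Escherichia coli",
             "Escherichia colt", "Formosa algas"]
expected = ["Bacillus subtilis", "Escherichia coli", "Formosa algae"]
print(precision_recall(extracted, expected))
```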

Software Availability and collaboration

• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• collaboration with PDFBox, TabulaPDF, JailbreakingThePDF

• Extracted data CC0


(2x digital music industry!)

Questions and comments

Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation

PM-R has offered to mentor an MSc project this summer for anyone interested.

contentmine.org