A Global Commons for Scientific Data: Molecules and Wikidata

A global Commons for scientific Data Peter Murray-Rust, Dept of Chemistry, University of Cambridge and ContentMine At Molecular Engineering, Cavendish, Cambridge, UK, 2016-11-07 contentmine.org is supported by a grant to PMR as a

Upload: petermurrayrust

Post on 14-Feb-2017


TRANSCRIPT

Page 1: A Global Commons for Scientific Data: Molecules and Wikidata

A Global Commons for Scientific Data

Peter Murray-Rust,

Dept of Chemistry, University of Cambridge and ContentMine

At Molecular Engineering, Cavendish, Cambridge, UK, 2016-11-07

contentmine.org is supported by a grant to PMR as a

Page 2: A Global Commons for Scientific Data: Molecules and Wikidata

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 3: A Global Commons for Scientific Data: Molecules and Wikidata

Some topics

• Content in scientific publications
• Extracting data from text and tables
• Dictionaries
• Extracting data from images
• Extracting data from computational logs and theses
• Wikidata

Everything is open (CC-BY). Please steal and re-use

Page 5: A Global Commons for Scientific Data: Molecules and Wikidata
Page 6: A Global Commons for Scientific Data: Molecules and Wikidata
Page 7: A Global Commons for Scientific Data: Molecules and Wikidata
Page 8: A Global Commons for Scientific Data: Molecules and Wikidata

Output of scholarly publishing

• 586,364 Crossref DOIs per month (201507) [1]
• 2.5 million (papers + supplemental data) per year [citation needed]*
• Each 3 mm thick: a stack 4500 m high per year [2]

1 year’s scholarly output!

* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

Page 9: A Global Commons for Scientific Data: Molecules and Wikidata

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 10: A Global Commons for Scientific Data: Molecules and Wikidata

Most Publishers destroy structured information (LaTeX, Word) into PDF …

• Characters (NOT words or higher structure): WORD is simply 4 characters, with no space characters
• Paths (NOT circles, squares …): “Vectors”

… APIs then destroy it further into pixels (e.g. PNG or JPG)

ContentMine will read 10,000 PNGs a day and try to recover the science.
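Because a PDF holds positioned characters rather than words, word boundaries have to be inferred from the gaps between glyphs. The sketch below illustrates that idea only; it is not ContentMine's code. The `Glyph` structure, the `group_into_words` function and the gap threshold are assumptions made for this example.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """One positioned character (hypothetical structure for illustration)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def group_into_words(glyphs: List[Glyph], gap_factor: float = 0.3) -> List[str]:
    """Join the glyphs of one text line into words.

    A new word starts when the horizontal gap to the previous glyph is
    larger than gap_factor times that glyph's width (an assumed heuristic).
    """
    words: List[str] = []
    current = ""
    previous = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous is not None:
            gap = glyph.x - (previous.x + previous.width)
            if gap > gap_factor * previous.width:
                words.append(current)
                current = ""
        current += glyph.char
        previous = glyph
    if current:
        words.append(current)
    return words


# "WORD" arrives as four separate glyphs with no space character after it.
line = [Glyph("W", 0, 10), Glyph("O", 10, 8), Glyph("R", 18, 7), Glyph("D", 25, 8),
        Glyph("i", 40, 3), Glyph("s", 43, 5)]
print(group_into_words(line))
```

Real pages also need line detection, font-size changes and kerning to be handled, so the threshold here should be read as a placeholder.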

Page 11: A Global Commons for Scientific Data: Molecules and Wikidata

Chemical Text

Page 12: A Global Commons for Scientific Data: Molecules and Wikidata

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 13: A Global Commons for Scientific Data: Molecules and Wikidata

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 14: A Global Commons for Scientific Data: Molecules and Wikidata

ChemDataExtractor

• http://chemdataextractor.org/docs/intro
• http://chemdataextractor.org/demo

Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00207

Page 15: A Global Commons for Scientific Data: Molecules and Wikidata

ChemDataExtractor

Page 16: A Global Commons for Scientific Data: Molecules and Wikidata
Page 17: A Global Commons for Scientific Data: Molecules and Wikidata

Search for “Zika” in EuropePMC and Wikidata

• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md#contentmine-demos (list of demos)

• https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html (datatables extracted - disease, gene, species, etc.)

• Lars Willighagen (NL) and Tom Arrow: visualisation of single facts and groups of facts from the corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136

• https://contentmine-demo.herokuapp.com/cooccurrences Co-occurrence of diseases (suggested settings: select 25 and disease).

Page 18: A Global Commons for Scientific Data: Molecules and Wikidata

Dictionaries!

Page 19: A Global Commons for Scientific Data: Molecules and Wikidata

MINING with sections and dictionaries

[Diagram labels: two papers, each divided into abstract, methods, references, captioned figures (Fig. 1) and HTML tables; Dict A; Dict B; ImageCaption; TableCaption]

[W3C Annotation / https://hypothes.is/ ]

Page 20: A Global Commons for Scientific Data: Molecules and Wikidata

Disease Dictionary (ICD-10)

<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="1q21.1 deletion syndrome"/>
  <entry term="1q21.1 duplication syndrome"/>
  <entry term="3-methylglutaconic aciduria"/>
  <entry term="3mc syndrome"/>
  …
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>
  …
</dictionary>

SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136): abnormal condition that affects the body of an organism

Wikidata ontology for disease
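One way the query and the dictionary format on this slide could be connected is sketched below. It is an illustration under stated assumptions, not the ContentMine tool chain: the function names, the User-Agent string and the choice to lower-case and sort the terms are mine. Only `build_dictionary` runs in the example; `fetch_labels` needs network access to the public Wikidata SPARQL endpoint.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# The query shown on the slide.
DISEASE_QUERY = """
SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""


def fetch_labels(query: str = DISEASE_QUERY) -> list:
    """Run the query and return the labels (needs network access)."""
    url = WIKIDATA_SPARQL + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent value; the endpoint asks clients to identify themselves.
    request = urllib.request.Request(url, headers={"User-Agent": "dictionary-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["thingLabel"]["value"] for row in data["results"]["bindings"]]


def build_dictionary(title: str, labels: list) -> str:
    """Serialize labels as <dictionary><entry term="..."/></dictionary>.

    Lower-casing, de-duplicating and sorting are assumptions for this sketch.
    """
    root = ET.Element("dictionary", {"title": title})
    for term in sorted({label.lower() for label in labels}):
        ET.SubElement(root, "entry", {"term": term})
    return ET.tostring(root, encoding="unicode")


# Offline example with two labels taken from the slide.
print(build_dictionary("disease", ["Cortical blindness", "3MC syndrome"]))
```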

Page 21: A Global Commons for Scientific Data: Molecules and Wikidata

[ContentMine pipeline diagram. Labels, grouped by stage:]

• Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Retrieval: getpapers (query), quickscrape (crawl; scrapers, queries), daily crawl, catalogue
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma: Normalizer, Structurer, Semantic Tagger (taggers); Semantic ScholarlyHTML (W3C community group) with abstract, methods, references, captioned figures (Fig. 1), HTML tables; text, data, figures
• ami: search, lookup; plugins for Chem, Phylo, Trials, Crystal, Plants
• Output: facts; visualization and analysis
• CONTENT MINING; COMMUNITY
• 100,000 pages/day

Latest 20150908

Page 22: A Global Commons for Scientific Data: Molecules and Wikidata

Amanuens.is demo

These slides are a snapshot of an interactive demo…

Subject: Flavour

Page 23: A Global Commons for Scientific Data: Molecules and Wikidata

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone


Page 24: A Global Commons for Scientific Data: Molecules and Wikidata

https://en.wikipedia.org/wiki/Carvone

WIKIDATA

Page 25: A Global Commons for Scientific Data: Molecules and Wikidata

Carvone in Wikidata

Also SPARQL endpoint

Page 26: A Global Commons for Scientific Data: Molecules and Wikidata

Search for carvone

Page 27: A Global Commons for Scientific Data: Molecules and Wikidata

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
Normalize papers; search locally for species, sequences, diseases, drugs
Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
Send IUCN Red List plant annotations -> hypothes.is

Page 28: A Global Commons for Scientific Data: Molecules and Wikidata

Annotation (entity in context)

prefix

surface

label

location

suffix
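The labels on this slide (prefix, surface, suffix, label, location) map naturally onto the selectors of the W3C Web Annotation model. The sketch below shows one possible shape for such a record. It is an illustration, not the output of cmine: the `annotate` function, the 20-character context window and the sample sentence are assumptions, while `TextQuoteSelector` (exact, prefix, suffix) and `TextPositionSelector` (start, end) are taken from the W3C model.

```python
import re


def annotate(text: str, term: str, label: str, context: int = 20) -> list:
    """Find each whole-word occurrence of term and record it in context.

    "exact" holds the surface form as it appears in the text; "prefix" and
    "suffix" hold up to `context` characters on either side; start/end give
    the location as character offsets.
    """
    annotations = []
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        start, end = match.span()
        annotations.append({
            "type": "Annotation",
            "body": {"type": "TextualBody", "value": label},
            "target": {"selector": [
                {"type": "TextQuoteSelector",
                 "exact": match.group(0),
                 "prefix": text[max(0, start - context):start],
                 "suffix": text[end:end + context]},
                {"type": "TextPositionSelector", "start": start, "end": end},
            ]},
        })
    return annotations


# Made-up sentence for illustration only.
sentence = "Essential oil of Mentha spicata is rich in carvone and limonene."
for annotation in annotate(sentence, "carvone", "phytochemical"):
    print(annotation["target"]["selector"][0])
```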

Page 29: A Global Commons for Scientific Data: Molecules and Wikidata

ARTICLES FACETS

gene disease drug Phytochem

species genus words

Page 30: A Global Commons for Scientific Data: Molecules and Wikidata

Remote &Local papers

DiseaseICD-10

phytochemicals

species

Commonest words

Page 31: A Global Commons for Scientific Data: Molecules and Wikidata

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
Normalize papers; search locally for species, sequences, diseases, drugs
Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
Send annotations -> hypothes.is

Page 32: A Global Commons for Scientific Data: Molecules and Wikidata

Annotation (entity in context)

prefix

surface

label

location

suffix

Page 33: A Global Commons for Scientific Data: Molecules and Wikidata

Annotation sent to hypothes.is

prefix

suffix

source

user

text

uri

maybe 100+ annotations per paper

Page 34: A Global Commons for Scientific Data: Molecules and Wikidata
Page 35: A Global Commons for Scientific Data: Molecules and Wikidata

Wikidata: monoclinic systems with mass < 200

https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09%3Fitem%20wdt%3AP2067%20%3Fmass%20.%0A%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D
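The link above carries the whole SPARQL query, percent-encoded, in its fragment. A small sketch for recovering the readable query text follows; the function name `sparql_from_url` is an assumption for this example, and the URL is the one from the slide, split across string literals only for readability.

```python
from urllib.parse import unquote, urlsplit

# The URL from the slide, split across lines only for readability.
QUERY_URL = (
    "https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fmass%0AWHERE%0A%7B"
    "%0A%20%20%09%3Fitem%20wdt%3AP556%20wd%3AQ624543%20.%0A%20%20%09%3Fitem%20wdt%3AP2067"
    "%20%3Fmass%20.%0A%20%20%20%20FILTER%28%3Fmass%20%3C%20200%20%29%0A%09SERVICE%20"
    "wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D"
)


def sparql_from_url(url: str) -> str:
    """Return the SPARQL text stored in the URL fragment (the part after '#')."""
    return unquote(urlsplit(url).fragment)


print(sparql_from_url(QUERY_URL))
```

Judging by the slide title, the first triple pattern (wdt:P556 wd:Q624543) selects the monoclinic systems and the second (wdt:P2067) retrieves the mass that the FILTER then limits to below 200.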

Page 36: A Global Commons for Scientific Data: Molecules and Wikidata

Image mining

Page 37: A Global Commons for Scientific Data: Molecules and Wikidata

PMR is collaborating with the European Bioinformatics Institute to liberate metabolic information from journals

Page 38: A Global Commons for Scientific Data: Molecules and Wikidata
Page 39: A Global Commons for Scientific Data: Molecules and Wikidata
Page 40: A Global Commons for Scientific Data: Molecules and Wikidata

Chemistry in Patents

Obfuscation?

Page 41: A Global Commons for Scientific Data: Molecules and Wikidata

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

Page 42: A Global Commons for Scientific Data: Molecules and Wikidata

Binarization (pixels = 0,1)

Irregular edges
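The slide does not say how the threshold for binarization is chosen. As an illustration only, the sketch below uses Otsu's method, a common choice that picks the grey level maximizing the separation between dark and light pixels; the function names and the tiny sample image are assumptions, and this is not claimed to be what the ContentMine image code does.

```python
def otsu_threshold(image):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image, threshold):
    """Map each pixel to 1 if it is brighter than the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


# Tiny made-up greyscale image: three dark pixels, three light pixels.
image = [[10, 12, 200],
         [11, 210, 205]]
threshold = otsu_threshold(image)
print(threshold, binarize(image, threshold))
```

A single global threshold copes poorly with the shadows and uneven contrast of a mobile photo, which is one reason the edges come out irregular.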

Page 43: A Global Commons for Scientific Data: Molecules and Wikidata

Bitmap Image and Tesseract OCR

[Plot. y axis: Ln Bacterial load per fly (tick labels 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0); x axis: Days post-infection (tick labels 0 to 5)]

Page 44: A Global Commons for Scientific Data: Molecules and Wikidata

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Page 45: A Global Commons for Scientific Data: Molecules and Wikidata

Ross Mounce (Bath), Panton Fellow

• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:

Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)

Page 46: A Global Commons for Scientific Data: Molecules and Wikidata

4300 images

Page 47: A Global Commons for Scientific Data: Molecules and Wikidata

Note Jaggy and broken pixels

NEW Bacteria must have a phylogenetic tree

Labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate

Page 49: A Global Commons for Scientific Data: Molecules and Wikidata

IJSEM phylotrees

• International Journal of Systematic and Evolutionary Microbiology

• All new microorganisms are expected to be published there

• Consistent (though primitive) approach to trees

Page 50: A Global Commons for Scientific Data: Molecules and Wikidata

“Root”

Page 51: A Global Commons for Scientific Data: Molecules and Wikidata

OCR (Tesseract)

Norma (imageanalysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
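The bracketed string above is in Newick format, which is what makes the output computable. As a sketch of what "re-usable" means in practice, the minimal parser below reads a well-formed Newick string and lists the leaf names with their branch lengths. It is an assumption-laden illustration (no quoted labels, no comments, no error handling), not the parser used by Norma or ami, and the example string is a complete sub-clade copied from the tree on the slide.

```python
def parse_newick(newick: str) -> dict:
    """Parse a well-formed Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    """
    text = newick.strip().rstrip(";")
    position = 0

    def parse_node() -> dict:
        nonlocal position
        children = []
        if position < len(text) and text[position] == "(":
            position += 1  # skip "("
            children.append(parse_node())
            while text[position] == ",":
                position += 1  # skip ","
                children.append(parse_node())
            position += 1  # skip ")"
        # Optional node name.
        start = position
        while position < len(text) and text[position] not in ",():":
            position += 1
        name = text[start:position]
        # Optional ":length".
        length = None
        if position < len(text) and text[position] == ":":
            position += 1
            start = position
            while position < len(text) and text[position] not in ",()":
                position += 1
            length = float(text[start:position])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaves(node: dict) -> list:
    """Return (name, branch length) for every leaf, left to right."""
    if not node["children"]:
        return [(node["name"], node["length"])]
    result = []
    for child in node["children"]:
        result.extend(leaves(child))
    return result


subtree = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
for name, length in leaves(parse_newick(subtree)):
    print(name, length)
```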

Page 52: A Global Commons for Scientific Data: Molecules and Wikidata
Page 53: A Global Commons for Scientific Data: Molecules and Wikidata

Automatic Open Notebook of computations

Everything is posted to Github before being analyzed

Page 54: A Global Commons for Scientific Data: Molecules and Wikidata

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata]
• Missing = not found with Wikidata API

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.
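The slide does not show how the Wikidata API was called. One plausible way to look up a binomial name is the `wbsearchentities` action of the MediaWiki API on wikidata.org, sketched below. The helper names, the User-Agent string and the choice to take the first hit are assumptions; the canned response in the example is made up so that the parsing step can run without network access.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"


def search_url(name: str) -> str:
    """Build a wbsearchentities request URL for an English label search."""
    params = {"action": "wbsearchentities", "search": name,
              "language": "en", "type": "item", "format": "json"}
    return API + "?" + urllib.parse.urlencode(params)


def first_match(response: dict):
    """Return the id of the first hit, or None when nothing was found."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


def lookup(name: str):
    """Query the live API (needs network access)."""
    # Hypothetical User-Agent value for this sketch.
    request = urllib.request.Request(search_url(name),
                                     headers={"User-Agent": "species-lookup-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return first_match(json.load(response))


print(search_url("Formosa algae"))
# Made-up responses, showing the "found" and "missing" cases.
print(first_match({"search": [{"id": "Q1", "label": "example"}]}))
print(first_match({"search": []}))
```

A label search can miss an item that exists under a synonym or a different spelling, so "not found" here is weaker evidence than a lookup by taxon identifier would be.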

Page 55: A Global Commons for Scientific Data: Molecules and Wikidata

Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),

((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));

• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex

Page 56: A Global Commons for Scientific Data: Molecules and Wikidata

Supertree for 924 species

Tree

Page 57: A Global Commons for Scientific Data: Molecules and Wikidata

Supertree created from 4300 papers

Page 58: A Global Commons for Scientific Data: Molecules and Wikidata

Plots

Page 59: A Global Commons for Scientific Data: Molecules and Wikidata
Page 60: A Global Commons for Scientific Data: Molecules and Wikidata

To be extracted:
• Symbol (x, y)
• Error bar (y+, y-)
• Line

Y axis
• Extent
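Once the axis extent and two tick positions are known, symbols and error bars found in the image can be converted from pixel coordinates to data values by linear interpolation. The sketch below assumes linear axes; the calibration numbers (pixel positions of the ticks) and all function names are hypothetical, chosen only to make the example concrete.

```python
def axis_mapping(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value (linear axis)."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration from two ticks per axis.
to_x = axis_mapping(100, 0.0, 600, 5.0)   # x tick "0" at pixel 100, tick "5" at pixel 600
to_y = axis_mapping(400, 6.0, 50, 11.5)   # image y grows downwards, data y grows upwards


def symbol_to_data(pixel_x, pixel_y):
    """Convert the centre of a plotted symbol to (x, y) data values."""
    return to_x(pixel_x), to_y(pixel_y)


def error_bar_to_data(pixel_top, pixel_bottom):
    """Convert the ends of a vertical error bar to (y+, y-) data values."""
    return to_y(pixel_top), to_y(pixel_bottom)


print(symbol_to_data(350, 225))
print(error_bar_to_data(200, 250))
```

Logarithmic axes would need the same interpolation applied to the logarithm of the tick values.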

Page 61: A Global Commons for Scientific Data: Molecules and Wikidata
Page 62: A Global Commons for Scientific Data: Molecules and Wikidata
Page 63: A Global Commons for Scientific Data: Molecules and Wikidata

Typical PDF with vectors - hyperlink

Page 64: A Global Commons for Scientific Data: Molecules and Wikidata

We can’t turn a hamburger into a cow

But we can now turn PDFs into Science

Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE

Page 65: A Global Commons for Scientific Data: Molecules and Wikidata

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF

Page 66: A Global Commons for Scientific Data: Molecules and Wikidata

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
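Once a spectrum has been extracted as numbers, operations such as Gaussian smoothing and a second derivative become a few lines of code. The sketch below is a generic illustration on an evenly spaced signal, not the code behind the slide: the kernel radius, the edge handling (repeating the end values) and the sample data are assumptions.

```python
import math


def gaussian_kernel(sigma: float, radius: int) -> list:
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma: float = 1.0, radius: int = 3) -> list:
    """Gaussian-smooth a 1-D signal; edges are handled by repeating the end values."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    result = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), n - 1)  # clamp the index to the signal
            acc += weight * values[j]
        result.append(acc)
    return result


def second_derivative(values, step: float = 1.0) -> list:
    """Central-difference second derivative at the interior points."""
    return [(values[i - 1] - 2 * values[i] + values[i + 1]) / (step * step)
            for i in range(1, len(values) - 1)]


# Made-up spectrum with a single peak at index 3.
spectrum = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
curvature = second_derivative(smooth(spectrum))
print(curvature.index(min(curvature)) + 1)  # index of the strongest peak in the original signal
```

The second derivative is most negative at a peak maximum, which is why it is a common aid for locating peaks; smoothing first keeps noise from dominating the differences.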

Page 67: A Global Commons for Scientific Data: Molecules and Wikidata
Page 68: A Global Commons for Scientific Data: Molecules and Wikidata
Page 69: A Global Commons for Scientific Data: Molecules and Wikidata

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 70: A Global Commons for Scientific Data: Molecules and Wikidata

After AMI2 processing…..

… AMI2 has detected a square

Page 71: A Global Commons for Scientific Data: Molecules and Wikidata
Page 72: A Global Commons for Scientific Data: Molecules and Wikidata

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

Page 73: A Global Commons for Scientific Data: Molecules and Wikidata

Precision + Recall for ImageAnalysis?

• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
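For reference, precision is the fraction of extracted items that are correct and recall is the fraction of the expected items that were found. The slide does not say how its percentages were computed, so the sketch below only shows the standard definitions on sets; the function name and the sample names (including the deliberate misspelling standing in for an OCR error) are made up for illustration.

```python
def precision_recall(extracted: set, expected: set):
    """Return (precision, recall) for a set of extracted items against a gold set."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: three names read from an image, four names actually present.
extracted = {"Escherichia coli", "Bacillus subtilis", "Bacillus subtillis"}
expected = {"Escherichia coli", "Bacillus subtilis", "Formosa algae", "Halobacillus halophilus"}
precision, recall = precision_recall(extracted, expected)
print(round(precision, 2), round(recall, 2))
```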

Page 74: A Global Commons for Scientific Data: Molecules and Wikidata

Software Availability and collaboration

• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• Collaboration with PDFBox, TabulaPDF, JailbreakingThePDF

• Extracted data CC0

Page 75: A Global Commons for Scientific Data: Molecules and Wikidata

(2x digital music industry!)

Page 76: A Global Commons for Scientific Data: Molecules and Wikidata

Questions and comments

Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation

PM-R has offered to mentor an MSc project this summer for anyone interested.

contentmine.org