High throughput mining of the scholarly literature; talk at NIH



TRANSCRIPT

Page 1: High throughput mining of the scholarly literature; talk at NIH

NIH, Bethesda, US, 2016-11-15

High throughput mining of the scholarly literature

Peter Murray-Rust [1,2]

[1] University of Cambridge   [2] TheContentMine   pm286 AT cam DOT ac DOT uk

Scientific knowledge is for everyone

Page 2: High throughput mining of the scholarly literature; talk at NIH

Themes

• $500 billion of funded STM research / year
• 85% of medical research is wasted (Lancet 2011)
• An Open mining toolset
• Wikidata as the semantic backbone
• Community involvement
• Sociopolitical issues
• My gratitude to NIH
• Offers of collaboration; data ingestion? Software?

Sources?

Page 3: High throughput mining of the scholarly literature; talk at NIH
Page 4: High throughput mining of the scholarly literature; talk at NIH

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 5: High throughput mining of the scholarly literature; talk at NIH

CLOSED ACCESS MEANS PEOPLE DIE

Page 6: High throughput mining of the scholarly literature; talk at NIH

The Right to Read is the Right to Mine*   *Peter Murray-Rust, 2011

http://contentmine.org

Page 7: High throughput mining of the scholarly literature; talk at NIH

(2x digital music industry!)

Page 8: High throughput mining of the scholarly literature; talk at NIH

Scholarly publishing is “Big Data”

[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

586,364 Crossref DOIs per month (2015-07) [1]; ~2.5 million papers + supplemental data per year [citation needed]*

Each 3 mm thick: a stack ~4500 m high per year [2], i.e. 1 year’s scholarly output!

* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html

Page 9: High throughput mining of the scholarly literature; talk at NIH

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 10: High throughput mining of the scholarly literature; talk at NIH

Demos of mining

Page 11: High throughput mining of the scholarly literature; talk at NIH

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 12: High throughput mining of the scholarly literature; talk at NIH

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 13: High throughput mining of the scholarly literature; talk at NIH

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION: https://bytebucket.org/petermr/xhtml2stm/wiki/animation.svg?rev=793a4d9ffa0616a84ff4aeabf80e657b5142ed33

(may be browser dependent)

Andy Howlett, Cambridge

Page 14: High throughput mining of the scholarly literature; talk at NIH

ChemDataExtractor

• http://chemdataextractor.org/docs/intro • http://chemdataextractor.org/demo

Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00207
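A minimal sketch, roughly following ChemDataExtractor's documented quickstart: parse a short passage and list the chemical entity mentions it finds. The example sentence is hypothetical, and attribute names may vary between versions; see the docs and demo links above.

# Minimal sketch of ChemDataExtractor usage (quickstart style, not a full workflow).
# The example sentence is hypothetical.
from chemdataextractor import Document

text = ("2,4,6-trinitrotoluene was dissolved in tetrahydrofuran and treated "
        "with sodium borohydride at 0 °C.")

doc = Document(text)
for cem in doc.cems:            # chemical entity mentions found in the text
    print(cem.text, cem.start, cem.end)

# Structured records (names, properties) serialized as dictionaries:
print(doc.records.serialize())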

Page 15: High throughput mining of the scholarly literature; talk at NIH

Europe PubMed Central

Page 16: High throughput mining of the scholarly literature; talk at NIH

2015

2016

Page 17: High throughput mining of the scholarly literature; talk at NIH

Dictionaries!

Page 18: High throughput mining of the scholarly literature; talk at NIH

[Pipeline diagram, latest 2015-09-08: the ContentMine toolchain ("CONTENT MINING")]

• getpapers queries catalogues (EuropePMC, arXiv, CORE, HAL, university repositories, Crossref) via a daily crawl and retrieves PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV and XLS files plus URLs and DOIs (see the sketch below).
• quickscrape crawls publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …) using community-contributed scrapers and queries.
• norma (Normalizer / Structurer / SemanticTagger) converts text, data and figures (~100,000 pages/day) into Semantic ScholarlyHTML (W3C community group), split into sections: abstract, methods, references, captioned figures (e.g. Fig. 1), HTML tables.
• ami searches the normalized corpus with lookups and community plugins/taggers (Chem, Phylo, Trials, Crystal, Plants), producing Facts for visualization and analysis.
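A minimal sketch of the first step in the diagram: querying EuropePMC for openly readable papers, roughly what getpapers does for its EuPMC source. This is not the getpapers tool itself; the query string "zika" and the fields printed from the JSON response are illustrative assumptions.

# Minimal sketch: query the EuropePMC REST API for open-access papers,
# in the spirit of ContentMine's getpapers (EuPMC source).
import requests

EUPMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_eupmc(query, page_size=25):
    params = {
        "query": query + " AND OPEN_ACCESS:Y",  # restrict to openly readable papers
        "format": "json",
        "pageSize": page_size,
    }
    resp = requests.get(EUPMC, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("resultList", {}).get("result", [])

if __name__ == "__main__":
    for hit in search_eupmc("zika"):
        # Hits without a PMCID (no full text in PMC) are skipped.
        if "pmcid" in hit:
            print(hit["pmcid"], hit.get("title", ""))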

Page 19: High throughput mining of the scholarly literature; talk at NIH

[Diagram: MINING with sections and dictionaries. Each normalized paper is split into sections (abstract, methods, references, captioned figures e.g. Fig. 1, image captions, table captions, HTML tables), and each section is searched against one or more dictionaries (Dict A, Dict B).]

[W3C Annotation / https://hypothes.is/ ]

Page 20: High throughput mining of the scholarly literature; talk at NIH

Disease Dictionary (ICD-10)

<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="1q21.1 deletion syndrome"/>
  <entry term="1q21.1 duplication syndrome"/>
  <entry term="3-methylglutaconic aciduria"/>
  <entry term="3mc syndrome"/>
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>

SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

wdt:P494 = ICD-10 identifier (P494)
wd:Q12136 = disease (Q12136), "abnormal condition that affects the body of an organism"

Wikidata ontology for disease
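A minimal sketch of how such a dictionary could be generated by running the slide's SPARQL query against the public Wikidata Query Service. The endpoint and JSON result layout are standard for the service; the output filename "disease.xml" is a hypothetical example.

# Minimal sketch: run the slide's SPARQL query against the Wikidata Query Service
# and write the result as a ContentMine-style dictionary.
import requests
import xml.etree.ElementTree as ET

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""

def build_dictionary(title="disease"):
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "dictionary-builder-sketch"},
                        timeout=60)
    resp.raise_for_status()
    root = ET.Element("dictionary", title=title)
    for binding in resp.json()["results"]["bindings"]:
        ET.SubElement(root, "entry", term=binding["thingLabel"]["value"])
    return ET.ElementTree(root)

if __name__ == "__main__":
    # "disease.xml" is a hypothetical output path.
    build_dictionary().write("disease.xml", encoding="utf-8", xml_declaration=True)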

Page 21: High throughput mining of the scholarly literature; talk at NIH

Example statistics dictionary

<dictionary title="statistics2">
  <entry term="ANCOVA" name="ANCOVA"/>
  <entry term="ANOVA" name="ANOVA"/>
  <entry term="CFA" name="CFA"/>
  <entry term="EFA" name="EFA"/>
  <entry term="Likert" name="Likert"/>
  <entry term="Mann-Whitney" name="Mann-Whitney"/>
  <entry term="MANOVA" name="MANOVA"/>
  <entry term="McNemar" name="McNemar"/>
  <entry term="PCA" name="PCA"/>
  <entry term="Pearson" name="Pearson"/>
  <entry term="Spearman" name="Spearman"/>
  <entry term="t-test" name="t-test"/>
  <entry term="Wilcoxon" name="Wilcoxon"/>
</dictionary>

“Mann-Whitney” links to the Wikipedia entry and the Wikidata entry (Q1424533)
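A minimal sketch of dictionary-based tagging in the spirit of ami (not its actual code): load a dictionary file like the one above and report every term match with its position. The dictionary path and the sample sentence are hypothetical.

# Minimal sketch of dictionary-based tagging; filename and sentence are hypothetical.
import re
import xml.etree.ElementTree as ET

def load_dictionary(path):
    """Return {term: name} from a ContentMine-style <dictionary> file."""
    root = ET.parse(path).getroot()
    return {e.get("term"): e.get("name", e.get("term")) for e in root.iter("entry")}

def tag(text, terms):
    """Yield (term, name, start, end) for every dictionary term found in text."""
    for term, name in terms.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            yield term, name, m.start(), m.end()

if __name__ == "__main__":
    terms = load_dictionary("statistics2.xml")   # hypothetical path
    sentence = "Groups were compared with a Mann-Whitney test and one-way ANOVA."
    for term, name, start, end in tag(sentence, terms):
        print(f"{name!r} found at characters {start}-{end}")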

Page 22: High throughput mining of the scholarly literature; talk at NIH

Annotation (entity in context): prefix, surface, label, location, suffix

Lars Willighagen (NL) and Tom Arrow: visualisation of single facts and groups from the corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136

Machine version
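As an illustration only (not ContentMine's actual output schema), such an "entity in context" annotation can be represented as a small record that keeps the matched surface string together with its surroundings, loosely following the W3C Web Annotation TextQuoteSelector idea. All field values below are hypothetical.

# Illustrative only: one way to represent an "entity in context" annotation.
# All field values are hypothetical examples, not real extracted data.
annotation = {
    "prefix": "compared with a ",        # text immediately before the match
    "surface": "Mann-Whitney",           # the exact string that was matched
    "label": "Mann-Whitney U test",      # normalized name / dictionary entry
    "location": {"section": "methods", "char_start": 28, "char_end": 40},
    "suffix": " test and one-way",       # text immediately after the match
    "wikidata": "Q1424533",              # link target for the entity
}

print(annotation["surface"], "->", annotation["wikidata"])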

Page 23: High throughput mining of the scholarly literature; talk at NIH

Wikidata demo

• Find all architecturally significant buildings in Cambridge UK

• https://tools.wmflabs.org/wikishootme/#lat=52.204082366142&lng=0.11190176010131837&zoom=16&layers=wikidata_image,wikidata_no_image&sparql_filter=%3Fq%20wdt%3AP1435%20wd%3AQ15700834

credit: Magnus Manske https://en.wikipedia.org/wiki/Magnus_Manske

Story: Magnus used FOI to get metadata for tens of thousands of “listed buildings” [1] from English Heritage and put all data into Wikidata

[1] https://www.wikidata.org/wiki/Q570600

Page 24: High throughput mining of the scholarly literature; talk at NIH
Page 25: High throughput mining of the scholarly literature; talk at NIH
Page 26: High throughput mining of the scholarly literature; talk at NIH
Page 27: High throughput mining of the scholarly literature; talk at NIH

Is chemistry in Wikidata?

Page 28: High throughput mining of the scholarly literature; talk at NIH

• https://pubchem.ncbi.nlm.nih.gov/

Page 29: High throughput mining of the scholarly literature; talk at NIH

PubChem (P662) is a Wikidata “Property”

143347 PubChem items

Wikidata knows about PubChem

PubChem item (Q27140241), label: O-acetylcarnitine
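A minimal sketch of checking that link programmatically: fetch the item's JSON from Wikidata's public EntityData endpoint and read its PubChem CID claim (P662). Q27140241 and P662 come from the slide; the claim traversal follows the standard Wikidata JSON layout, and error handling is omitted.

# Minimal sketch: fetch a Wikidata item and read its PubChem CID (P662) claim.
import requests

def pubchem_cid(qid):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]
    label = entity["labels"]["en"]["value"]
    claims = entity["claims"].get("P662", [])
    cids = [c["mainsnak"]["datavalue"]["value"] for c in claims
            if c["mainsnak"].get("datavalue")]
    return label, cids

if __name__ == "__main__":
    label, cids = pubchem_cid("Q27140241")
    print(label, "-> PubChem CID(s):", cids)   # expected label: O-acetylcarnitine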

Page 30: High throughput mining of the scholarly literature; talk at NIH

Wikidata

Page 32: High throughput mining of the scholarly literature; talk at NIH

Search for “Zika” in EuropePMC and Wikidata

• https://github.com/ContentMine/amidemos/blob/master/WIKIDATA.md#contentmine-demos (list of demos)

• https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html (datatables extracted: disease, gene, species, etc.)
• Lars Willighagen (NL) and Tom Arrow: visualisation of single facts and groups from the corpus. https://tarrow.github.io/factvis/#cmid=CM.wikidatacountry136

• https://contentmine-demo.herokuapp.com/cooccurrences Co-occurrence of diseases; suggest selecting 25 and disease.

Page 33: High throughput mining of the scholarly literature; talk at NIH

https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html

Search on publicly accessible papers on “Zika”

Page 34: High throughput mining of the scholarly literature; talk at NIH
Page 35: High throughput mining of the scholarly literature; talk at NIH
Page 36: High throughput mining of the scholarly literature; talk at NIH

<dictionary title="tropicalVirus">
  <entry term="ZIKV" name="Zika virus"/>
  <entry term="Zika" name="Zika virus"/>
  <entry term="DENV" name="Dengue virus"/>
  <entry term="Dengue" name="Dengue virus"/>
  <entry term="CHIKV" name="Chikungunya virus"/>
  <entry term="Chikungunya" name="Chikungunya virus"/>
  <entry term="WNV" name="West Nile virus"/>
  <entry term="West Nile" name="West Nile virus"/>
  <entry term="YFV" name="Yellow fever virus"/>
  <entry term="Yellow fever" name="Yellow fever virus"/>
  <entry term="HPV" name="Human papilloma virus"/>
  <entry term="Human papilloma virus" name="Human papilloma virus"/>
</dictionary>

Terms co-occurring with “Zika”

Page 37: High throughput mining of the scholarly literature; talk at NIH

Diagram mining

• TL;DR: we can get high-precision scientific data out of diagrams

Page 38: High throughput mining of the scholarly literature; talk at NIH

PMR is collaborating with the European Bioinformatics Institute to liberate metabolic information from journals

Page 39: High throughput mining of the scholarly literature; talk at NIH
Page 40: High throughput mining of the scholarly literature; talk at NIH
Page 41: High throughput mining of the scholarly literature; talk at NIH

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

Page 42: High throughput mining of the scholarly literature; talk at NIH

Binarization (pixels = 0,1)

Irregular edges
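A minimal sketch of the binarization step using Pillow and NumPy. The filename and the fixed threshold of 128 are hypothetical choices; real pipelines usually pick the threshold adaptively (e.g. Otsu's method).

# Minimal sketch of binarization: map every pixel of a greyscale image to 0 or 1.
# "photo.png" and threshold=128 are hypothetical.
import numpy as np
from PIL import Image

def binarize(path, threshold=128):
    grey = np.asarray(Image.open(path).convert("L"))     # 0..255 greyscale
    return (grey >= threshold).astype(np.uint8)           # 1 = background, 0 = ink

if __name__ == "__main__":
    bits = binarize("photo.png")
    print(bits.shape, "ink pixels:", int((bits == 0).sum()))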

Page 43: High throughput mining of the scholarly literature; talk at NIH

[Plot residue: y-axis "Ln bacterial load per fly", ticks 6.0 to 11.5; x-axis "Days post-infection", ticks 0 to 5]

Bitmap Image and Tesseract OCR
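A minimal sketch of running Tesseract over such a bitmap via the pytesseract wrapper (the tesseract binary must be installed). The filename is a hypothetical example; text recovered from low-resolution plots is often noisy.

# Minimal sketch: OCR a bitmap figure with Tesseract via pytesseract.
# "figure.png" is a hypothetical file.
from PIL import Image
import pytesseract

def ocr_figure(path):
    text = pytesseract.image_to_string(Image.open(path))
    # Keep non-empty lines; axis labels and tick values usually come out line by line.
    return [line.strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    for line in ocr_figure("figure.png"):
        print(line)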

Page 44: High throughput mining of the scholarly literature; talk at NIH

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Page 45: High throughput mining of the scholarly literature; talk at NIH

Ross Mounce (Bath), Panton Fellow

• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:

Ross shows how to bring figures to life:
• PLOS ONE at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)

Page 46: High throughput mining of the scholarly literature; talk at NIH

4300 images

Page 47: High throughput mining of the scholarly literature; talk at NIH

Note: jaggy and broken pixels

NEW bacteria must have a phylogenetic tree

[Tree figure annotated with: branch length / weight (evolution rate), binomial name, culture/strain, GenBank ID]

Page 48: High throughput mining of the scholarly literature; talk at NIH

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic, re-usable/computable Newick output (ca. 4 secs/image)
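A minimal sketch of what "computable" means here: the Newick string can be parsed with Biopython and immediately queried or drawn. The tree below is just a shortened fragment of the slide's output; Bio.Phylo is one of several libraries that read Newick.

# Minimal sketch: the extracted Newick string is directly computable.
from io import StringIO
from Bio import Phylo

NEWICK = ("(((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
          "Synergistes_jonesii:301):131,Thermotoga_maritime:357);")

tree = Phylo.read(StringIO(NEWICK), "newick")
print("taxa:", tree.count_terminals())
Phylo.draw_ascii(tree)   # quick text rendering of the recovered topology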

Page 49: High throughput mining of the scholarly literature; talk at NIH
Page 50: High throughput mining of the scholarly literature; talk at NIH

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata]
• Missing = not found with Wikidata API

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.
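A minimal sketch of the kind of lookup behind "not found with Wikidata API": search each binomial name with the wbsearchentities action of the Wikidata API. The three names are a subset of the slide's list, and taking the top hit is a simplification of real matching policy.

# Minimal sketch: look up species names with the Wikidata wbsearchentities API.
import requests

API = "https://www.wikidata.org/w/api.php"

def lookup(name):
    params = {"action": "wbsearchentities", "search": name,
              "language": "en", "format": "json"}
    hits = requests.get(API, params=params, timeout=30).json().get("search", [])
    return hits[0]["id"] if hits else None   # simplification: take the top hit

if __name__ == "__main__":
    for species in ["Bacillus subtilis", "Escherichia coli", "Filobacillus milosensis"]:
        qid = lookup(species)
        print(species, "->", qid if qid else "not found")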

Page 51: High throughput mining of the scholarly literature; talk at NIH

Supertree for 924 species


Page 52: High throughput mining of the scholarly literature; talk at NIH

Supertree created from 4300 papers

Page 53: High throughput mining of the scholarly literature; talk at NIH

We can’t turn a hamburger into a cow, but we can now turn PDFs into Science

Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE

Page 54: High throughput mining of the scholarly literature; talk at NIH

[Vector PDF plot, annotated: UNITS, TICKS, QUANTITY SCALE, TITLES, and the DATA itself (2000+ points)]

Page 55: High throughput mining of the scholarly literature; talk at NIH

Dumb PDF → CSV → semantic spectrum

Automatic extraction; smoothing (Gaussian filter) and 2nd derivative
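A minimal sketch of the smoothing and second-derivative step on extracted (x, y) points, using SciPy's Gaussian filter. The synthetic peak and sigma value are illustrative assumptions, not the actual AMI processing.

# Minimal sketch: Gaussian smoothing and 2nd derivative of extracted spectrum points.
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Pretend these came out of the vector PDF as a CSV of (x, y) points.
x = np.linspace(0, 10, 2000)
y = np.exp(-((x - 4.0) ** 2) / 0.05) + 0.02 * np.random.randn(x.size)   # one noisy peak

smoothed = gaussian_filter1d(y, sigma=5)                 # smoothing Gaussian filter
second_deriv = gaussian_filter1d(y, sigma=5, order=2)    # smoothed 2nd derivative

# Peak positions show up as strong minima of the 2nd derivative.
peak_index = int(np.argmin(second_deriv))
print(f"estimated peak at x = {x[peak_index]:.2f}")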

Page 56: High throughput mining of the scholarly literature; talk at NIH
Page 57: High throughput mining of the scholarly literature; talk at NIH
Page 58: High throughput mining of the scholarly literature; talk at NIH

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 59: High throughput mining of the scholarly literature; talk at NIH

After AMI2 processing…..

… AMI2 has detected a square

Page 60: High throughput mining of the scholarly literature; talk at NIH
Page 61: High throughput mining of the scholarly literature; talk at NIH
Page 62: High throughput mining of the scholarly literature; talk at NIH

http://www.lisboncouncil.net/publication/publication/134-text-and-data-mining-for-research-and-innovation-.html

Asian and U.S. scholars continue to show a huge interest in text and data mining as measured by academic research on the topic. And Europe’s position is falling relative to the rest of the world.

Legal clarity also matters. Some countries apply the “fair-use” doctrine, which allows “exceptions” to existing copyright law, including for text and data mining. Israel, the Republic of Korea, Singapore, Taiwan and the U.S. are in this group. Others have created a new copyright “exception” for text and data mining – Japan, for instance, which adopted a blanket text-and-data-mining exception in 2009, and more recently the United Kingdom, where text and data mining was declared fully legal for non-commercial research purposes in 2014. Some researchers worry that the UK exception does not go far enough; others report that British researchers are now at an advantage over their continental counterparts.

the Middle East is now the world’s fourth largest region for research on text and data mining, led by Iran and Turkey.

Page 63: High throughput mining of the scholarly literature; talk at NIH

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:

"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ …

#opencon #TDM

Elsevier stopped me doing my research
Chris Hartgerink

Page 64: High throughput mining of the scholarly literature; talk at NIH

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB per minute, 0.125GB per hour, 3GB per day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

Chris Hartgerink’s blog post

Page 65: High throughput mining of the scholarly literature; talk at NIH

WILEY … “new security feature… to prevent systematic download of content”

“[limit of] 100 papers per day”

“essential security feature … to protect both parties (sic)”

CAPTCHA: user has to type words

Page 66: High throughput mining of the scholarly literature; talk at NIH

http://onsnetwork.org/chartgerink/2016/02/23/wiley-also-stopped-my-doing-my-research/

Wiley also stopped me (Chris Hartgerink) doing my research

In November, I wrote about how Elsevier wanted me to stop downloading scientific articles for my research. Today, Wiley also ordered me to stop downloading.

As a quick recapitulation: I am a statistician doing research into detecting potentially problematic research such as data fabrication and estimating how often it occurs. For this, I need to download many scientific articles, because my research applies content mining methods that extract facts from them (e.g., test statistics). These facts serve as my data to answer my research questions. If I cannot download these research articles, I cannot collect the data I need to do my research.

I was downloading psychology research articles from the Wiley library, with a maximum of 5 per minute. I did this using the tool quickscrape, developed by the ContentMine organization. With this, I have downloaded approximately 18,680 research articles from the Wiley library, which I was downloading solely for research purposes.

Wiley noticed my downloading and notified my university library that they detected a compromised proxy, which they had immediately restricted. They called it “illegally downloading copyrighted content licensed by your institution”. However, at no point was there any investigation into whether my user credentials were actually compromised (they were not). Whether I had legitimate reasons to download these articles was never discussed. The original email from Wiley is available here.

As a result of Wiley denying me to download these research articles, I cannot collect data from another one of the big publishers, alongside Elsevier. Wiley is more strict than Elsevier by immediately condemning the downloading as illegal, whereas Elsevier offers an (inadequate) API with additional terms of use (while legitimate access has already been obtained). I am really confused about what the publisher’s stance on content mining is, because Sage and Springer seemingly allow it; I have downloaded 150,210 research articles from Springer and 12,971 from Sage and they never complained about it.

Page 67: High throughput mining of the scholarly literature; talk at NIH

Julia Reda, Pirate MEP, running ContentMine software to liberate science 2016-04-16

Page 68: High throughput mining of the scholarly literature; talk at NIH

WikiFactMine
• https://meta.wikimedia.org/wiki/Grants:Project/ContentMine/WikiFactMine

Page 69: High throughput mining of the scholarly literature; talk at NIH

anyone can review the grant

Page 70: High throughput mining of the scholarly literature; talk at NIH

comments help to refine proposal

Page 71: High throughput mining of the scholarly literature; talk at NIH

(2x digital music industry!)

Page 72: High throughput mining of the scholarly literature; talk at NIH

Themes

• $500 billion of funded STM research / year
• 85% of medical research is wasted (Lancet 2011)
• An Open mining toolset
• Wikidata as the semantic backbone
• Community involvement
• Sociopolitical issues
• My gratitude to NIH
• Offers of collaboration; data ingestion? Software?

Sources?

Page 73: High throughput mining of the scholarly literature; talk at NIH

Additional material

Page 76: High throughput mining of the scholarly literature; talk at NIH

Table in a scientific paper
http://dx.doi.org/10.1371/journal.pmed.1002150.t001

Page 77: High throughput mining of the scholarly literature; talk at NIH

Typical scientific diagram (bitmap, so not machine-understandable)