Content mining in neuroscience

Open Mining of the Bioscience Literature

Peter Murray-Rust, ContentMine.org and the University of Cambridge

UNAM, MX 2015-10-09

Millions of data points are hidden in the bioscience literature. ContentMine has Open technology to liberate them automatically.

Using OpenNotebook approaches

The major problem is politico-legal

This is an exploratory talk, looking for ideas and projects

The future depends on young people

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Panton Authors and Fellows

Some particularly relevant Fellows/Alumni and projects:
• Rufus Pollock: Open Knowledge Foundation
• Mark Surman: Mozilla
• Dan Whaley: Hypothes.is
• Daniel Lombrana-Gonzales: PyBossa/Crowdcrafting

Erin McKiernan, 2015 Flash Award

ContentMine and Peter Murray-Rust are funded by:

The Right to Read is the Right to Mine

http://contentmine.org


ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Typical scientific paper

Why do we publish science?

• Communicate our results
• Archival
• Get feedback from peers
• Provide material that others can re-use
• Priority and esteem

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

[Liberian Ministry of Health] were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing



Re-use

You cannot assume how others will want to re-use your work.

PM-R’s “first real paper”, doing science by re-using the results of others in a novel way

1974: Each point is a separate paper! Each needed 1-4 hours in the library: discovery, hardcopy delivery, transcription, hand calculation.

1976-9: PMR and WDSM developed software and protocols to search and analyze the Cambridge Crystallographic DB

We need machines to read the literature

Output of scholarly publishing

586,364 Crossref DOIs 201,507 [1] per month
1.5 million (papers + supplemental data) / year [citation needed]*
Each 3 mm thick: 4500 m high per year [2]
* Most is not publicly readable

[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
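
The stack-height figure follows directly from the two numbers on the slide. A minimal sketch of the arithmetic, taking the slide's 1.5 million items per year and its estimate of 3 mm per item at face value (the function name is only illustrative):

```python
# Both inputs are the slide's own estimates, not measured values.
PAPERS_PER_YEAR = 1_500_000
THICKNESS_MM = 3


def stack_height_m(papers: int, thickness_mm: float) -> float:
    """Height in metres of a stack of `papers` items, each `thickness_mm` thick."""
    return papers * thickness_mm / 1000  # 1000 mm per metre


height = stack_height_m(PAPERS_PER_YEAR, THICKNESS_MM)
print(f"{height:.0f} m of paper per year")
```

That is a little short of the height of Mont Blanc (roughly 4,800 m), which is why the mountain is used as the picture.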

Scientific and Medical publication (STM)[+]

• World Citizens pay $450,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …) [Lancet 2009]

[+] Figures probably ± 50%
[*] arXiv preprint server costs $7 USD per paper
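
The headline figures are consistent with each other, which can be checked in a few lines. This is only a sketch of the arithmetic using the slide's round numbers (which the slide itself says are probably ± 50%); the variable names are illustrative:

```python
# All inputs are the slide's round estimates.
ARTICLES_PER_YEAR = 1_500_000
COST_TO_CREATE_USD = 300_000   # research cost per article
COST_TO_PUBLISH_USD = 7_000    # "publishing" cost per article
ARXIV_COST_USD = 7             # arXiv cost per paper, from the footnote

research_total = ARTICLES_PER_YEAR * COST_TO_CREATE_USD
publishing_total = ARTICLES_PER_YEAR * COST_TO_PUBLISH_USD
publish_vs_arxiv = COST_TO_PUBLISH_USD / ARXIV_COST_USD

print(f"research:   ${research_total:,}")
print(f"publishing: ${publishing_total:,}")
print(f"publishing costs {publish_vs_arxiv:.0f}x the arXiv figure per paper")
```

The research total reproduces the $450,000,000,000 on the slide, and the publishing total is of the same order as the $10,000,000,000 attributed to academic libraries.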

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these


ContentMine approaches

0. Open software, Open content, Open notebooks

1. Daily liberation of facts which are easy and widely useful.
– Species (Bacillus subtilis, Okapia johnstoni)
– Genes (BRCA1*, APOE)
– Chemicals (acetone, CH3OH)
– Identifiers (RRIDs, museum specimens)

2. CMunities of practice with bespoke tools:
– Clinical Trials
– Phylogenetic trees
– Systematic reviews
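
The "easy and widely useful" facts in item 1 can be illustrated with a dictionary lookup. The sketch below is not the ContentMine implementation: the dictionaries are tiny and hypothetical (seeded with the examples from the slide), and `find_facts` is an illustrative name.

```python
import re

# Hypothetical, tiny dictionaries seeded from the slide's examples;
# a real system would load far larger vocabularies.
DICTIONARIES = {
    "species": ["Bacillus subtilis", "Okapia johnstoni"],
    "gene": ["BRCA1", "APOE"],
    "chemical": ["acetone", "CH3OH"],
}


def find_facts(text):
    """Return sorted (offset, fact_type, term) tuples for each dictionary term in text."""
    facts = []
    for fact_type, terms in DICTIONARIES.items():
        for term in terms:
            # Require non-word characters on both sides so "APOE" does not match inside "APOE4".
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                facts.append((match.start(), fact_type, term))
    return sorted(facts)


sentence = "APOE was studied in Bacillus subtilis grown with acetone."
for offset, fact_type, term in find_facts(sentence):
    print(offset, fact_type, term)
```

Exact-match lookup misses abbreviations such as "B. subtilis" and spelling variants; handling those is where bespoke community tools come in.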

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

http://dx.doi.org/10.1021/ol2015972

After AMI2 processing…..

… AMI2 has detected a square

[Diagram: the ContentMine pipeline]

Sources: EuPMC, arXiv, CORE, HAL, (university repositories); ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)

Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs

Tools: catalogue; getpapers (query); quickscrape (crawl, with scrapers and queries); norma (Normalizer, Structurer, Semantic Tagger); ami (search, Lookup)

Daily crawl; DAILY CONTENTMINING

Tagged content: abstract, methods, references, captioned figures (Fig. 1), HTML tables; text, data, figures

Community plugins: Chem, Phylo, Trials, Crystal, Plants; visualization and analysis

Output: 30,000 pages/day of Semantic ScholarlyHTML; Facts

http://opentrials.net/

ContentMine will work with OpenTrials

“adult nonpregnant patients, aged ≥18 years”,

“randomization sequence using a permuted block design with random block sizes stratified by study center”.

“blinding of the patients and caregivers is not possible”.

“Investigators performing analysis are blinded for the intervention”.

“Continuous normally distributed variables … mean and standard deviation, counts (n) and percentages (%). … Student’s t-test … or the Mann–Whitney U test … Categorical … Chi-square test or Fisher's exact tests. Statistical significance is considered to be at a P value <0.05 …”

Formulaic language in reporting clinical trials
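
Because the wording is so formulaic, a handful of regular expressions already recovers structured facts from sentences like those quoted above. The patterns below are illustrative assumptions written against those quotes, not patterns taken from ContentMine:

```python
import re

# Illustrative patterns only; each was written against the quoted sentences.
AGE_RE = re.compile(r"aged\s*(≥|>=|>)\s*(\d+)\s*years")
P_VALUE_RE = re.compile(r"P\s*value\s*(<=|≤|<)\s*(\d*\.\d+)", re.IGNORECASE)
STAT_TEST_RE = re.compile(
    r"Student[’']s t-test|Mann[–-]Whitney U test|Chi-square test|Fisher[’']s exact tests?"
)


def extract_trial_facts(text):
    """Pull age limit, significance threshold and named statistical tests out of trial text."""
    facts = {"tests": STAT_TEST_RE.findall(text)}
    age = AGE_RE.search(text)
    if age:
        facts["age_limit"] = (age.group(1), int(age.group(2)))
    p_value = P_VALUE_RE.search(text)
    if p_value:
        facts["significance"] = (p_value.group(1), float(p_value.group(2)))
    return facts


sample = (
    "adult nonpregnant patients, aged ≥18 years. Student’s t-test or the "
    "Mann–Whitney U test. Chi-square test or Fisher's exact tests. "
    "Statistical significance is considered to be at a P value <0.05"
)
print(extract_trial_facts(sample))
```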

Text-based plugins

• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions)
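
The first two techniques fit in a few lines of standard-library Python. This is a minimal sketch, assuming the simplest textbook definitions (raw term frequency divided by document length, and idf = log(N / document frequency)); it is not the code of any ContentMine plugin, and the sample sentences are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic words, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(documents):
    """Return one {word: tf-idf score} dict per document."""
    bags = [bag_of_words(doc) for doc in documents]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each document counts once per word
    n_docs = len(bags)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append(
            {word: (count / total) * math.log(n_docs / doc_freq[word])
             for word, count in bag.items()}
        )
    return scores


docs = [
    "the trial was randomized",
    "the trial was blinded",
    "the mice were housed",
]
for doc, score in zip(docs, tf_idf(docs)):
    print(doc, "->", max(score, key=score.get))
```

Words that occur in every document (here "the") score zero, so the highest-scoring word in each document is one that distinguishes it from the others.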


“Bag of Words”

Three fulltext articles from trialsjournal.com

Regular Expressions for Systematic Reviews of Animal Tests

Preceding Text

Following Text

Extracted term

In 30 minutes 6 scientists (most were unfamiliar with regex) wrote 200 regexes for ARRIVE (NC3R guidelines)
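
The preceding text / extracted term / following text layout can be produced with a small helper around `re.finditer`. This is a sketch under assumptions: the pattern for animal counts is a made-up example, not one of the 200 regexes written at the session, and `extract_with_context` is an illustrative name.

```python
import re


def extract_with_context(pattern, text, width=30):
    """Return each match with up to `width` characters of text on either side."""
    results = []
    for match in re.finditer(pattern, text):
        results.append({
            "pre": text[max(0, match.start() - width):match.start()],
            "term": match.group(0),
            "post": text[match.end():match.end() + width],
        })
    return results


# Hypothetical pattern: a count followed by sex and species.
ANIMALS = r"\b\d+\s+(?:male|female)\s+(?:mice|rats)\b"

for hit in extract_with_context(ANIMALS, "We used 24 male mice housed in groups."):
    print(repr(hit["pre"]), repr(hit["term"]), repr(hit["post"]))
```

Showing the context next to each hit lets a domain scientist judge quickly whether a regex is too greedy or too strict, without needing to read the pattern itself.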

TEMPLATES

https://en.wikipedia.org/wiki/Consolidated_Standards_of_Reporting_Trials

Some communities have standard Reporting, which helps extraction




[Figure: plot of Ln bacterial load per fly (y axis, ticks from 6.0 to 11.5) against days post-infection (x axis, ticks 0 to 5)]

Bitmap Image and Tesseract OCR

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points
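
Once the ticks and their labels have been read, turning pixel positions into data values is a linear interpolation per axis. A minimal sketch, assuming linear axes; the pixel coordinates below are hypothetical, and only the tick values (0 to 5 days, 6.0 to 11.5 for Ln bacterial load) come from the plot:

```python
def make_axis(pixel_a, value_a, pixel_b, value_b):
    """Build a pixel -> data-value converter from two ticks on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical pixel positions for the outermost ticks.
x_axis = make_axis(100, 0.0, 600, 5.0)    # days post-infection
y_axis = make_axis(500, 6.0, 60, 11.5)    # image y grows downwards, so pixels decrease

pixel_points = [(100, 500), (350, 280), (600, 60)]
data_points = [(x_axis(px), y_axis(py)) for px, py in pixel_points]
print(data_points)
```

The y axis here is already a logarithm of the measured quantity, so it is treated as linear; a genuinely logarithmic axis would need the interpolation done on log values.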

VECTOR PDF

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter
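
Gaussian smoothing followed by a second derivative is a common way to locate peaks in an extracted spectrum; whether that is exactly how the slide's pipeline uses them is an assumption. The sketch below is pure Python, with illustrative function names and a synthetic single-peak signal:

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve values with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k - radius, 0), last)] for k, w in enumerate(kernel))
        for i in range(len(values))
    ]


def second_derivative(values):
    """Central second difference; result[i] belongs to values[i + 1]."""
    return [values[i - 1] - 2 * values[i] + values[i + 1] for i in range(1, len(values) - 1)]


def find_peaks(values, sigma=1.0, radius=3):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(values, sigma, radius))
    return [
        i + 1
        for i in range(1, len(d2) - 1)
        if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] < d2[i + 1]
    ]


# Synthetic spectrum: one Gaussian-shaped peak centred on index 10.
spectrum = [math.exp(-((i - 10) ** 2) / 8) for i in range(21)]
print(find_peaks(spectrum))
```

Smoothing first matters because differentiating twice amplifies pixel-level noise from the extraction step.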

Automatic extraction

Aves

Apterygidae

Marsupialia

Monotremata

Mammalia

Reptilia

Amphibia

Arthropoda

Myriapodia

Okapia johnstoni

Pyrus

Stuffed Tree of Life

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick
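
Serializing a recovered tree to Newick is a short recursion. This is a sketch, assuming the tree has already been reduced to nested `(name, branch_length, children)` tuples; that representation and the function name are illustrative, not the format Norma or ami uses internally.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple to Newick, without the final ';'."""
    name, length, children = node
    text = name or ""
    if children:
        # Internal node: children in parentheses, then the optional node label.
        text = "(" + ",".join(to_newick(child) for child in children) + ")" + text
    if length is not None:
        text += f":{length}"
    return text


# Toy tree with made-up labels and branch lengths.
tree = ("", None, [
    ("A", 1, []),
    ("", 3, [("B", 2, []), ("C", 2, [])]),
])
print(to_newick(tree) + ";")
```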





PMR’s Tribute

Planned Memorial Meeting July 14th 2014 Cambridge

OPEN NOTEBOOK SCIENCE

http://blogs.ch.cam.ac.uk/pmr/2014/05/19/jean-claude-bradley-hero-of-open-notebook-science-it-must-become-the-central-way-of-doing-science/

Traditional Research and Publication

[Diagram labels: “Lab” work; paper/thesis; write; rewrite; re-experiment; publish; ???; Validation??; DATA; output “belongs” to publisher]

Open Notebook Science

[Diagram labels: TOOLS; open engineered repository; world community; INSTRUMENT; validate; merge; MODEL; CODE; DATA; knowledge; calibrate]

Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC

Machines and humans working together

Open Notebook Content Mining

• “No insider knowledge”
• Anyone can become involved
• All raw non-copyright material on Github
• Planning and discussion on Open Discourse
• All output (however imperfect) on Github CC0
• Immediate upload
• Inspired by Free/Libre/Open Source, Wikipedia, Open StreetMap

4300 images

“Root”

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
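
"Computable" means the Newick string can be read back by a program. As a minimal illustration, the sketch below pulls leaf names and branch lengths out with one regular expression; it is not a full Newick parser (no quoted labels, no comments), and the input is a complete subtree copied from the string above.

```python
import re

# A leaf is a label that directly follows "(" or "," and is followed by ":length".
LEAF_RE = re.compile(r"[(,]\s*([^(),:;\s]+):(\d+(?:\.\d+)?)")


def leaves(newick):
    """Return (species name, branch length) for every leaf in a Newick string."""
    return [
        (name.replace("_", " "), float(length))
        for name, length in LEAF_RE.findall(newick)
    ]


subtree = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
for name, length in leaves(subtree):
    print(name, length)
```

Internal branch lengths such as `:86` follow a closing parenthesis, so the pattern skips them.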

Automatic Open Notebook of computations

Everything is posted to Github before being analyzed

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata]
• Missing = not found with Wikidata API
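
A lookup of this kind can be made with the `wbsearchentities` action of the Wikidata API. The sketch below only builds the request URL and reads the identifier out of a response; whether the talk's lookup used this action is an assumption, and the function names are illustrative. Fetching the URL (for example with `urllib.request.urlopen`) is left out so that the block runs offline.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(name, language="en", limit=1):
    """Build a wbsearchentities request URL for a label such as a species name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def first_identifier(response):
    """Return the id of the first hit in a decoded JSON response, or None if missing."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


print(wikidata_search_url("Bacillus subtilis"))
```

A `None` result corresponds to the "Missing" case in the legend: the name was searched for but no item came back.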

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.

Supertree for 924 species

Tree

Supertree created from 4300 papers

Minor branch

Part of major branch

Ideas for Neuroscience

Can we extract digital information from published electroneurophysiology traces?...

…and build super-information?

Raw trace (pixels)

Thinned trace (pixels)

Line segments (SVG)

Reconstructed trace (SVG)

Extraction into data format (CSV, Excel)
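
The first and last steps of this chain can be sketched in standard-library Python. This is a deliberately crude stand-in: "thinning" here just averages the set pixels in each image column, which works for a single-valued trace but is not a real skeletonization algorithm, and the tiny bitmap is made up.

```python
import csv
import io


def thin_trace(bitmap):
    """For each column of a 0/1 bitmap, return (x, mean y of the set pixels)."""
    width = len(bitmap[0]) if bitmap else 0
    points = []
    for x in range(width):
        ys = [y for y, row in enumerate(bitmap) if row[x]]
        if ys:  # skip columns the trace does not cross
            points.append((x, sum(ys) / len(ys)))
    return points


def to_csv(points):
    """Serialize (x, y) points as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["x", "y"])
    writer.writerows(points)
    return buffer.getvalue()


# Made-up 3x3 bitmap: rows are image lines, 1 marks a trace pixel.
bitmap = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]
print(to_csv(thin_trace(bitmap)))
```

The values are still in pixel units; converting them to time and voltage needs the axis or scale-bar calibration from the figure.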

[Diagram repeated: the ContentMine pipeline as above, from sources (EuPMC, arXiv, CORE, HAL, publisher sites) through getpapers, quickscrape, norma and ami to community plugins (Chem, Phylo, Trials, Crystal, Plants) and Facts]

Peter Murray-Rust

BMC publisher

Blue Obelisk paper (20 co-authors)

Sub-network From CATalog

Phytochemistry extraction

O. dayi

“volatile composition of “

A.sibeiri

A. judaica

Displayed by CAT (CottageLabs)

The problem

©

Prof. Ian Hargreaves (2011): “David Cameron's exam question”: “Could it be true that laws designed more than three centuries ago with the express purpose of creating economic incentives for innovation by protecting creators' rights are today obstructing innovation and economic growth?” “Yes. We have found that the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed.”

“Digital Opportunity” by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg

Elsevier wants to control Open Data

[asked by Michelle Brook]

http://www.epip2015.org/copyright-wars-frozen-conflict/

UPDATE 20150902: Ian Hargreaves "the voices of the digital many should not be drowned out by the digital self-interested few"


contentmine.org team