ContentMining in Neuroscience

Open Mining of the Bioscience Literature. Peter Murray-Rust, ContentMine.org and the University of Cambridge. UNAM, MX, 2015-10-09. Millions of data points are hidden in the bioscience literature. ContentMine has Open technology to liberate them automatically. Using OpenNotebook approaches. The major problem is politico-legal. This is an exploratory talk, looking for ideas and projects. The future depends on young people.

Upload: thecontentmine

Post on 14-Apr-2017

TRANSCRIPT

Page 1: ContentMining in Neuroscience

Open Mining of the Bioscience Literature

Peter Murray-Rust, ContentMine.org and the University of Cambridge

UNAM, MX 2015-10-09

Millions of data points are hidden in the bioscience literature.
ContentMine has Open technology to liberate them automatically.

Using OpenNotebook approaches.
The major problem is politico-legal.

This is an exploratory talk, looking for ideas and projects.
The future depends on young people.

Page 2: ContentMining in Neuroscience

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 3: ContentMining in Neuroscience

Panton Authors and Fellows

Page 4: ContentMining in Neuroscience

Some particularly relevant Fellows/Alumni and projects:
• Rufus Pollock: Open Knowledge Foundation
• Mark Surman: Mozilla
• Dan Whaley: Hypothes.is
• Daniel Lombrana-Gonzales: PyBossa/Crowdcrafting

Erin McKiernan, 2015 Flash Award

ContentMine and Peter Murray-Rust are funded by:

Page 5: ContentMining in Neuroscience

The Right to Read is the Right to Mine

http://contentmine.org

Page 6: ContentMining in Neuroscience

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Page 7: ContentMining in Neuroscience

Typical scientific paper

Page 8: ContentMining in Neuroscience

Why do we publish science?

• Communicate our results
• Archival
• Get feedback from peers
• Provide material that others can re-use
• Priority and esteem

Page 9: ContentMining in Neuroscience
Page 10: ContentMining in Neuroscience

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

[Liberian Ministry of Health] were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 11: ContentMining in Neuroscience

Re-use

You cannot assume how others will want to re-use your work.

Page 12: ContentMining in Neuroscience

PM-R’s “first real paper”, doing science by re-using the results of others in a novel way

Page 13: ContentMining in Neuroscience

1974: Each point is a separate paper! Each needed 1-4 hours in the library: discovery, hardcopy delivery, transcription, hand calculation.

Page 14: ContentMining in Neuroscience

1976-9: PMR and WDSM developed software and protocols to search and analyze the Cambridge Crystallographic DB

Page 15: ContentMining in Neuroscience

We need machines to read the literature

Page 16: ContentMining in Neuroscience

Output of scholarly publishing

586,364 Crossref DOIs 201,507 [1] per month
1.5 million (papers + supplemental data) / year [citation needed]*
each 3 mm thick: 4500 m high per year [2]

* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
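The stack-height figure on this slide can be checked with one line of arithmetic. The sketch below uses only the two numbers quoted on the slide (1.5 million items per year, 3 mm each); the function name is illustrative.

```python
# Figures quoted on the slide; the "stack of paper" model is only an illustration.
ITEMS_PER_YEAR = 1_500_000   # papers + supplemental data per year
THICKNESS_MM = 3             # thickness assumed for each printed item


def stack_height_m(items: int, thickness_mm: float) -> float:
    """Height in metres of `items` documents stacked flat."""
    return items * thickness_mm / 1000


height = stack_height_m(ITEMS_PER_YEAR, THICKNESS_MM)
print(f"{height:.0f} m of paper per year")  # 4500 m
```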

Page 17: ContentMining in Neuroscience

Scientific and Medical publication (STM)[+]

• World Citizens pay $450,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …) [Lancet 2009]

[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
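The headline numbers on this slide are mutually consistent, as a quick calculation shows. This is a sketch using only the slide's own round figures (which the slide itself flags as uncertain by about 50%); the variable names are illustrative.

```python
# Round figures from the slide (stated there as probably +/- 50%).
ARTICLES_PER_YEAR = 1_500_000
COST_TO_CREATE_USD = 300_000    # research cost per article
COST_TO_PUBLISH_USD = 7_000     # publishing cost per article
ARXIV_COST_USD = 7              # arXiv cost per paper, from the footnote

total_research = ARTICLES_PER_YEAR * COST_TO_CREATE_USD
total_publishing = ARTICLES_PER_YEAR * COST_TO_PUBLISH_USD
publish_vs_arxiv = COST_TO_PUBLISH_USD // ARXIV_COST_USD

print(f"research:   ${total_research:,}")      # $450,000,000,000
print(f"publishing: ${total_publishing:,}")    # $10,500,000,000
print(f"publishing costs {publish_vs_arxiv}x the arXiv figure per paper")
```

The publishing total comes out at $10.5 billion, which matches the slide's rounded $10,000,000,000.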

Page 18: ContentMining in Neuroscience

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 19: ContentMining in Neuroscience

ContentMine approaches

0. Open software, Open content, Open notebooks

1. Daily liberation of facts which are easy and widely useful:
– Species (Bacillus subtilis, Okapia johnstoni)
– Genes (BRCA1*, APOE)
– Chemicals (acetone, CH3OH)
– Identifiers (RRIDs, museum specimens)

2. CMunities of practice with bespoke tools:
– Clinical Trials
– Phylogenetic trees
– Systematic reviews
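One simple way to find "easy" facts such as species, genes and chemicals is dictionary lookup. The sketch below is an illustration only, not ContentMine's implementation: the dictionaries hold just the examples from the slide, and `find_facts` is a hypothetical helper name.

```python
import re

# Hypothetical mini-dictionaries built from the slide's examples;
# real dictionaries would hold many thousands of terms.
DICTIONARIES = {
    "species": ["Bacillus subtilis", "Okapia johnstoni"],
    "gene": ["BRCA1", "APOE"],
    "chemical": ["acetone", "CH3OH"],
}


def find_facts(text):
    """Return sorted (offset, fact_type, term) for each dictionary term found."""
    hits = []
    for fact_type, terms in DICTIONARIES.items():
        for term in terms:
            # Require non-word characters on both sides so "APOE" does not
            # match inside a longer token.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                hits.append((match.start(), fact_type, term))
    return sorted(hits)


sample = "APOE expression in Okapia johnstoni was measured after acetone extraction."
facts = find_facts(sample)
for offset, fact_type, term in facts:
    print(offset, fact_type, term)
```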

Page 20: ContentMining in Neuroscience

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 21: ContentMining in Neuroscience

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 22: ContentMining in Neuroscience

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 23: ContentMining in Neuroscience

After AMI2 processing…..

… AMI2 has detected a square

Page 24: ContentMining in Neuroscience
Page 25: ContentMining in Neuroscience

Diagram: the ContentMine pipeline (DAILY CONTENTMINING). Labels, grouped:

Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; Publisher Sites (PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…)

Retrieval: catalogue; getpapers (query); crawl (Daily Crawl); quickscrape (scrapers, queries)

Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs

norma: Normalizer, Structurer, Semantic Tagger (Text, Data, Figures)

Semantic ScholarlyHTML, 30,000 pages/day: abstract, methods, references, Captioned Figures (Fig. 1), HTML tables

ami: search, Lookup, taggers, plugins

COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants

Visualization and Analysis

Facts

Page 26: ContentMining in Neuroscience
Page 27: ContentMining in Neuroscience

http://opentrials.net/

ContentMine will work with OpenTrials

Page 28: ContentMining in Neuroscience

“adult nonpregnant patients, aged ≥18 years”,

“randomization sequence using a permuted block design with random block sizes stratified by study center”.

“blinding of the patients and caregivers is not possible”.

“Investigators performing analysis are blinded for the intervention”.

“Continuous normally distributed variables … mean and standard deviation, counts (n) and percentages (%). … Student’s t-test … or the Mann–Whitney U test … Categorical … Chi-square test or Fisher's exact tests. Statistical significance is considered to be at a P value <0.05 …”

Formulaic language in reporting clinical trials
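Because this language is so formulaic, a handful of regular-expression templates can pull out structured values. The sketch below runs against phrases quoted on this slide; the template names and patterns are illustrative assumptions, not the plugin actually used.

```python
import re

# Hypothetical templates, written to match the phrases quoted on the slide.
TEMPLATES = {
    "min_age_years": re.compile(r"aged\s*(?:≥|>=)\s*(\d+)\s*years"),
    "p_threshold": re.compile(r"P value\s*<\s*(0?\.\d+)"),
    "stat_test": re.compile(
        r"(Student[’']s t-test|Mann[–-]Whitney U test|"
        r"Chi-square test|Fisher[’']s exact tests?)"
    ),
}


def apply_templates(text):
    """Return every captured value for each template, in text order."""
    return {name: pattern.findall(text) for name, pattern in TEMPLATES.items()}


text = (
    "adult nonpregnant patients, aged ≥18 years. "
    "Student’s t-test or the Mann–Whitney U test; Chi-square test or "
    "Fisher's exact tests. Statistical significance is considered to be "
    "at a P value <0.05"
)
result = apply_templates(text)
for name, values in result.items():
    print(name, values)
```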

Page 29: ContentMining in Neuroscience

Text-based plugins

• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions)
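The first two techniques fit in a few lines of standard-library Python. This is a minimal sketch using one common tf–idf variant (relative term frequency times the natural log of N/df); the three short "documents" are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


# Made-up example documents (not taken from the talk).
docs = [
    "patients were randomized to treatment",
    "patients received placebo",
    "mice were randomized",
]
scores = tf_idf(docs)
top = max(scores[0], key=scores[0].get)
print(top, round(scores[0][top], 3))
```

Words that occur in only one document score higher than words shared between documents, which is why tf–idf is useful for spotting what is distinctive about a paper.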

Page 30: ContentMining in Neuroscience

“Bag of Words”

Three fulltext articles from trialsjournal.com

Page 31: ContentMining in Neuroscience

Regular Expressions for Systematic Reviews of Animal Tests

Columns: Preceding Text, Extracted term, Following Text

In 30 minutes 6 scientists (most were unfamiliar with regex) wrote 200 regexes for ARRIVE (NC3R guidelines)
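A regex search that reports each extracted term with its preceding and following text can be sketched as follows. The pattern and the sentence are invented for illustration and are not among the 200 regexes written in the session.

```python
import re


def regex_in_context(pattern, text, width=20):
    """Return (preceding text, extracted term, following text) per match."""
    rows = []
    for match in re.finditer(pattern, text):
        preceding = text[max(0, match.start() - width):match.start()]
        following = text[match.end():match.end() + width]
        rows.append((preceding, match.group(0), following))
    return rows


# Hypothetical sentence and pattern, in the spirit of an ARRIVE-style check.
text = "Twenty male C57BL/6 mice were randomly allocated to two groups."
rows = regex_in_context(r"randomly (?:allocated|assigned)", text, width=10)
for preceding, term, following in rows:
    print(repr(preceding), repr(term), repr(following))
```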

Page 32: ContentMining in Neuroscience

TEMPLATES

Page 33: ContentMining in Neuroscience

https://en.wikipedia.org/wiki/Consolidated_Standards_of_Reporting_Trials

Some communities have standard Reporting, which helps extraction

Page 34: ContentMining in Neuroscience
Page 35: ContentMining in Neuroscience

[Figure: plot of Ln bacterial load per fly (y axis; tick labels as read: 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0) against days post-infection (x axis; ticks 0 to 5)]

Bitmap Image and Tesseract OCR

Page 36: ContentMining in Neuroscience
Page 37: ContentMining in Neuroscience

Plot elements labelled on a VECTOR PDF figure: UNITS, TICKS, QUANTITY, SCALE, TITLES, and DATA!! (2000+ points)
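Once two tick marks on an axis have been identified, any pixel coordinate can be converted to a data value by linear interpolation. The sketch below assumes linear axes; the pixel positions and tick values are hypothetical.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value,
    given two ticks with known pixel positions and values (linear axis)."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical tick positions: x ticks "0" and "5", y ticks "9.0" and "11.5".
x_map = make_axis_map(100, 0.0, 600, 5.0)
y_map = make_axis_map(400, 9.0, 150, 11.5)   # image y runs downward

point_px = (350, 275)
point_data = (x_map(point_px[0]), y_map(point_px[1]))
print(point_data)
```

Because the map is built from two ticks, it handles the inverted image y axis automatically: the higher tick value simply sits at the smaller pixel coordinate.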

Page 38: ContentMining in Neuroscience

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
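The two processing steps named on this slide, Gaussian smoothing followed by a second derivative, are a standard way to locate peaks: a peak shows up as a negative minimum of the second derivative. The following pure-Python sketch works on an evenly spaced intensity list; the function names, kernel radius and test spectrum are assumptions, not the pipeline's actual code.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0):
    """Gaussian filter; indices outside the data are clamped to the edges."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k - radius, 0), last)]
            for k, w in enumerate(kernel))
        for i in range(len(values))
    ]


def second_derivative(values):
    """Central finite difference with unit spacing; endpoints are set to 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2 * values[i] + values[i + 1]
    return d2


def peak_positions(values, sigma=1.0):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(values, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] < d2[i + 1]]


# Synthetic spectrum: two Gaussian peaks at indices 20 and 60.
spectrum = [math.exp(-((i - 20) ** 2) / 18) + 0.5 * math.exp(-((i - 60) ** 2) / 18)
            for i in range(100)]
peaks = peak_positions(spectrum)
print(peaks)
```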

Page 39: ContentMining in Neuroscience

PLUTo

Page 40: ContentMining in Neuroscience

Aves

Apterygidae

Marsupialia

Monotremata

Mammalia

Reptilia

Amphibia

Arthropoda

Myriapoda

Okapia johnstoni

Pyrus

Stuffed Tree of Life

Page 42: ContentMining in Neuroscience

PMR’s Tribute

Planned Memorial Meeting July 14th 2014 Cambridge

OPEN NOTEBOOK SCIENCE

Page 43: ContentMining in Neuroscience

Traditional Research and Publication

Diagram labels: “Lab” work; write; rewrite; re-experiment; paper/thesis; publish; ???; validation??; DATA

Output “belongs” to publisher

Page 44: ContentMining in Neuroscience

Diagram labels: Open Notebook Science; Open engineered repository; World community; TOOLS; INSTRUMENT; MODEL; CODE; DATA; knowledge; validate; merge; calibrate

Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC

Machines and humans working together

Page 45: ContentMining in Neuroscience

Open Notebook Content Mining

• “No insider knowledge”
• Anyone can become involved
• All raw non-copyright material on Github
• Planning and discussion on Open Discourse
• All output (however imperfect) on Github CC0
• Immediate upload

• Inspired by Free/Libre/Open Source, Wikipedia, Open StreetMap.

Page 46: ContentMining in Neuroscience

4300 images

Page 47: ContentMining in Neuroscience

“Root”

Page 48: ContentMining in Neuroscience

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
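The parenthesised string above is in Newick format, which is what makes the output computable. As a small illustration, the leaf taxa and their branch lengths can be listed with a regular expression. The fragment below is a shortened, re-closed piece of the slide's tree (the full string on the slide is elided), and the regex is a simplification that ignores quoted labels and comments.

```python
import re

# A leaf is a name that directly follows "(" or "," and is followed by ":length".
# Internal-node lengths follow ")" and are therefore skipped.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)\s*:\s*([0-9.]+)")


def newick_leaves(newick):
    """Return [(taxon, branch_length)] for every leaf in a Newick string."""
    return [(name.replace("_", " "), float(length))
            for name, length in LEAF.findall(newick)]


# Shortened fragment of the slide's tree, closed by hand for this example.
fragment = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
            "Synergistes_jonesii:301);")
leaves = newick_leaves(fragment)
for taxon, length in leaves:
    print(taxon, length)
```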

Page 49: ContentMining in Neuroscience
Page 50: ContentMining in Neuroscience

Automatic Open Notebook of computations

Everything is posted to Github before being analyzed

Page 51: ContentMining in Neuroscience

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens

• [Identifier in Wikidata]
• Missing = not found with Wikidata API

20 commonest organisms (in > 30 papers) in trees from IJSEM*

Half do not appear to be in Wikidata

Can the Wikipedia Scientists comment?

*Int. J. Syst. Evol. Microbiol.

Page 52: ContentMining in Neuroscience

Supertree for 924 species

Tree

Page 53: ContentMining in Neuroscience

Supertree created from 4300 papers

Page 54: ContentMining in Neuroscience

Minor branch

Page 55: ContentMining in Neuroscience

Part of major branch

Page 56: ContentMining in Neuroscience

Part of major branch

Page 57: ContentMining in Neuroscience

Ideas for Neuroscience

Can we extract digital information from published electroneurophysiology traces?...

…and build super-information?

Page 58: ContentMining in Neuroscience
Page 59: ContentMining in Neuroscience
Page 60: ContentMining in Neuroscience
Page 61: ContentMining in Neuroscience

Raw trace (pixels)

Page 62: ContentMining in Neuroscience

Thinned trace (pixels)

Page 63: ContentMining in Neuroscience

Line segments (SVG)

Page 64: ContentMining in Neuroscience

Reconstructed trace (SVG)

Page 65: ContentMining in Neuroscience

Extraction into data format (CSV, Excel)
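The steps on the preceding slides (raw pixels, thinned trace, data file) can be sketched in miniature. This toy example thins a trace by taking the mean dark-pixel row in each column and writes the result as CSV; the real pipeline's thinning and segmentation are more involved, and the tiny "image" here is made up.

```python
import csv
import io


def thin_trace(image):
    """Collapse the dark pixels ('#') in each column to their mean row."""
    points = []
    for x in range(len(image[0])):
        rows = [y for y, line in enumerate(image) if line[x] == "#"]
        if rows:
            points.append((x, sum(rows) / len(rows)))
    return points


def to_csv(points):
    """Serialise (x, y) pixel points as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["x_pixel", "y_pixel"])
    writer.writerows(points)
    return buffer.getvalue()


# Made-up 6x4 binary image of a thick, arch-shaped trace.
image = [
    "..##..",
    ".####.",
    "##..##",
    "#....#",
]
points = thin_trace(image)
csv_text = to_csv(points)
print(csv_text)
```

The output is still in pixel units; converting it to physical units needs the axis calibration read from the figure.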

Page 66: ContentMining in Neuroscience

[Diagram repeated from Page 25: the ContentMine pipeline, from sources and retrieval tools (catalogue, getpapers, crawl, quickscrape) through norma and ami with COMMUNITY plugins to Facts]

Page 67: ContentMining in Neuroscience

Peter Murray-Rust

BMC publisher

Blue Obelisk paper (20 co-authors)

Sub-network From CATalog

Page 68: ContentMining in Neuroscience

Phytochemistry extraction

O. dayi

“volatile composition of “

A. sieberi

A. judaica

Displayed by CAT (CottageLabs)

Page 69: ContentMining in Neuroscience

The problem

©

Page 70: ContentMining in Neuroscience

Prof. Ian Hargreaves (2011), on “David Cameron's exam question”: “Could it be true that laws designed more than three centuries ago with the express purpose of creating economic incentives for innovation by protecting creators' rights are today obstructing innovation and economic growth?” “Yes. We have found that the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed.”

Image: “Digital Opportunity” by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg

Page 71: ContentMining in Neuroscience

Elsevier wants to control Open Data

[asked by Michelle Brook]

Page 72: ContentMining in Neuroscience

http://www.epip2015.org/copyright-wars-frozen-conflict/

UPDATE 20150902: Ian Hargreaves "the voices of the digital many should not be drowned out by the digital self-interested few"

Page 73: ContentMining in Neuroscience
Page 74: ContentMining in Neuroscience

contentmine.org team

Page 75: ContentMining in Neuroscience