automatic extraction of science and medicine from the scholarly literature


Upload: thecontentmine

Post on 20-Jan-2017


TRANSCRIPT

Page 1: Automatic Extraction of Science and Medicine from the scholarly literature

Automatic Extraction of Science and Medicine from the scholarly literature

Peter Murray-Rust, contentmine.org

CAMARADES group UK, Edinburgh, 2015-05-27

• OPEN Platform for Machines+humans to automatically “read” the STM literature and extract facts

• Grow communities and give everyone the tools and know-how to mine

Page 2: Automatic Extraction of Science and Medicine from the scholarly literature

Background
• ContentMine aims to make large areas of scientific fact OPEN (100 million facts/year)
• We’re working with Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• 2015 Wellcome Trust workshop on TDM and Neuroscience; “rough consensus” on what was needed.
• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben Goldacre)
• In the last few months we’ve prototyped a good starting point… the software is alpha/beta.

Page 3: Automatic Extraction of Science and Medicine from the scholarly literature

Regular Expressions for Systematic Reviews of Animal Tests

[Figure: regex matches shown with Preceding Text, Extracted term, and Following Text]

Today’s results: we searched papers for 200 regex-based terms and got ca. 100 hits per paper.
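The regex search with preceding and following text can be sketched roughly as below. This is a minimal illustration, not the ContentMine code: the two patterns, the `TERMS` dictionary, and the `find_terms` function are hypothetical stand-ins for the 200 terms used in the talk, which are not shown on the slide.

```python
import re

# Hypothetical term dictionary; the 200 regexes used in the talk are not shown.
TERMS = {
    "randomisation": re.compile(r"\brandomi[sz](?:ed|ation)\b", re.IGNORECASE),
    "blinding": re.compile(r"\bblind(?:ed|ing)\b", re.IGNORECASE),
}


def find_terms(text, terms=TERMS, window=30):
    """Return one record per regex hit, with its preceding and following text."""
    hits = []
    for name, pattern in terms.items():
        for match in pattern.finditer(text):
            hits.append({
                "term": name,
                "start": match.start(),
                "pre": text[max(0, match.start() - window):match.start()],
                "match": match.group(0),
                "post": text[match.end():match.end() + window],
            })
    # Report hits in reading order rather than grouped by term.
    return sorted(hits, key=lambda hit: hit["start"])


if __name__ == "__main__":
    sample = "Mice were randomized to two groups and assessors were blinded to treatment."
    for hit in find_terms(sample):
        print(repr(hit["pre"]), repr(hit["match"]), repr(hit["post"]))
```

Keeping the context window next to each hit lets a human reviewer judge quickly whether the match is a true mention or a false positive.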

Page 4: Automatic Extraction of Science and Medicine from the scholarly literature

Questions we can tackle
• How do we find (mentions of) clinical/animal trials?
• Is a document a trial?
• What is the subject of the trial?
• What is the methodology used?
• Does the design and practice conform to CONSORT/ARRIVE?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who are involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?

Page 5: Automatic Extraction of Science and Medicine from the scholarly literature

Linked Open Data – the world’s knowledge

Very little physical science, and THESES?? http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

[Figure: LOD Cloud Diagram, September 2011. Labelled regions: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledgebases; RDF triples]

Page 6: Automatic Extraction of Science and Medicine from the scholarly literature

Liberation Software

Page 7: Automatic Extraction of Science and Medicine from the scholarly literature

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 8: Automatic Extraction of Science and Medicine from the scholarly literature

“Free” and “Open”

• “Free software is a matter of liberty, not price. ‘Free speech’, not ‘free beer’.” (R M Stallman)

• “A piece of data or content is open if anyone is free to use, reuse, and redistribute it” (OKFN) http://opendefinition.org/

• “open” (access) has multiple incompatible “definitions”. Major split is “human eyeballs” vs copying and machine “reusability”

• “Open” is a marketing term for publishers, who frequently (often deliberately) do not grant full Openness.

“Gratis” vs “Libre”

Page 9: Automatic Extraction of Science and Medicine from the scholarly literature

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …

…Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)

Page 10: Automatic Extraction of Science and Medicine from the scholarly literature

Scientific and Medical publication (STM) [+]

• World Citizens pay $400,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …)

[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
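The per-article figures above appear to follow from dividing the totals by the article count. The short calculation below checks that reading. It assumes (the slide does not say so explicitly) that the $300,000 comes from total research spend and the $7000 from academic library spend; the variable names are illustrative only.

```python
# Totals as quoted on the slide (stated there to be accurate to about +-50%).
total_research_spend = 400_000_000_000  # USD per year
library_spend = 10_000_000_000          # USD per year
articles_per_year = 1_500_000
arxiv_cost_per_paper = 7                # USD, from the slide's footnote

# Assumed derivation: per-article figures are simple quotients of the totals.
cost_to_create = total_research_spend / articles_per_year
cost_to_publish = library_spend / articles_per_year
publish_vs_arxiv = cost_to_publish / arxiv_cost_per_paper

print(f"cost to create one article:  ${cost_to_create:,.0f}")
print(f"cost to publish one article: ${cost_to_publish:,.0f}")
print(f"publishing cost relative to arXiv: {publish_vs_arxiv:,.0f}x")
```

With these inputs the quotients come out near $267,000 and $6,700 per article, both within the slide's stated tolerance of the rounded $300,000 and $7000.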

Page 11: Automatic Extraction of Science and Medicine from the scholarly literature

• “creative use of these large data sets in the US health care sector could generate more than $300bn in value per annum” [MGI, McKinsey]

• Gartner Inc. has identified 'Big Data' and 'Next-Generation Analytics' as two of the 'Top 10 Strategic Technologies' for 2012.

• Given the volume of text generated by business, academic and social activities – in for example competitor reports, research publications or customer opinions on social networking sites – text mining is, however, highly important. [JISC]

• there are some tasks that simply could not be achieved without using text mining. For example, a major pharmaceutical company used text mining tools to evaluate 50,000 patents in 18 months. This would have taken 50 person years to achieve manually, meaning that it would not even have been contemplated. [JISC]

“Big Data” and Analytics (ContentMining)

Page 12: Automatic Extraction of Science and Medicine from the scholarly literature

Prof. Ian Hargreaves (2011): “David Cameron's exam question”: “Could it be true that laws designed more than three centuries ago with the express purpose of creating economic incentives for innovation by protecting creators' rights are today obstructing innovation and economic growth?” “Yes. We have found that the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed.”

Image: "Digital Opportunity" by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg

Page 13: Automatic Extraction of Science and Medicine from the scholarly literature

PUBLISHER TDM LICENCE INITIATIVES GENERALLY DO NOT HELP

• Publishers have started offering their own TDM licences and policies
• Their licences often impose unfair (and in the case of the UK, unenforceable) constraints on researchers’ freedom to exploit TDM, e.g., requiring users to employ publisher’s API, putting unnecessary restrictions on how much can be copied, or how fast it can be copied.
• Why “unenforceable”? Because, as noted earlier, UK law specifically states that any contract or licence term that prevents anyone from doing TDM in the manner prescribed in the new exception shall be deemed null and void.
• Really need a test case on these attempted restrictions.
• Springer and Royal Society offer generous TDM provisions.
• So why are so many publishers offering restrictive licences in the UK? Maybe they hope licensees are ignorant of the strength of the new law, or the publishers in fact don’t know about it. So they are either deliberately misleading, or ignorant.

Prof Charles Oppenheim and contentmine.org

Page 14: Automatic Extraction of Science and Medicine from the scholarly literature

Elsevier wants to control Open Data

[asked by Michelle Brook]

Page 15: Automatic Extraction of Science and Medicine from the scholarly literature

Pirate Party, MEP

Page 16: Automatic Extraction of Science and Medicine from the scholarly literature

The Right to Read is the Right to Mine

http://contentmine.org

Page 17: Automatic Extraction of Science and Medicine from the scholarly literature

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter? Indexed by CAT

Page 18: Automatic Extraction of Science and Medicine from the scholarly literature

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

[Figure labels on the example article: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH]

contentmine.org tackles these

Page 19: Automatic Extraction of Science and Medicine from the scholarly literature

[Diagram: Daily Crawl of 2000-5000 articles. Sources: PLoSONE, BMC1, BMC2, Closed1, Closed2, Hybrid. Process: Crawl … Scrape … Normalize … Mine. Outputs: CATalog, enhanced annotated articles, FACTS, Linked Open Data, Semantic Scientific Objects]

Page 20: Automatic Extraction of Science and Medicine from the scholarly literature

What is “Content”?

Page 21: Automatic Extraction of Science and Medicine from the scholarly literature

Machine-Human symbioses

• Wikipedia
• OpenStreetMap
• Google

We aim to make it trivial for a human+machine to mine the scientific literature, by building communities.

Page 22: Automatic Extraction of Science and Medicine from the scholarly literature

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Page 23: Automatic Extraction of Science and Medicine from the scholarly literature

Facts Marked by “non-scientists” in ContentMine workshops

With Wikipedia everyone can be a scientist

Page 24: Automatic Extraction of Science and Medicine from the scholarly literature

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 25: Automatic Extraction of Science and Medicine from the scholarly literature

Workshops (1-hour -> full day or more)

2014 May to November
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund, AT
• OKFest, DE
• Eur. Bioinformatics Institute
• Open Science, Rio de Janeiro, BR
• SciDataCon, Delhi, IN
• Univ of Chicago, US
• OpenCon 2014, Washington DC, US
• JISC, London

Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• Europe PubMed Central

Page 26: Automatic Extraction of Science and Medicine from the scholarly literature

contentmine.org Infrastructure

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

… Open semantic science …
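The crawl, scrape, normalize, mine, and catalogue stages can be pictured as a chain of small functions. The sketch below is a toy, in-memory illustration of that flow only: `FAKE_WEB`, the function names, and the two-species pattern are all invented here and are not the actual quickscrape, Norma, AMI, or CAT interfaces.

```python
import re
from collections import defaultdict

# Toy stand-in for the web: URL -> raw page. Entirely made up for illustration.
FAKE_WEB = {
    "http://example.org/a1": "<html><body><p>Mus  musculus was used.</p></body></html>",
    "http://example.org/a2": "<html><body><p>No animals; Homo sapiens only.</p></body></html>",
}

# Hypothetical "plugin": a fixed list of species names.
SPECIES = re.compile(r"\b(Mus musculus|Homo sapiens)\b")


def crawl(web):
    """CRAWL: list the documents available today."""
    return sorted(web)


def scrape(web, url):
    """SCRAPE: fetch the raw page for one URL."""
    return web[url]


def normalize(raw_html):
    """NORMALIZE: strip tags and collapse whitespace into plain text."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return " ".join(text.split())


def mine(text):
    """MINE: return the distinct facts (here, species mentions) in the text."""
    return sorted(set(SPECIES.findall(text)))


def catalogue(facts_by_url):
    """CATALOGUE: invert the results into a fact -> documents index."""
    index = defaultdict(list)
    for url, facts in facts_by_url.items():
        for fact in facts:
            index[fact].append(url)
    return dict(index)


def run_pipeline(web):
    facts_by_url = {url: mine(normalize(scrape(web, url))) for url in crawl(web)}
    return catalogue(facts_by_url)


if __name__ == "__main__":
    for fact, urls in run_pipeline(FAKE_WEB).items():
        print(fact, urls)
```

Keeping each stage as a separate function mirrors the slide's point that people can plug in their own methods at the mining step without touching the rest.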

Page 27: Automatic Extraction of Science and Medicine from the scholarly literature

[Diagram: CANARY pipeline. Stages: Crawl, Feed, quickscrape, Norma, AMI, Index & Transform, CAT-alogue index. Sources: scientific literature, repositories (URL, DOI). Formats: PDF, XML, DOC, CSV, bad HTML, OCR, diagrams, sHTML. Scrapers: XPath, per-journal. Taggers: per-journal. Plugins: Regex, Sequences, Species, Metadata, Chemistry, Phylogenetics, Farming, Bespoke. Output: Open NORMA-lized Scientific Literature + Facts]

Page 28: Automatic Extraction of Science and Medicine from the scholarly literature

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

Page 29: Automatic Extraction of Science and Medicine from the scholarly literature

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

Page 30: Automatic Extraction of Science and Medicine from the scholarly literature

SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain

PDF

HTML

XML quickscrape*

*Scrapers created by Richard Smith-Unna + Community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …

Non-standard per-publisher site
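Because every publisher site is laid out differently, one way to handle this is a small per-journal scraper definition that maps field names to XPath expressions. The sketch below assumes that idea only. The `SCRAPERS` table, the `journal-a.example` and `journal-b.example` hosts, and the `scrape` function are hypothetical and do not reproduce quickscrape's own scraper format. It uses the limited XPath subset in Python's standard library and requires well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-journal definitions: field -> (XPath, attribute or None for text).
SCRAPERS = {
    "journal-a.example": {
        "title": (".//meta[@name='citation_title']", "content"),
        "fulltext_pdf": (".//meta[@name='citation_pdf_url']", "content"),
    },
    "journal-b.example": {
        "title": (".//h1[@class='article-title']", None),
        "fulltext_pdf": (".//a[@class='pdf']", "href"),
    },
}


def scrape(xhtml, scraper):
    """Apply one journal's scraper definition to a well-formed XHTML page."""
    root = ET.fromstring(xhtml)
    result = {}
    for field, (path, attribute) in scraper.items():
        node = root.find(path)
        if node is None:
            result[field] = None          # field not present on this page
        elif attribute is None:
            result[field] = "".join(node.itertext()).strip()
        else:
            result[field] = node.get(attribute)
    return result


if __name__ == "__main__":
    page = (
        "<html><head>"
        "<meta name='citation_title' content='A study'/>"
        "<meta name='citation_pdf_url' content='http://journal-a.example/1.pdf'/>"
        "</head><body/></html>"
    )
    print(scrape(page, SCRAPERS["journal-a.example"]))
```

Real publisher HTML is often not well-formed, so a parser like this would fail on it; that gap is what the normalization step is meant to close.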

Page 31: Automatic Extraction of Science and Medicine from the scholarly literature

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain

NORMA-lization of Scientific Literature

PDFs, Broken HTML, PNGs for Math, etc.

NORMA

Unicode, Diacritics, Well-formed, Sectioned, Tagged, SVG diagrams
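Two of the normalization targets named above, consistent Unicode with composed diacritics and well-formed markup, can be illustrated with the standard library. This is a sketch of the general idea, not Norma's implementation; `normalize_text` and `is_well_formed` are names invented for this example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_text(text):
    """Compose diacritics, fold compatibility characters, collapse whitespace."""
    # NFKC turns "e" + combining acute into a single character and expands
    # ligatures such as U+FB01 (fi) into plain letters.
    composed = unicodedata.normalize("NFKC", text)
    return " ".join(composed.split())


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(normalize_text("de\u0301\ufb01  ni"))
    print(is_well_formed("<p>ok</p>"), is_well_formed("<p>broken<br></p>"))
```

One caveat: NFKC also flattens superscripts and similar characters (for example, a superscript 2 becomes a plain 2), which can change the meaning of units and formulae. NFC is the gentler choice when that matters.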

Page 32: Automatic Extraction of Science and Medicine from the scholarly literature

AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities

Page 33: Automatic Extraction of Science and Medicine from the scholarly literature

Text-based plugins

• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)

• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

• Templates and regexes (regular expressions).
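A bag of words and a tf-idf score can both be computed in a few lines. The sketch below uses one common tf-idf variant (relative term frequency times the natural log of N/df); other weightings exist, and the slide does not say which one the plugins use. The function names and sample sentences are illustrative.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic tokens, ignoring order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


if __name__ == "__main__":
    docs = [
        "the trial was randomised",
        "the trial was blinded",
        "the mice were randomised",
    ]
    for doc, score in zip(docs, tf_idf(docs)):
        top = max(score, key=score.get)
        print(f"{doc!r}: most distinctive word is {top!r}")
```

Words that occur in every document (such as "the" here) score zero, which is why tf-idf is a cheap way to surface the terms that distinguish one paper from another.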

Page 34: Automatic Extraction of Science and Medicine from the scholarly literature

“Bag of Words”

Three fulltext articles from trialsjournal.com

Page 35: Automatic Extraction of Science and Medicine from the scholarly literature

Regular Expressions for Systematic Reviews of Animal Tests

[Figure: regex matches shown with Preceding Text, Extracted term, and Following Text]

Page 36: Automatic Extraction of Science and Medicine from the scholarly literature

“nuggets” in a scientific paper

[Figure labels: quantity, units, value ranges, chemical, project, places]

Humans aren’t designed to mine this …
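Quantities, units, and value ranges are the kind of nugget a regular expression can pick out. The sketch below is a deliberately small illustration: the unit list and the `QUANTITY` pattern are assumptions made for this example, and a real extractor would need a far larger unit vocabulary.

```python
import re

# Hypothetical, very short unit list; longer alternatives come first so that
# "mg/kg" is not cut short as "mg".
QUANTITY = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"   # optional upper bound
    r"\s*(?P<unit>mg/kg|mg|kg|g|ml|°C|h|min)\b"
)


def extract_quantities(text):
    """Return (low, high, unit) tuples; high is None for single values."""
    results = []
    for match in QUANTITY.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else None
        results.append((low, high, match.group("unit")))
    return results


if __name__ == "__main__":
    sentence = "Rats (250-300 g) received 10 mg/kg daily for 14 days at 37 °C for 2 h."
    for quantity in extract_quantities(sentence):
        print(quantity)
```

Note that "14 days" is skipped because "days" is not in the toy unit list; this shows how completely recall depends on the vocabulary supplied.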

Page 37: Automatic Extraction of Science and Medicine from the scholarly literature

http://chemicaltagger.ch.cam.ac.uk/

• Typical chemical synthesis

Page 38: Automatic Extraction of Science and Medicine from the scholarly literature

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 39: Automatic Extraction of Science and Medicine from the scholarly literature
Page 40: Automatic Extraction of Science and Medicine from the scholarly literature

[Plot text recovered by OCR: y-axis “Ln Bacterial load per fly”, tick values from 11.5 down to 6.0; x-axis “Days post-infection”, ticks 0 to 5]

Bitmap Image and Tesseract OCR

Page 41: Automatic Extraction of Science and Medicine from the scholarly literature
Page 42: Automatic Extraction of Science and Medicine from the scholarly literature

[Figure labels: UNITS, TICKS, QUANTITY, SCALE, TITLES; DATA!! 2000+ points]

VECTOR PDF
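Once the tick marks and their values have been read from a vector plot, turning drawn point positions back into data is a linear calibration per axis. The sketch below assumes linear axes and uses made-up page coordinates; `make_axis_mapper` and `points_to_data` are hypothetical helper names, not part of any ContentMine tool.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Build a function mapping a page coordinate to a data value.

    Two known ticks (coordinate and value) define the straight line.
    """
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value


def points_to_data(points, x_axis, y_axis):
    """Convert (x, y) page coordinates into (x, y) data values."""
    to_x = make_axis_mapper(*x_axis)
    to_y = make_axis_mapper(*y_axis)
    return [(to_x(px), to_y(py)) for px, py in points]


if __name__ == "__main__":
    # Made-up calibration: x ticks 0 and 5 drawn at 100 and 600; y ticks 6.0 and
    # 11.5 drawn at 400 and 70 (page y typically grows downwards).
    x_axis = (100.0, 0.0, 600.0, 5.0)
    y_axis = (400.0, 6.0, 70.0, 11.5)
    drawn = [(100.0, 400.0), (350.0, 235.0), (600.0, 70.0)]
    for x, y in points_to_data(drawn, x_axis, y_axis):
        print(f"{x:.2f}, {y:.2f}")
```

A logarithmic axis would need the same mapping applied to the logarithm of the tick values instead.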

Page 43: Automatic Extraction of Science and Medicine from the scholarly literature

[Diagram labels: Dumb PDF, automatic extraction, CSV, Semantic Spectrum, smoothing (Gaussian filter), 2nd derivative]
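Gaussian smoothing followed by a second derivative is a standard way to locate peaks in an extracted spectrum, since a peak shows up as a strongly negative second derivative. The slide names these steps but does not show the processing itself, so the pure-Python sketch below is only one plausible form; the function names, kernel radius, and edge handling are assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve values with a Gaussian kernel, repeating the end values at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index to the data range
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def second_derivative(values, step=1.0):
    """Central finite difference; the result is two points shorter than the input."""
    return [
        (values[i - 1] - 2 * values[i] + values[i + 1]) / (step * step)
        for i in range(1, len(values) - 1)
    ]


if __name__ == "__main__":
    spectrum = [0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
    curvature = second_derivative(smooth(spectrum))
    # Add 1 because the derivative list starts at the second data point.
    peak_index = curvature.index(min(curvature)) + 1
    print("peak at index", peak_index)
```

Smoothing first matters because differentiating noisy data amplifies the noise; the kernel width trades noise suppression against blurring of closely spaced peaks.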

Page 44: Automatic Extraction of Science and Medicine from the scholarly literature

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.


Page 46: Automatic Extraction of Science and Medicine from the scholarly literature

Peter Murray-Rust

BMC publisher

Blue Obelisk paper (20 co-authors)

Sub-network From CATalog

Page 47: Automatic Extraction of Science and Medicine from the scholarly literature

Phytochemistry extraction

O. dayi

“volatile composition of”

A. sibeiri

A. judaica

Displayed by CAT (CottageLabs)

Page 48: Automatic Extraction of Science and Medicine from the scholarly literature

Problems

• Cannot do handwriting
• Scanned documents give poorer results
• The older the document, the poorer the result
• Tables are a major problem
• Always try to get the original document
• XML is better than Word, which is better than PDF
• Vector images >> PNG > JPEG
• Maths and chemistry are specialist areas

Page 49: Automatic Extraction of Science and Medicine from the scholarly literature

contentmine.org proposed Services

• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data
• Creation of community-led groups

Page 50: Automatic Extraction of Science and Medicine from the scholarly literature

contentmine.org team