Mining facts from the plant science literature

Mining science from the plant literature. ContentMine. Open Plant Forum, Norwich, UK, 2016-07-27. Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine. 10,000 scholarly publications every day. How many relate to plants?

Upload: petermurrayrust

Post on 19-Jan-2017

TRANSCRIPT

Page 1: Mining facts from the plant science literature

Mining science from the plant literature

ContentMine

Open Plant Forum, Norwich, UK, 2016-07-27

Peter Murray-Rust [1] University of Cambridge [2] TheContentMine

10,000 scholarly publications every day. How many relate to plants?

Page 2

(2x digital music industry!)

Non-profit

Downloading several thousand papers per day and making search results open for everyone

http://contentmine.org

Downloadable Open source

Page 3

MozFest 2015

ContentMine + TGAC / hack

Page 4

Terpinome Phytochemists!

Salvia officinalis

Salvia microphylla

Origanum vulgare

Ocimum basilicum

Laurus nobilis [1]

[1] Lauraceae

Page 5

We can search for

• Plants
• Compounds
• Other species
• Diseases
• Frequent terms

• We’ll need: sources, dictionaries, software
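
The idea of searching text against dictionaries can be sketched in a few lines of Python. This is only an illustration, not the ContentMine implementation: the dictionary contents, the `count_terms` function and the sample sentence are all hypothetical, with the species and compound names borrowed from these slides.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real dictionaries would be far larger.
DICTIONARIES = {
    "plants": ["Salvia officinalis", "Origanum vulgare",
               "Ocimum basilicum", "Laurus nobilis"],
    "compounds": ["carvone"],
}

def count_terms(text, dictionaries=DICTIONARIES):
    """Return {dictionary name: Counter of matched terms} for one text."""
    results = {}
    for name, terms in dictionaries.items():
        counts = Counter()
        for term in terms:
            # Whole-word, case-insensitive match of the dictionary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = re.findall(pattern, text, flags=re.IGNORECASE)
            if hits:
                counts[term] = len(hits)
        results[name] = counts
    return results

# Made-up sentence, used only to exercise the function.
sample = ("Carvone was detected in Origanum vulgare and in "
          "Ocimum basilicum; carvone levels varied.")
facets = count_terms(sample)
```

Each dictionary yields its own counter, which is roughly the shape needed for per-article facets such as species, compounds and diseases.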

Page 6

Europe PubMed Central

Over 1 million biomedical papers

Page 7
Page 8

Dictionaries!

Diseases (WHO)

Page 9
Page 10
Page 11

CONTENTMINE SOFTWARE (diagram; latest 20150908)

Sources: catalogue, getpapers, query, DailyCrawl, Crossref; EuPMC, arXiv, CORE, HAL, (UNIV repos)
Publisher sites: PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…
Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
quickscrape: scrapers, queries
norma: Normalizer, Structurer, Semantic Tagger, taggers
Content: Text, Data, Figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables
Semantic ScholarlyHTML (W3C community group): 100,000 pages/day
ami: search, Lookup, plugins (Chem, Phylo, Trials, Crystal, Plants)
CONTENT MINING, COMMUNITY, Visualization and Analysis, Facts

Page 12

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone


Page 13

https://en.wikipedia.org/wiki/Carvone

WIKIDATA

Page 14

Carvone in Wikidata

Also SPARQL endpoint

WP identifier

Chemical type

Chemical identifier
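
As a rough illustration of using the SPARQL endpoint, the sketch below builds (but does not send) a request that looks up Wikidata items by English label. The endpoint URL is the public Wikidata Query Service; the query shape and the helper names `label_query` and `query_url` are assumptions made for this example, not something shown on the slide.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def label_query(label, lang="en"):
    """SPARQL that finds Wikidata items whose label equals `label`."""
    return (
        "SELECT ?item WHERE { "
        f'?item rdfs:label "{label}"@{lang} . '
        "} LIMIT 10"
    )

def query_url(label):
    """Build a GET URL for the endpoint; nothing is sent over the network."""
    params = {"query": label_query(label), "format": "json"}
    return WIKIDATA_SPARQL + "?" + urlencode(params)

url = query_url("carvone")
print(url)
```

Fetching that URL with any HTTP client should return JSON bindings for the matching items, from which the chemical type and identifiers could then be requested.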

Page 15

ARTICLES FACETS

gene disease drug Phytochem

species genus words

Page 16

Suggest the title of this article

Page 17
Page 18
Page 19

species words

drug Phytochem disease

Page 20

species words

drug Phytochem disease

disease

Page 21
Page 22
Page 23
Page 24
Page 25

(2x digital music industry!)

Non-profit

Downloading several thousand papers per day and making search results open for everyone

http://contentmine.org

Downloadable Open source

Page 26

end

Page 27

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search for “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send annotations to hypothes.is
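
The first two steps on this slide could be scripted along the following lines. This is a sketch under stated assumptions: it only assembles and prints the command lines, using the flags exactly as given on the slide (written with plain ASCII hyphens), and the helper names are hypothetical. Running them would require getpapers and cmine to be installed.

```python
import shlex

def getpapers_command(query, outdir, limit=100):
    """Argument list for the getpapers call on the slide
    (-q query, -o output directory, -x XML, -k hit limit)."""
    return ["getpapers", "-q", query, "-o", outdir, "-x", "-k", str(limit)]

def cmine_command(project_dir):
    """Argument list for the cmine call on the slide."""
    return ["cmine", project_dir]

steps = [getpapers_command("carvone", "carvone"), cmine_command("carvone")]

for step in steps:
    # To execute for real, one could call subprocess.run(step, check=True).
    print(shlex.join(step))
```

Keeping each command as a list of arguments avoids shell quoting problems when the query contains spaces, for example "diseases of oats".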

Page 28

Annotation (entity in context)

prefix

surface

label

location

suffix
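
One way to picture an "entity in context" annotation is as a small record holding the matched text plus the text on either side, much like the prefix/exact/suffix of a W3C text quote selector. The sketch below is an assumption-laden illustration: the field names follow the slide, but the meanings given to `label` and `location`, the `annotate` function and the sample sentence are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Annotation:
    prefix: str    # text just before the match
    surface: str   # the matched text as it appears in the paper
    label: str     # assumed here: the dictionary the term came from
    location: int  # assumed here: character offset of the match
    suffix: str    # text just after the match

def annotate(text, term, label, window=20):
    """Find each case-insensitive occurrence of `term` with its context."""
    found = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        found.append(Annotation(
            prefix=text[max(0, m.start() - window):m.start()],
            surface=m.group(0),
            label=label,
            location=m.start(),
            suffix=text[m.end():m.end() + window],
        ))
    return found

# Made-up sentence, used only to exercise the function.
text = "Essential oil of Origanum vulgare contains carvone and thymol."
hits = annotate(text, "carvone", "compound")
```

Storing the surrounding context lets an annotation be re-anchored in the document even if character offsets shift slightly.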

Page 29
Page 30

Search for carvone

Page 31

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search for “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send IUCN Red List plant annotations to hypothes.is

Page 32

Annotation (entity in context)

prefix

surface

label

location

suffix

Page 33

Facilitating synthetic biology literature mining and searching for the plant community

Robert Davey joined TGAC in February 2010 as the lead software engineer on the MISO LIMS project, which was released as an open source framework in June 2012. He went on to become the Core Bioinformatics Project Leader, and was then appointed Data Infrastructure and Algorithms (DIA) Group Leader in late 2012.

Ksenia Krasileva is a Group Leader with a joint appointment at The Genome Analysis Centre and The Sainsbury Laboratory. Ksenia joined Norwich Research Park in December 2014, moving from the University of California, Davis, where she held a fellowship from the National Institute of Food and Agriculture (NIFA) to develop functional genomic tools for wheat, working with Jorge Dubcovsky.

Nicola Patron is a molecular and synthetic biologist at The Sainsbury Laboratory (TSL), a world-leading research institute working on the science of plant-microbe interactions.

Richard Smith-Unna is a PhD student in Plant Sciences, Cambridge.

Peter Murray-Rust is a (retired but highly active) chemist at Cambridge University.

Report of 2-day workshop (hack) held at TGAC 2016-03-10/11

The workshop centered on novel methods for discovering information about plants from the existing literature (“Content Mining”). We prepared ContentMine software specifically for the workshop on the basis that “anyone can run it and get useful results”. Everyone was asked to install the software on whatever platform they commonly used (Mac, Windows, Unix). There were few problems and most people were running it within an hour. A typical example was “find all you can about diseases of oats” using Europe PubMed Central (with over 1 million Open Access papers). This retrieved about 500 papers, which were further filtered for chemicals, diseases, species, etc. and displayed within a minute or two, significantly increasing the speed of knowledge-driven scientific discovery. We also jointly made considerable improvements to the software and have agreed to meet regularly to take this forward.

Page 34