Mining facts from the plant science literature

Mining science from the plant literature

ContentMine

Open Plant Forum, Norwich, UK, 2016-07-27

Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine

10,000 scholarly publications every day. How many relate to plants?

(2x digital music industry!)

Non-profit

Downloading several thousand papers per day and making search results open for everyone

http://contentmine.org

Downloadable Open source

MozFest 2015

ContentMine + TGAC / hack

Terpinome Phytochemists!

Salvia officinalis

Salvia microphylla

Origanum vulgare

Ocimum basilicum

Laurus nobilis [1]

[1] Lauraceae

We can search for

• Plants
• Compounds
• Other species
• Diseases
• Frequent terms

• We’ll need: sources, dictionaries, software

Europe PubMedCentral

Over 1 million biomedical papers

Dictionaries!

Diseases (WHO)

CONTENTMINE SOFTWARE (diagram, latest 20150908)

• Sources: EuPMC, arXiv, CORE, HAL, (university repositories), Crossref
• Publisher sites: PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …
• Retrieval: catalogue, query, DailyCrawl, getpapers, quickscrape (scrapers, queries)
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma: Normalizer, Structurer, Semantic Tagger (taggers); text, data, figures
• Document sections: abstract, methods, references, captioned figures (Fig. 1), HTML tables
• 100,000 pages/day; Semantic ScholarlyHTML (W3C community group)
• ami: search, lookup, plugins (Chem, Phylo, Trials, Crystal, Plants); community
• CONTENTMINING output: facts; visualization and analysis

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone




WIKIDATA

Carvone in Wikidata

Also SPARQL endpoint

WP identifier

Chemical type

Chemical identifier
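The question "What plants produce carvone?" can be put to the Wikidata SPARQL endpoint mentioned above. The following is a minimal sketch, not the ContentMine implementation. It assumes that the compound can be found by its English label and that the Wikidata property P703 ("found in taxon") links it to plant taxa; the function names are hypothetical, and the results depend entirely on what Wikidata currently holds.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def taxa_query(compound_label, limit=50):
    """Build a SPARQL query for taxa linked to a compound via P703.

    Assumption: the compound item carries exactly this English label.
    """
    return f'''
SELECT ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{compound_label}"@en ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
'''.strip()


def request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    return WIKIDATA_SPARQL + "?" + urlencode({"query": query, "format": "json"})


query = taxa_query("carvone")
url = request_url(query)
print(query)
```

The URL in `url` can be fetched with any HTTP client; the sketch stops short of the network call so that it runs offline.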

ARTICLES FACETS

gene disease drug Phytochem

species genus words

Suggest the title of this article

species words

drug Phytochem disease


disease


Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs. Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send annotations -> hypothes.is
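The first step above can also be driven from a script. The sketch below assembles the same getpapers call as an argument list, using only the flags shown on the slide (-q, -o, -x, -k). The helper names are hypothetical, and the command is only executed if getpapers is installed and on the PATH.

```python
import shlex
import shutil
import subprocess


def build_getpapers_command(query, outdir, limit=100):
    """Assemble the getpapers call from the slide as an argument list."""
    return [
        "getpapers",
        "-q", query,       # search term
        "-o", outdir,      # output directory
        "-x",              # fetch full-text XML
        "-k", str(limit),  # cap on the number of hits
    ]


def run_if_available(command):
    """Run the command only when its executable is on PATH.

    Returns True if the command was run, False otherwise.
    """
    if shutil.which(command[0]) is None:
        print("not installed, would run:", shlex.join(command))
        return False
    subprocess.run(command, check=True)
    return True


command = build_getpapers_command("carvone", "carvone")
print(shlex.join(command))
```

Calling `run_if_available(command)` starts the download; passing the arguments as a list avoids shell quoting problems with multi-word queries.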

Annotation (entity in context): prefix, surface, label, location, suffix
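An "entity in context" record of this kind can be illustrated with a small dictionary lookup. This is a sketch under stated assumptions, not the ami or cmine code: the `annotate` function, the two-entry dictionary, and the example sentence are all made up for illustration, and matching is a plain case-insensitive whole-word search.

```python
import re


def annotate(text, dictionary, context=30):
    """Return one annotation per dictionary hit, with surrounding context.

    dictionary maps a label (e.g. "species") to a list of terms.
    """
    annotations = []
    for label, terms in dictionary.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                start, end = match.span()
                annotations.append({
                    "label": label,                  # dictionary the term came from
                    "surface": match.group(0),       # text exactly as it appears
                    "location": (start, end),        # character offsets
                    "prefix": text[max(0, start - context):start],
                    "suffix": text[end:end + context],
                })
    # Report hits in reading order.
    return sorted(annotations, key=lambda a: a["location"])


# Illustrative sentence and dictionary, not taken from a real paper.
text = "Carvone is abundant in the essential oil of Mentha spicata."
dictionary = {"compound": ["carvone"], "species": ["Mentha spicata"]}
hits = annotate(text, dictionary, context=10)
for hit in hits:
    print(hit)
```

The prefix and suffix make each hit readable on its own and help re-anchor the annotation if the character offsets shift between versions of a document.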

Search for carvone

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs. Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send IUCN redlist plant annotations -> hypothes.is


Facilitating synthetic biology literature mining and searching for the plant community

Robert Davey joined TGAC in February 2010 as lead software engineer on the MISO LIMS project, which was released as an open source framework in June 2012. He went on to become Core Bioinformatics Project Leader and was appointed Data Infrastructure and Algorithms (DIA) Group Leader in late 2012.

Ksenia Krasileva is a Group Leader with a joint appointment at The Genome Analysis Centre and The Sainsbury Laboratory. She joined Norwich Research Park in December 2014 from the University of California, Davis, where she held a fellowship from the National Institute of Food and Agriculture (NIFA) to develop functional genomic tools for wheat, working with Jorge Dubcovsky.

Nicola Patron is a molecular and synthetic biologist at The Sainsbury Laboratory (TSL), a world-leading research institute working on the science of plant-microbe interactions.

Richard Smith-Unna is a PhD student in Plant Sciences, Cambridge.

Peter Murray-Rust is a (retired but highly active) chemist at Cambridge University.

Report of 2-day workshop (hack) held at TGAC 2016-03-10/11

The workshop centered on novel methods for discovering information about plants from the existing literature (“Content Mining”). We prepared ContentMine software specifically for the workshop on the basis that “anyone can run it and get useful results”. Everyone was asked to install the software on whatever platform they commonly used (Mac, Windows, Unix). There were few problems, and most people were running it within an hour. A typical exercise was “find all you can about diseases of oats” using Europe PubMed Central (with over 1 million Open Access papers). This retrieved about 500 papers, which were then filtered for chemicals, diseases, species, etc. and displayed within a minute or two, significantly increasing the speed of knowledge-driven scientific discovery. We also jointly made considerable improvements to the software and have agreed to meet regularly to take this forward.