mining facts from the plant science literature
Mining science from the plant literature
ContentMine
Open Plant Forum, Norwich, UK, 2016-07-27
Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine
10,000 scholarly publications every day. How many relate to plants?
(2x digital music industry!)
Non-profit
Downloading several thousand papers per day and making search results open for everyone
http://contentmine.org
Downloadable Open source
MozFest 2015
ContentMine + TGAC / hack
Terpinome Phytochemists!
Salvia officinalis
Salvia microphylla
Origanum vulgare
Ocimum basilicum
Laurus nobilis [1]
[1] Lauraceae
We can search for
• Plants
• Compounds
• Other species
• Diseases
• Frequent terms
• We’ll need: sources, dictionaries, software
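A dictionary-driven search of this kind can be sketched in a few lines of Python. This is only an illustration, not ContentMine's implementation: the mini-dictionaries, the function name facet_counts and the sample sentence are all made up for the example (the species names are taken from the slides above; "limonene" is an added placeholder).

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real dictionaries would be far larger.
DICTIONARIES = {
    "plants": ["Salvia officinalis", "Ocimum basilicum", "Laurus nobilis"],
    "compounds": ["carvone", "limonene"],
}


def facet_counts(text, dictionaries):
    """Count case-insensitive whole-term matches of every dictionary entry."""
    counts = {}
    for facet, terms in dictionaries.items():
        counter = Counter()
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = re.findall(pattern, text, flags=re.IGNORECASE)
            if hits:
                counter[term] = len(hits)
        counts[facet] = counter
    return counts


# Made-up sentence standing in for the text of a paper.
sample = ("Carvone and limonene were detected in Salvia officinalis; "
          "carvone was absent from Ocimum basilicum.")
result = facet_counts(sample, DICTIONARIES)
for facet, counter in result.items():
    print(facet, counter.most_common())
```

Each facet (plants, compounds, ...) gets its own counter, so the most frequent terms per facet can be read off directly.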
Europe PubMedCentral
Over 1 million biomedical papers
Dictionaries!
Diseases (WHO)
[Diagram: CONTENTMINE SOFTWARE pipeline, latest 20150908. Labels recovered from the figure:]
Sources: catalogue, Crossref, DailyCrawl, EuPMC, arXiv, CORE, HAL, (UNIV repos)
Publisher sites: PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …
Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
Tools: getpapers (query), quickscrape, norma (Normalizer, Structurer, Semantic Tagger), ami (search, Lookup)
Normalized content: Text, Data, Figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables
Semantic ScholarlyHTML (W3C community group), 100,000 pages/day
CONTENT MINING: Chem, Phylo, Trials, Crystal, Plants
COMMUNITY: plugins, scrapers, queries, taggers
Visualization and Analysis
Facts
What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone
WIKIDATA
Carvone in Wikidata
Also SPARQL endpoint
WP identifier
Chemical type
Chemical identifier
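The question "What plants produce carvone?" could be put to the Wikidata SPARQL endpoint roughly as follows. This is a hedged sketch, not something shown in the talk: it assumes the compound can be found by its English label and that plant sources are recorded with property P703 ("found in taxon"); whether Wikidata holds such statements for carvone is not guaranteed. The helper names taxa_query and query_url are invented for the example.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def taxa_query(compound_label):
    """Build a SPARQL query for taxa linked to a compound via P703."""
    return f'''
SELECT ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{compound_label}"@en ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
'''.strip()


def query_url(compound_label):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": taxa_query(compound_label), "format": "json"}
    return WIKIDATA_SPARQL + "?" + urlencode(params)


url = query_url("carvone")
print(taxa_query("carvone"))
print(url)
```

The sketch only builds the query and its URL; fetching the URL (for example with urllib.request) is left out so the example runs without network access.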
ARTICLES FACETS
gene disease drug Phytochem
species genus words
Suggest the title of this article
species words
drug Phytochem disease
disease
end
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits
• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
  Send annotations -> hypothes.is
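The first two steps can be scripted. The sketch below only assembles the command lines shown on the slide (the flags -q, -o, -x and -k are taken from it); the function names are invented for the example, and the cmhypy.py step is left out because it needs a personal key.

```python
import shlex


def getpapers_command(query, outdir, limit=100):
    """Argument list for the getpapers call shown on the slide."""
    return ["getpapers", "-q", query, "-o", outdir, "-x", "-k", str(limit)]


def cmine_command(project_dir):
    """Argument list for the cmine call shown on the slide."""
    return ["cmine", project_dir]


steps = [getpapers_command("carvone", "carvone"), cmine_command("carvone")]
for step in steps:
    # Print the commands; nothing is executed here.
    print(shlex.join(step))
```

With getpapers and cmine installed, each list could be handed to subprocess.run(step, check=True) to execute the steps in order.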
Annotation (entity in context)
prefix
surface
label
location
suffix
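An "entity in context" record with these five fields might be built as below. This is a sketch under assumptions: "surface" is taken to mean the matched text as it appears in the paper and "location" to mean character offsets, neither of which the slide defines; the function name annotate and the sample sentence are made up.

```python
import re


def annotate(text, term, label, window=20):
    """Return one record per match of `term`, with its surrounding context."""
    records = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start, end = match.span()
        records.append({
            "prefix": text[max(0, start - window):start],  # text before the hit
            "surface": match.group(0),                     # the hit as written
            "label": label,                                # dictionary/facet name
            "location": (start, end),                      # character offsets
            "suffix": text[end:end + window],              # text after the hit
        })
    return records


# Made-up sentence standing in for a paper.
text = "Essential oil of spearmint is rich in carvone, a monoterpene ketone."
records = annotate(text, "carvone", "compound", window=10)
for record in records:
    print(record)
```

Keeping prefix and suffix alongside the match lets a reader judge each hit without opening the paper.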
Search for carvone
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits
• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
  Send IUCN Red List plant annotations -> hypothes.is
Facilitating synthetic biology literature mining and searching for the plant community
Robert Davey joined TGAC in February 2010 as the lead software engineer on the MISO LIMS project, which was released as an open source framework in June 2012. He went on to become the Core Bioinformatics Project Leader and was appointed Data Infrastructure and Algorithms (DIA) Group Leader in late 2012.
Ksenia Krasileva is a Group Leader with a joint appointment at The Genome Analysis Centre and The Sainsbury Laboratory. She joined Norwich Research Park in December 2014, moving from the University of California, Davis, where she held a fellowship from the National Institute of Food and Agriculture (NIFA) to develop functional genomic tools for wheat, working with Jorge Dubcovsky.
Nicola Patron is a molecular and synthetic biologist at The Sainsbury Laboratory (TSL), a world-leading research institute working on the science of plant-microbe interactions.
Richard Smith-Unna is a PhD student in Plant Sciences, Cambridge.
Peter Murray-Rust is a (retired but highly active) chemist at Cambridge University.
Report of 2-day workshop (hack) held at TGAC 2016-03-10/11
The workshop centered on novel methods for discovering information about plants from the existing literature (“Content Mining”). We prepared ContentMine software specifically for the workshop on the basis that “anyone can run it and get useful results”. Everyone was asked to install the software on whatever platform they commonly used (Mac, Windows, Unix). There were few problems, and most people were up and running within an hour. A typical task was “find all you can about diseases of oats” using Europe PubMedCentral (with over 1 million Open Access papers). This retrieved about 500 papers, which were then filtered for chemicals, diseases, species, etc. and displayed within a minute or two, significantly speeding up knowledge-driven scientific discovery. We also jointly made considerable improvements to the software and have agreed to meet regularly to take this forward.