Architecture of ContentMine Components contentmine.org

Upload: petermurrayrust

Post on 02-Aug-2015

TRANSCRIPT

Page 1: Architecture of ContentMine Components contentmine.org

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

Queues, Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

Page 2: Architecture of ContentMine Components contentmine.org

quickscrape, Crawl, Feed

Norma, Index & Transform

PDF

XML

URL

DOI

Scientific literature

Repositories DOC

CSV

sHTML

Plugins

Regex

Sequences, Species

Bespoke

Scrapers

XPath Per-Journal

Taggers Per-Journal

Metadata, Chemistry

Phylogenetics Farming

AMI

Bad HTML

OCR

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

Page 3: Architecture of ContentMine Components contentmine.org
Page 4: Architecture of ContentMine Components contentmine.org

Starting points

• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CMDir(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG) (quality: good)

• PDF, XML, HTML -> Norma -> CMDir(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR -> CMDir(sHTML, TXT, SVG) (quality: variable)
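The two starting routes above can be written down as data, which makes it easy to see what each route finally leaves in the CMDir. This is only a sketch of the slide's arrows: the names `Stage`, `Route`, `CRAWL_ROUTE` and `LOCAL_ROUTE` are hypothetical and are not part of quickscrape or Norma, and nothing here runs the real tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    tool: str        # tool named on the slide
    outputs: tuple   # what the slide says ends up in the CMDir


@dataclass(frozen=True)
class Route:
    inputs: tuple
    stages: tuple
    quality: str     # the slide's one-word rating of the route

    def final_outputs(self):
        """Contents of the CMDir after the last stage."""
        return self.stages[-1].outputs

    def describe(self):
        """Render the route in the slide's arrow notation."""
        parts = [",".join(self.inputs)]
        for stage in self.stages:
            parts.append(stage.tool)
            parts.append("CMDir(" + ",".join(stage.outputs) + ")")
        return " -> ".join(parts) + " [" + self.quality + "]"


# Route 1: identifiers found by search/crawl/feed, fetched by quickscrape.
CRAWL_ROUTE = Route(
    inputs=("PMCID", "DOI", "URL"),
    stages=(
        Stage("quickscrape", ("PDF", "HTML", "XML", "images/", "meta")),
        Stage("Norma", ("sHTML", "TXT", "SVG")),
    ),
    quality="good",
)

# Route 2: files already in hand, normalized and then OCR'd.
LOCAL_ROUTE = Route(
    inputs=("PDF", "XML", "HTML"),
    stages=(
        Stage("Norma", ("PDF", "rawHTML", "TXT", "images/", "meta?")),
        Stage("NormaOCR", ("sHTML", "TXT", "SVG")),
    ),
    quality="variable",
)

if __name__ == "__main__":
    for route in (CRAWL_ROUTE, LOCAL_ROUTE):
        print(route.describe())
```

Both routes end with the same three output types; the slide only distinguishes them by how reliable the result is.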

Page 5: Architecture of ContentMine Components contentmine.org

Conversions

• Paper -> Scanned -> TIFF (avoid)

• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable

• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish

• PDF -> PDF2TXT-N -> TXT: fast, variable

• PDF -> PDF2Image-N -> PNG: fast, accurate
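The conversions listed above amount to a small lookup table: given what you have and what you need, which converter applies, and how fast and how reliable is it? The sketch below records only what the slide states; the `Converter` tuple and `converters_for` helper are hypothetical names for illustration, not an API of any ContentMine tool.

```python
from typing import NamedTuple


class Converter(NamedTuple):
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# One row per conversion bullet on the slide (the "avoid" scan route is omitted).
CONVERTERS = [
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
]


def converters_for(source, target):
    """Names of converters that accept `source` and produce `target`, in slide order."""
    return [c.name for c in CONVERTERS
            if source in c.inputs and target in c.outputs]


if __name__ == "__main__":
    for name in converters_for("PDF", "SVG"):
        print(name)
```

For example, `converters_for("PDF", "SVG")` lists both Tesseract-N and PDF2SVG-N, which is where the slide's speed and accuracy notes become the deciding factor.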

Page 6: Architecture of ContentMine Components contentmine.org

Raw HTML: Not well-formed, Bad character semantics

ScholarlyHTML

Well-formed XHTML

PNG

Tagged Sections

Captioned Figures

Tables

Captioned Tables

XML; HtmlTidy, Jsoup, HtmlUnit

XSLT1/2

XSLT1/2

NORMA

Per-journal Stylesheets
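The first step on this slide, turning raw HTML that is not well-formed into well-formed XHTML, can be illustrated in a few lines. The slide names HtmlTidy, Jsoup and HtmlUnit for this job; the sketch below is a stand-in that uses only Python's standard library and handles just the simplest problems (unclosed tags, void elements, unescaped characters). The class and function names are hypothetical, and the later XSLT and per-journal stylesheet steps are not shown.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have a closing tag in HTML.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class WellFormedRebuilder(HTMLParser):
    """Re-emit parsed HTML so that every opened element is closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # currently open, non-void elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(" %s=%s" % (k, quoteattr(v or "")) for k, v in attrs)
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.stack.append(tag)
            self.out.append("<%s%s>" % (tag, attr_text))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray or void closing tag: drop it
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def tidy(raw_html):
    parser = WellFormedRebuilder()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)


def is_well_formed(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Once the markup parses as XML, it can be handed to an XSLT processor, which is the step the slide assigns to the per-journal stylesheets.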

Page 7: Architecture of ContentMine Components contentmine.org

End points

• Norma -> CMDir(OpenSHTML-SVG)

• Norma -> CMDir(sHTML, sections) -> AMI -> all text + species, chemistry, sequences

• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex,

• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots,

• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
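For the unsectioned TXT end point, the slide names two kinds of extraction: bagOfWords and regex. A minimal sketch of both, using only Python's standard library, is given below. It is not AMI's implementation; the function names and the DNA-like pattern are assumptions chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical example pattern: runs of ten or more DNA bases.
SEQUENCE_PATTERN = re.compile(r"\b[ACGT]{10,}\b")


def bag_of_words(text):
    """Count lowercase alphabetic tokens in plain text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def regex_hits(text, pattern=SEQUENCE_PATTERN):
    """Return every non-overlapping match of `pattern` in `text`."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "Primer ACGTACGTACGT was used; the primer binds the gene."
    print(bag_of_words(sample).most_common(2))
    print(regex_hits(sample))
```

Neither function needs section markup, which is why both can run on the unsectioned TXT output, whereas the species, chemistry and sequence extraction on the slide is listed against sectioned sHTML.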

Page 8: Architecture of ContentMine Components contentmine.org

PDF: Non-Unicode, Pixel glyphs, No words, No structures

ScholarlyHTML

SVG

High-level graphics

PDF2SVG

characters

Sentences, Paras, tables

PNG OCR

Tagged Sections

SVGBuilder

Captioned Figures

NORMA

XSLT1/2

Page 9: Architecture of ContentMine Components contentmine.org

NORMALIZE

Norma: Convert PDF, XML to sHTML; Tag sections

Normalized Scientific Literature

AMI: Index, Transform, Extract, Search

PDF2SVG, XSL stylesheets, Taggers

normalization Parameters

“Permanent” Filestore

Temporary Filestore

Extracted facts, indexes

Plugins, Regex

Page 10: Architecture of ContentMine Components contentmine.org

quickscrape, Norma, Index & Transform

PDF

XML

URL

DOI

DOC

CSV

sHTML

Plugins

Sequences, Species

Bespoke

Scrapers

XPath

Taggers

Per-Journal

Chemistry

Phylogenetics Farming

AMI

Bad HTML

OCR

Diagrams

CAT-alogue index

getpapers query

Titles + links

Daily Crawl/feed

EuPMC

JToCs

Page 11: Architecture of ContentMine Components contentmine.org

catalogue

getpapers

query

Daily Crawl

EuPMC, arXiv, CORE, HAL, (UNIV repos)

ToC services

PDF, HTML, DOC, ePUB, TeX, XML

PNG, EPS, CSV

XLS, URLs, DOIs

crawl

quickscrape

norma: Normalizer, Sectioner, Semantic Tagger

Text

Data, Figures

ami

UNIV Repos

search

Lookup, CONTENT MINING

Chem

Phylo

Trials

Crystal, Plants

COMMUNITY

plugins

Visualization and Analysis

PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…

Publisher Sites

scrapers, queries

taggers

abstract

methods

references

Captioned Figures

Fig. 1

HTML tables

30,000 pages/day

Semantic ScholarlyHTML

Facts
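The slides describe per-journal taggers that use XPath to label sections such as abstract, methods and references. The sketch below shows the general idea with Python's `xml.etree.ElementTree`, which supports a limited XPath subset. Everything specific in it is an assumption: the `class` attribute values, the `tag` attribute written onto matched elements, and the names `EXAMPLE_JOURNAL_TAGGER` and `tag_sections` are hypothetical and do not reflect Norma's actual tagger format.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagger for one journal: section name -> XPath expression.
EXAMPLE_JOURNAL_TAGGER = {
    "abstract": ".//div[@class='abstract']",
    "methods": ".//div[@class='methods']",
    "references": ".//ul[@class='references']",
}


def tag_sections(xhtml, tagger):
    """Label matching elements and return their text, grouped by section name."""
    root = ET.fromstring(xhtml)
    tagged = {}
    for name, xpath in tagger.items():
        for element in root.findall(xpath):
            element.set("tag", name)  # hypothetical marker attribute
            text = "".join(element.itertext()).strip()
            tagged.setdefault(name, []).append(text)
    return tagged


if __name__ == "__main__":
    page = (
        "<html><body>"
        "<div class='abstract'>We mined papers.</div>"
        "<div class='methods'>We used norma.</div>"
        "<ul class='references'><li>Ref 1</li></ul>"
        "</body></html>"
    )
    for section, texts in tag_sections(page, EXAMPLE_JOURNAL_TAGGER).items():
        print(section, texts)
```

Keeping the XPath expressions in a per-journal table means that supporting a new journal is a matter of adding data rather than changing the tagging code.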