Architecture of ContentMine components, contentmine.org

RSU: Richard Smith-Unna; PMR: Peter Murray-Rust; CL: CottageLabs

Queues Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

quickscrape Crawl Feed

Norma Index & Transform

Scientific literature

Repositories DOC

Plugins

Sequences Species

Bespoke

Scrapers

XPath Per-Journal

Taggers Per-Journal

Metadata Chemistry

Phylogenetics Farming

Bad HTML

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

Starting points

• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CMDir(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG): good

• PDF, XML, HTML -> Norma -> CMDir(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR -> CMDir(sHTML, TXT, SVG): variable
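
The two starting points above can be written down as data, which makes the chains easier to compare. This is only a sketch: the `Stage` class, the route names and the helper functions are hypothetical and are not part of quickscrape or Norma; only the stage names and file types are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step in a route: a tool (or source) and what it leaves behind."""
    name: str
    outputs: tuple


# First starting point on the slide (rated "good").
GOOD_ROUTE = (
    Stage("Search/Crawl/Feed", ("PMCID", "DOI", "URL")),
    Stage("quickscrape", ("PDF", "HTML", "XML", "images/", "meta")),
    Stage("Norma", ("sHTML", "TXT", "SVG")),
)

# Second starting point on the slide (rated "variable").
# "Input files" is a hypothetical label for the unnamed first step.
VARIABLE_ROUTE = (
    Stage("Input files", ("PDF", "XML", "HTML")),
    Stage("Norma", ("PDF", "rawHTML", "TXT", "images/", "meta?")),
    Stage("NormaOCR", ("sHTML", "TXT", "SVG")),
)


def describe(route):
    """Render a route as 'stage(outputs) -> stage(outputs)' text."""
    return " -> ".join(f"{s.name}({','.join(s.outputs)})" for s in route)


def final_formats(route):
    """Return the set of formats left in the CMDir by the last stage."""
    return set(route[-1].outputs)


if __name__ == "__main__":
    print(describe(GOOD_ROUTE))
    print(describe(VARIABLE_ROUTE))
```

Both routes end with the same normalized formats (sHTML, TXT, SVG); they differ in how they get there and in how reliable the result is.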

Conversions

• Paper -> Scanned -> TIFF (avoid)

• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable

• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish

• PDF -> PDF2TXT-N -> TXT: fast, variable

• PDF -> PDF2Image-N -> PNG: fast, accurate
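
The conversion list can be read as a small lookup table. The sketch below assumes that each speed/accuracy note on the slide belongs to the converter named just before it; the `CONVERSIONS` table and the `candidates` helper are hypothetical names used for illustration, not an existing Norma API.

```python
# (converter, accepted inputs, produced outputs, speed, quality),
# transcribed from the slide under the assumption stated above.
CONVERSIONS = [
    ("Tesseract-N", {"PDF", "TIFF", "PNG"}, {"HTML", "SVG"}, "fast", "variable"),
    ("PDF2SVG-N", {"PDF"}, {"sHTML", "SVG", "images/"}, "slow", "accurate-ish"),
    ("PDF2TXT-N", {"PDF"}, {"TXT"}, "fast", "variable"),
    ("PDF2Image-N", {"PDF"}, {"PNG"}, "fast", "accurate"),
]


def candidates(source, target):
    """List converters that accept `source` and can produce `target`."""
    return [
        name
        for name, inputs, outputs, _speed, _quality in CONVERSIONS
        if source in inputs and target in outputs
    ]


def rating(converter):
    """Return the (speed, quality) note recorded for a converter."""
    for name, _inputs, _outputs, speed, quality in CONVERSIONS:
        if name == converter:
            return speed, quality
    raise KeyError(converter)


if __name__ == "__main__":
    for name in candidates("PDF", "SVG"):
        print(name, rating(name))
```

For PDF to SVG the table offers two candidates, so the choice comes down to the trade-off the slide records: a fast route with variable results or a slow route that is more accurate.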

Raw HTML: Not well-formed, Bad character semantics

ScholarlyHTML

Well-formed XHTML

Tagged Sections

Captioned Figures

Tables

Captioned Tables

XML, HtmlTidy, Jsoup, HtmlUnit

XSLT1/2

Per-journal Stylesheets

End points

• Norma -> CMDir(OpenSHTML-SVG)

• Norma -> CMDir(sHTML, sections) -> AMI -> (all text + species, chemistry, sequences)

• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex

• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots

• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
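
The end points amount to a mapping from the normalized format in the CMDir to what AMI can extract from it. A minimal sketch of that mapping follows; the dictionary and function names are hypothetical, and the entries are copied from the slide rather than from AMI itself.

```python
# Normalized format in the CMDir -> extraction targets listed on the slide.
AMI_TARGETS = {
    "sHTML": ["all text", "species", "chemistry", "sequences"],
    "TXT": ["bagOfWords", "regex"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}


def formats_supporting(target):
    """Return the formats from which a given target can be extracted."""
    return sorted(fmt for fmt, targets in AMI_TARGETS.items() if target in targets)


if __name__ == "__main__":
    for target in ("chemistry", "phylo", "regex"):
        print(target, "<-", formats_supporting(target))
```

Read this way, chemistry can come from either sectioned sHTML or SVG, while phylogenetic trees and plots need an image-like format (PNG or SVG).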

PDF: Non-Unicode, Pixel glyphs, No words, No structures

ScholarlyHTML

High-level graphics

PDF2SVG

characters

Sentences, Paras, tables

PNG OCR

Tagged Sections

SVGBuilder

Captioned Figures

XSLT1/2

NORMALIZE

Norma: Convert PDF, XML to sHTML; Tag sections

Normalized Scientific Literature

AMI: Index, Transform, Extract, Search

PDF2SVG, XSL stylesheets, Taggers

normalization Parameters

“Permanent” Filestore

Temporary Filestore

Extracted facts, indexes

Plugins, Regex

quickscrape Norma Index & Transform

Plugins

Sequences Species

Bespoke

Scrapers

Taggers

Per-Journal

Chemistry

Phylogenetics Farming

Bad HTML

Diagrams

CAT-alogue index

getpapers query

Titles + links

Daily Crawl/feed

catalogue

getpapers

Daily Crawl

EuPMC, arXiv, CORE, HAL, (UNIV repos)

ToC services

PDF HTML DOC ePUB TeX XML

PNG EPS CSV

XLS URLs DOIs

quickscrape

norma: Normalizer, Sectioner, Semantic Tagger

Data Figures

UNIV Repos

search

Lookup CONTENTMINING

Trials

Crystal Plants

COMMUNITY

plugins

Visualization and Analysis

PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…

Publisher Sites

scrapers queries

taggers

abstract

methods

references

Captioned Figures

Fig. 1

HTML tables

30,000 pages/day Semantic ScholarlyHTML
