The ContentMine Scraping Stack: how we'll harvest 100,000,000 facts from the scientific...
DESCRIPTION
The ContentMine project (http://contentmine.org) will harvest 100 million facts from the literature. Here we summarise the technology stack we're building to enable the first step: collecting the literature. This presentation was given with a paper (https://github.com/Blahah/scraperJSON-demo-paper) at WOSP 2014.
TRANSCRIPT
![Page 1: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/1.jpg)
The ContentMine Scraping Stack
Richard Smith-Unna, Peter Murray-Rust
University of Cambridge
![Page 2: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/2.jpg)
Our mission
“make 100,000,000 facts from the scholarly literature open, accessible and reusable”
![Page 3: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/3.jpg)
The scale of the task
• ~27,000 peer-reviewed journals (Ulrich's)
• > 5,000 publishers
• new papers every day
![Page 4: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/4.jpg)
The pipeline
![Page 5: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/5.jpg)
scraperJSON
• scrapers all have the same plumbing
• ignore the plumbing, just configure
• benefits
  • supports large collections of scrapers
  • no programming required
  • not limited to one piece of software
![Page 6: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/6.jpg)
Basic scraperJSON
{
  "name": "PLOS",
  "url": "plos\\w*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
• "name": name of the scraper
• "url": the URL(s) it applies to
• "elements": the elements to capture
• "title": element name
• "selector": where to find it
http://github.com/ContentMine/scraperJSON
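To make the roles of "name" and "url" concrete, here is a minimal Python sketch of how a tool might load a definition like the one on this slide and decide whether it applies to a given address. It is not how thresher works internally; the function names `load_scraper` and `scraper_applies` and both example URLs are hypothetical, and treating "url" as a regular expression searched anywhere in the address is an assumption based on the pattern shown.

```python
import json
import re

# The definition from the slide, written as valid JSON (straight quotes,
# no trailing comma). The raw string keeps the JSON escape "\\w" intact.
SCRAPER_JSON = r'''
{
  "name": "PLOS",
  "url": "plos\\w*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
'''


def load_scraper(text):
    """Parse a definition and check the three top-level keys shown on the slide."""
    definition = json.loads(text)
    for key in ("name", "url", "elements"):
        if key not in definition:
            raise ValueError(f"scraper definition is missing '{key}'")
    return definition


def scraper_applies(definition, url):
    """Assumption: the 'url' value is a regex searched anywhere in the URL."""
    return re.search(definition["url"], url) is not None


scraper = load_scraper(SCRAPER_JSON)
# Both URLs below are made up for illustration.
print(scraper_applies(scraper, "http://www.plosone.org/article/example"))
print(scraper_applies(scraper, "https://peerj.com/articles/example"))
```

Because the definition is plain data, the same file could in principle be consumed by any tool that understands the format, which is the "not limited to one piece of software" benefit listed earlier.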
![Page 7: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/7.jpg)
![Page 8: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/8.jpg)
![Page 9: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/9.jpg)
![Page 10: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/10.jpg)
![Page 11: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/11.jpg)
bibJSON output
{
"title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation"
}
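The step from a selector to output like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: the page below is a made-up, well-formed fragment, the `scrape` function is hypothetical, and the standard library's ElementTree supports only a subset of XPath, so the leading `//` is rewritten to `.//`. A real scraper would need an HTML parser and a full XPath engine.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an article page.
PAGE = """<html><body>
<h1 property="dc:title">An example article title</h1>
<p>Body text.</p>
</body></html>"""

# The "elements" part of the scraper definition from the earlier slide.
ELEMENTS = {"title": {"selector": "//h1[@property='dc:title']"}}


def scrape(page_text, elements):
    """Apply each element's selector to the page and collect the text found."""
    root = ET.fromstring(page_text)
    results = {}
    for name, spec in elements.items():
        selector = spec["selector"]
        # ElementTree only accepts relative paths, so "//h1" becomes ".//h1".
        path = "." + selector if selector.startswith("//") else selector
        node = root.find(path)
        if node is not None:
            results[name] = "".join(node.itertext()).strip()
    return results


output = scrape(PAGE, ELEMENTS)
print(json.dumps(output, indent=2))
```

The result is a flat mapping from element name to captured text, which has the same shape as the output shown on this slide.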
![Page 12: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/12.jpg)
thresher & quickscrape
• reference implementation of scraperJSON
• thresher is the scraping library
  • http://github.com/ContentMine/thresher
• quickscrape is the command-line tool
  • http://github.com/ContentMine/quickscrape
• Node.js, MIT licensed
![Page 13: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/13.jpg)
journal-scrapers
http://github.com/ContentMine/journal-scrapers
a self-testing collection of scraperJSON scrapers for academic journals
• PLOS
• MDPI
• PeerJ
• Wiley
• ScienceDirect
• Springer
• Taylor & Francis
• NPG, AAAS, RSC, ACS, …
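The slide does not say how the collection tests itself, so the following Python sketch shows only one plausible shape for such a check: each test case pairs a URL with the values a scraper is expected to capture, and a mismatch signals that a publisher's page layout has changed. Everything here is hypothetical, including the test-case layout, the `check_scraper` function, the example.org URLs, and the canned results that stand in for a real scraping run.

```python
def check_scraper(scrape_fn, test_cases):
    """Run each test case and collect (url, element, expected, actual) mismatches."""
    failures = []
    for case in test_cases:
        actual = scrape_fn(case["url"])
        for element, expected in case["expected"].items():
            if actual.get(element) != expected:
                failures.append((case["url"], element, expected, actual.get(element)))
    return failures


# Hypothetical stand-in for a real scraper run: canned results keyed by URL.
CANNED = {
    "http://example.org/article/1": {"title": "First example title"},
    "http://example.org/article/2": {"title": "Second example title"},
}


def fake_scrape(url):
    return CANNED.get(url, {})


TEST_CASES = [
    {"url": "http://example.org/article/1",
     "expected": {"title": "First example title"}},
    # This expectation is deliberately stale, to show a reported failure.
    {"url": "http://example.org/article/2",
     "expected": {"title": "A title the page no longer has"}},
]

failures = check_scraper(fake_scrape, TEST_CASES)
for url, element, expected, actual in failures:
    print(f"FAIL {url} {element}: expected {expected!r}, got {actual!r}")
```

Running checks like this on a schedule would flag broken scrapers early, which matters when the collection covers many publishers whose sites change independently.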
![Page 14: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/14.jpg)
Future work
• GUI (browser plugin) for creating scrapers
• Standalone GUI for scraping
![Page 15: The ContentMine Scraping Stack: how we'll harvest 100,00,000 facts from the scientific literature](https://reader033.vdocuments.net/reader033/viewer/2022060119/558e23461a28ab62048b4586/html5/thumbnails/15.jpg)
Acknowledgements
• Peter Murray-Rust
• Michelle Brook
• Mark MacGillivray
• Emanuil Tolev
• Ross Mounce
• Jenny Molloy
• Our volunteer community and collaborators
• Funding: Shuttleworth Foundation
http://contentmine.org http://github.com/ContentMine