Web data from R
Web data acquisition with R
Scott Chamberlain
October 28, 2011
Why would you even need to do this?
Why not just get data through a browser?
Some use cases
• Reason 1: It just takes too dam* long to manually search/get data on a web interface
• Reason 2: Workflow integration
• Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
A few general methods of getting web data through R
• Read file – ideal if available
• HTML
• XML
• JSON
• APIs that serve up XML/JSON
Practice…read.csv (or xls, txt, etc.)
Get URL for file…see screenshot
url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"
mycsv <- read.csv(url)
mycsv
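A minimal sketch of the same pattern: read.csv() accepts either a URL or a local path, so the helper below wraps the call and is demonstrated on a temporary file so that it runs without a network connection. The helper name read_web_csv() and the toy rows are assumptions made for illustration, not part of the slides or of the Dryad file.

```r
# Hypothetical helper (not from the slides): read a CSV from a URL or a path.
read_web_csv <- function(src) {
  # stringsAsFactors = FALSE keeps text columns as character vectors
  read.csv(src, stringsAsFactors = FALSE)
}

# Offline stand-in for a remote file: toy rows invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count", "raven,3", "coyote,1"), tmp)

toy <- read_web_csv(tmp)
str(toy)

# With a live connection, the same call would take the Dryad URL above:
# mycsv <- read_web_csv(url)
```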
‘Scraping’ web data
• Why? When there is no API
– Can scrape XML, HTML, or JSON
– XML and JSON are easier formats to deal with from R
Scraping E.g. 1: XML
http://www.fishbase.org/summary/speciessummary.php?id=2
Scraping E.g. 1: XML
The summary XML page behind the rendered page…
Scraping E.g. 1: XML
We can process the XML ourselves using a bunch of lines of code…
Scraping E.g. 1: XML
…OR just use a package someone already created - rfishbase
And you get this nice plot
Practice…XML and JSON formats
Data from the USA National Phenology Network
install.packages(c("RCurl","XML","RJSONIO")) # if not installed already
require(RCurl); require(XML); require(RJSONIO)
XML Format
xmlurl <- 'http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.xml?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
xmlTreeParse(xmlout)[[1]][[1]]
JSON Format
jsonurl <- 'http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.json?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
fromJSON(jsonout)
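Once the text has been downloaded, the next step is usually to pull specific fields out of it. The sketch below shows one way to do that with the XML and RJSONIO packages. It runs offline on small inline strings. The element and field names (observation, station_id, year) are invented for illustration and are not the real National Phenology Network response schema.

```r
library(XML)
library(RJSONIO)

# Invented stand-ins for a web response; NOT the real NPN schema.
xml_txt <- paste0(
  "<observations>",
  "<observation><station_id>4881</station_id><year>2009</year></observation>",
  "<observation><station_id>4882</station_id><year>2009</year></observation>",
  "</observations>"
)
json_txt <- '[{"station_id": 4881, "year": 2009}, {"station_id": 4882, "year": 2009}]'

# XML: parse the string, then pull one field from every record with XPath.
doc <- xmlParse(xml_txt, asText = TRUE)
xml_stations <- xpathSApply(doc, "//observation/station_id", xmlValue)

# JSON: fromJSON() returns one R element per record in the array.
records <- fromJSON(json_txt, asText = TRUE)
json_stations <- sapply(records, function(rec) rec[["station_id"]])

print(xml_stations)
print(json_stations)
```

Note that the XML values come back as character strings, while the JSON values come back as numbers, so a conversion such as as.numeric() may be needed before combining the two.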
Scraping E.g. 2: HTML
All this code can produce something like…
Scraping E.g. 2: HTML
…this
Practice…scraping HTML
install.packages(c("XML","RCurl")) # if not already installed
require(XML); require(RCurl)
# Let's look at the raw HTML first
rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawhtml
# Scrape data from the website
rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawPMI
PMI <- data.frame(rawPMI[[1]])
names(PMI)[1] <- 'Year'
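To see what readHTMLTable() does without depending on the live page, the sketch below applies it to a tiny inline HTML table. The table contents are invented toy numbers, and nothing here assumes the layout of the real ISM page. The column names are set by hand, mirroring the renaming step in the practice code.

```r
library(XML)

# Toy page with invented numbers; the real ISM page layout is not assumed.
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>Year</th><th>Index</th></tr>",
  "<tr><td>2009</td><td>46.2</td></tr>",
  "<tr><td>2010</td><td>57.3</td></tr>",
  "</table></body></html>"
)

# Parse the HTML string, then convert every <table> into a data frame.
doc <- htmlParse(html_txt, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Take the first table, name its columns, and convert text to numbers.
toy <- tables[[1]]
names(toy) <- c("Year", "Index")
toy$Year <- as.integer(toy$Year)
toy$Index <- as.numeric(toy$Index)
str(toy)
```

Scraped cells arrive as text, so the explicit conversion at the end is usually needed before plotting or modelling.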
APIs (application programming interfaces)
• Many data sources have APIs, largely for talking to other web interfaces
– We can use their API from R
• An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
• One can write R code to interface with an API
– Keep in mind that some APIs require authentication keys
API Documentation
• API docs for the Integrated Taxonomic Information System (ITIS): http://www.itis.gov/ws_description.html
http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
Example: Simple call to API
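A simple API call mostly comes down to building a URL from a base address and a query parameter. The sketch below does that in base R for the ITIS searchByScientificName address shown on the previous slide. The helper name itis_search_url() is an assumption made for illustration, and the commented-out fetch step assumes the service is reachable and returns XML.

```r
# Base address and parameter name taken from the ITIS example URL in the slides.
itis_base <- "http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName"

# Hypothetical helper (not part of any package): build an ITIS search URL.
itis_search_url <- function(name) {
  # URLencode() escapes spaces and other reserved characters in the query value
  paste0(itis_base, "?srchKey=", URLencode(name, reserved = TRUE))
}

u1 <- itis_search_url("Tardigrada")
u2 <- itis_search_url("Ursus americanus")
print(u1)
print(u2)

# With a live connection, the response could then be fetched and parsed, e.g.:
# library(RCurl); library(XML)
# doc <- xmlParse(getURL(u1), asText = TRUE)
```

Wrapping the URL construction in a function makes it easy to loop over many species names, which is where calling an API from R starts to pay off over a browser.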
rOpenSci suite of R packages
• There are many packages on CRAN for specific data sources on the web – search on CRAN to find these
• rOpenSci is developing a lot of packages for as many open source data sources as possible
– Please use and give feedback…
Data Literature/metadata
http://ropensci.org/ , code at GitHub
Three examples of packages that interact with an API
API E.g. 1: Search literature: rplos
You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
API E.g. 2: Get taxonomic information for your study species: taxize
A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
API E.g. 3: Get some data: dryad
A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
Calling external programs from R
Why even think about doing this?
• Again, workflow integration
• It’s just easier to call X program from R if you are going to run many analyses with said program
E.g. 1: Phylometa…using the files in the dropbox
Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html
• On a Mac: doesn’t work because it’s a Windows .exe
– But system() can often run other external programs
• On Windows: system(paste('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt'), intern=TRUE)
NOTE: intern = TRUE captures the program's output as an R character vector (printed to the console if not assigned)
Should give you something like this
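Because Phylometa is a Windows-only executable, here is a portable sketch of the same system() pattern. It uses the Rscript binary that ships with R as a stand-in for the external program. That substitution is an assumption made purely so the example runs on any platform, and the variable names are illustrative.

```r
# Stand-in for an external program: the Rscript binary that ships with R.
rscript <- file.path(R.home("bin"), "Rscript")

# Build the command line; shQuote() protects paths or arguments with spaces.
cmd <- paste(shQuote(rscript), "-e", shQuote("print(1+1)"))

# intern = TRUE captures the program's standard output as a character vector.
out <- system(cmd, intern = TRUE)
out
```

Capturing the output in a variable, rather than letting it print, is what makes it possible to parse results from many runs inside one R workflow.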
Resources
• rOpenSci (development of R packages for all open source data and literature)
• CRAN packages (search for a data source)
• Tutorials/websites:
– http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
• Non-R based, but cool: http://ecologicaldata.org/