data handling in r

40
Data Handling in R Data Handling in R Getting,Reading and Cleaning data Abhik Seal Indiana University School of Informatics and Computing(dsdht.wikispaces.com)

Upload: abhik-seal

Post on 28-Nov-2014

316 views

Category:

Technology


6 download

DESCRIPTION

The slide shows a full gist of reading different types of data in R thanks to coursera it was much comprehensive and i made some additional changes too.

TRANSCRIPT

Page 1: Data handling in r

Data Handling in RData Handling in RGetting,Reading and Cleaning data

Abhik SealIndiana University School of Informatics and Computing(dsdht.wikispaces.com)

Page 2: Data handling in r

Get/set your working directoryGet/set your working directoryA basic component of working with data is knowing your working directory

The two main commands are getwd() and setwd().

Be aware of relative versus absolute paths

Important difference in Windows setwd("C:\\Users\\datasc\\Downloads")

·

·

·

Relative - setwd("./data"), setwd("../")

Absolute - setwd("/Users/datasc/data/")

-

-

·

2/40

Page 3: Data handling in r

Checking for and creating directoriesChecking for and creating directoriesfile.exists("directoryName") will check to see if the directory exists

dir.create("directoryName") will create a directory if it doesn't exist

Here is an example checking for a "data" directory and creating it if it doesn't exist

·

·

·

if(!file.exists("data")){ dir.create("data")}

3/40

Page 4: Data handling in r

Reading data filesReading data files

We wil look at each of the methods

From Internet

Reading local files

Reading Excel Files

Reading XML

Reading JSON

Reading MySQL

Reading HDF5

Reading from other resources

·

·

·

·

·

·

·

·

4/40

Page 5: Data handling in r

Getting data from InternetGetting data from Internet

Data from Healthit.gov

Use of download.file()

Useful for downloading tab-delimited, csv, and other files

·

·

fileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv"download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl")list.files("./data")

## [1] "NAMCS.csv"

5/40

Page 6: Data handling in r

Getting data from InternetGetting data from InternetReading the data using read.csv()

data<-read.csv("http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv")head(data,2)

## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices## 1 Alabama 2013 0.48## 2 Alaska 2013 0.50## Adoption.of.Basic.EHRs..Primary.Care.Providers## 1 0.50## 2 0.52## Adoption.of.Basic.EHRs..Rural.Providers## 1 0.54## 2 0.37## Adoption.of.Basic.EHRs..Small.Practices## 1 0.40## 2 0.39## Percent.of.office.based.physicians.with.computerized.capability.to.view.lab.results## 1 0.74## 2 0.75 6/40

Page 7: Data handling in r

Some notes about download.file()Some notes about download.file()If the url starts with http you can use download.file()

If the url starts with https on Windows you may be ok

If the url starts with https on Mac you may need to set method="curl"

If the file is big, this might take a while

Be sure to record when you downloaded.

·

·

·

·

·

7/40

Page 8: Data handling in r

Loading flat files - read.table()Loading flat files - read.table()This is the main function for reading data into R

Flexible and robust but requires more parameters

Reads the data into RAM - big data can cause problems

Important parameters file, header, sep, row.names, nrows

Related: read.csv(), read.csv2()

Both read.table() and read.fwf() use scan to read the file, and then process the results of scan.They are very convenient, but sometimes it is better to use scan directly

·

·

·

·

·

·

8/40

Page 9: Data handling in r

Example dataExample datafileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv"download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl")list.files("./data")

## [1] "NAMCS.csv"

Data <- read.table("./data/NAMCS.csv")

## Error: line 2 did not have 87 elements

head(Data,2)

## Error: object 'Data' not found

9/40

Page 10: Data handling in r

Example parametersExample parametersread.csv sets sep="," and header=TRUE

same as

cameraData <- read.table("./data/NAMCS.csv",sep=",",header=TRUE)

cameraData <- read.csv("./data/NAMCS.csv")head(cameraData)

## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices## 1 Alabama 2013 0.48## 2 Alaska 2013 0.50## 3 Arizona 2013 0.51## 4 Arkansas 2013 0.46## 5 California 2013 0.54## 6 Colorado 2013 0.39## Adoption.of.Basic.EHRs..Primary.Care.Providers## 1 0.50## 2 0.52## 3 0.63

10/40

Page 11: Data handling in r

Some more important parametersSome more important parameters

People face trouble with reading flat files those have quotation marks ` or " placed in data values,setting quote="" often resolves these.

quote - you can tell R whether there are any quoted values quote="" means no quotes.

na.strings - set the character that represents a missing value.

nrows - how many rows to read of the file (e.g. nrows=10 reads 10 lines).

skip - number of lines to skip before starting to read

·

·

·

·

11/40

Page 12: Data handling in r

read.xlsx(), read.xlsx2() {xlsx package}read.xlsx(), read.xlsx2() {xlsx package}

Reading specific rows and columnsReading specific rows and columns

library(xlsx)Data <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,header=TRUE)

colIndex <- 2:3rowIndex <- 1:4dataSub <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1, colIndex=colIndex,rowIndex=rowIndex)

12/40

Page 13: Data handling in r

Further notesFurther notesThe write.xlsx function will write out an Excel file with similar arguments.

read.xlsx2 is much faster than read.xlsx but for reading subsets of rows may be slightly unstable.

The XLConnect is a Java-based solution, so it is cross platform and returns satisfactory results.For large data sets it may be very slow.

xlsReadWrite is very fast: it doesn't support .xlsx files

gdata package provides a good cross platform solutions. It is available for Windows, Mac orLinux. gdata requires you to install additional Perl libraries. Perl is usually already installed inLinux and Mac, but sometimes require more effort in Windows platforms.

In general it is advised to store your data in either a database or in comma separated files (.csv)or tab separated files (.tab/.txt) as they are easier to distribute.

I found on the web a self made function to easily import xlsx files. It should work in all platformsand use XML

·

·

·

·

·

·

·

source("https://gist.github.com/schaunwheeler/5825002/raw/3526a15b032c06392740e20b6c9a179add2cee49/xlsxToR = function("myfile.xlsx", header = TRUE)

13/40

Page 14: Data handling in r

Working with XMLWorking with XML

http://en.wikipedia.org/wiki/XML

Extensible markup language

Frequently used to store structured data

Particularly widely used in internet applications

Extracting XML is the basis for most web scraping

Components

·

·

·

·

·

Markup - labels that give the text structure

Content - the actual text of the document

-

-

14/40

Page 15: Data handling in r

Read the file into RRead the file into Rlibrary(XML)fileUrl <- "http://www.w3schools.com/xml/simple.xml"doc <- xmlTreeParse(fileUrl,useInternal=TRUE)rootNode <- xmlRoot(doc)xmlName(rootNode)

## [1] "breakfast_menu"

names(rootNode)

## food food food food food ## "food" "food" "food" "food" "food"

15/40

Page 16: Data handling in r

Directly access parts of the XML documentDirectly access parts of the XML documentrootNode[[1]]

## <food>## <name>Belgian Waffles</name>## <price>$5.95</price>## <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>## <calories>650</calories>## </food>

rootNode[[1]][[1]]

## <name>Belgian Waffles</name>

Go for a tour of XML package

Official XML tutorials short, long

An outstanding guide to the XML package

·

·

·

16/40

Page 17: Data handling in r

JSONJSON

http://en.wikipedia.org/wiki/JSON

Javascript Object Notation

Lightweight data storage

Common format for data from application programming interfaces (APIs)

Similar structure to XML but different syntax/format

Data stored as

·

·

·

·

·

Numbers (double)

Strings (double quoted)

Boolean (true or false)

Array (ordered, comma separated enclosed in square brackets [])

Object (unorderd, comma separated collection of key:value pairs in curley brackets {})

-

-

-

-

-

17/40

Page 18: Data handling in r

Example JSON fileExample JSON file

18/40

Page 19: Data handling in r

Reading data from JSON {jsonlite package}Reading data from JSON {jsonlite package}library(jsonlite)# Using chembl apijsonData <- fromJSON("https://www.ebi.ac.uk/chemblws/compounds/CHEMBL1.json")names(jsonData)

## [1] "compound"

jsonData$compound$chemblId

## [1] "CHEMBL1"

jsonData$compound$stdInChiKey

## [1] "GHBOEFUAGSHXPO-XZOTUCIWSA-N"

19/40

Page 20: Data handling in r

Writing data frames to JSONWriting data frames to JSONmyjson <- toJSON(iris, pretty=TRUE)cat(myjson)

## [## {## "Sepal.Length" : 5.1,## "Sepal.Width" : 3.5,## "Petal.Length" : 1.4,## "Petal.Width" : 0.2,## "Species" : "setosa"## },## {## "Sepal.Length" : 4.9,## "Sepal.Width" : 3,## "Petal.Length" : 1.4,## "Petal.Width" : 0.2,## "Species" : "setosa"## },## {## "Sepal.Length" : 4.7, 20/40

Page 21: Data handling in r

Further resourcesFurther resourceshttp://www.json.org/

A good tutorial on jsonlite - http://www.r-bloggers.com/new-package-jsonlite-a-smarter-json-encoderdecoder/

jsonlite vignette

·

·

·

21/40

Page 22: Data handling in r

mySQLmySQL

http://en.wikipedia.org/wiki/MySQL http://www.mysql.com/

Free and widely used open source database software

Widely used in internet based applications

Data are structured in

Each row is called a record

·

·

·

Databases

Tables within databases

Fields within tables

-

-

-

·

22/40

Page 23: Data handling in r

Step 2 - Install RMySQL ConnectorStep 2 - Install RMySQL ConnectorOn a Mac: install.packages("RMySQL")

On Windows:

·

·

Official instructions - http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL (may be useful forMac/UNIX users as well)

Potentially useful guide - http://www.ahschulz.de/2013/07/23/installing-rmysql-under-windows/

-

-

23/40

Page 24: Data handling in r

UCSC MySQLUCSC MySQL

http://genome.ucsc.edu/goldenPath/help/mysql.html

24/40

Page 25: Data handling in r

Connecting and listing databasesConnecting and listing databaseslibrary(DBI)library(RMySQL)

ucscDb <- dbConnect(MySQL(),user="genome", host="genome-mysql.cse.ucsc.edu")result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb);

## [1] TRUE

head(result)

## Database## 1 information_schema## 2 ailMel1## 3 allMis1## 4 anoCar1## 5 anoCar2## 6 anoGam1

25/40

Page 26: Data handling in r

Connecting to hg19 and listing tablesConnecting to hg19 and listing tableslibrary(RMySQL)hg19 <- dbConnect(MySQL(),user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu")allTables <- dbListTables(hg19)length(allTables)

## [1] 11006

allTables[1:5]

## [1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass"## [5] "acemblyPep"

26/40

Page 27: Data handling in r

Get dimensions of a specific tableGet dimensions of a specific tabledbListFields(hg19,"affyU133Plus2")

## [1] "bin" "matches" "misMatches" "repMatches" "nCount" ## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand" ## [11] "qName" "qSize" "qStart" "qEnd" "tName" ## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes" ## [21] "qStarts" "tStarts"

dbGetQuery(hg19, "select count(*) from affyU133Plus2")

## count(*)## 1 58463

27/40

Page 28: Data handling in r

Read from the tableRead from the tableaffyData <- dbReadTable(hg19, "affyU133Plus2")head(affyData)

## bin matches misMatches repMatches nCount qNumInsert qBaseInsert## 1 585 530 4 0 23 3 41## 2 585 3355 17 0 109 9 67## 3 585 4156 14 0 83 16 18## 4 585 4667 9 0 68 21 42## 5 585 5180 14 0 167 10 38## 6 585 468 5 0 14 0 0## tNumInsert tBaseInsert strand qName qSize qStart qEnd tName## 1 3 898 - 225995_x_at 637 5 603 chr1## 2 9 11621 - 225035_x_at 3635 0 3548 chr1## 3 2 93 - 226340_x_at 4318 3 4274 chr1## 4 3 5743 - 1557034_s_at 4834 48 4834 chr1## 5 1 29 - 231811_at 5399 0 5399 chr1## 6 0 0 - 236841_at 487 0 487 chr1## tSize tStart tEnd blockCount## 1 249250621 14361 15816 5## 2 249250621 14381 29483 17 28/40

Page 29: Data handling in r

Select a specific subsetSelect a specific subsetquery <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")affyMis <- fetch(query); quantile(affyMis$misMatches)

## 0% 25% 50% 75% 100% ## 1 1 2 2 3

affyMisSmall <- fetch(query,n=10); dbClearResult(query);

## [1] TRUE

dim(affyMisSmall)

## [1] 10 22

# close connectiondbDisconnect(hg19)

29/40

Page 30: Data handling in r

Further resourcesFurther resourcesRMySQL vignette http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf

R data import and export

Set up R odbc with postgres

A nice blog post summarizing some other commands http://www.r-bloggers.com/mysql-and-r/

·

·

·

·

30/40

Page 31: Data handling in r

HDF5HDF5

http://www.hdfgroup.org/

Used for storing large data sets

Supports storing a range of data types

Heirarchical data format

groups containing zero or more data sets and metadata

datasets multidimensional array of data elements with metadata

·

·

·

·

Have a group header with group name and list of attributes

Have a group symbol table with a list of objects in group

-

-

·

Have a header with name, datatype, dataspace, and storage layout

Have a data array with the data

-

-

31/40

Page 32: Data handling in r

R HDF5 packageR HDF5 packageThe rhdf5 package works really well, although it is not in CRAN. To install it:

source("http://bioconductor.org/biocLite.R")

## Bioconductor version 2.13 (BiocInstaller 1.12), ?biocLite for help## A newer version of Bioconductor is available after installing a new## version of R, ?BiocUpgrade for help

biocLite("rhdf5")

## BioC_mirror: http://bioconductor.org## Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.0.3.## Installing package(s) 'rhdf5'

## ## The downloaded binary packages are in## /var/folders/pm/jg6blwt55b71g8jl64wfw8ch0000gn/T//RtmpuYnNzs/downloaded_packages

32/40

Page 33: Data handling in r

Creating an HDF5 file and group hierarchyCreating an HDF5 file and group hierarchylibrary(rhdf5)h5createFile("myhdf5.h5")

## [1] TRUE

h5createGroup("myhdf5.h5","foo")

## [1] TRUE

h5createGroup("myhdf5.h5","baa")

## [1] TRUE

h5createGroup("myhdf5.h5","foo/foobaa")

## [1] TRUE33/40

Page 34: Data handling in r

hdf5 continuedhdf5 continued

Saving multiple objects to an HDF5 file

h5ls("myhdf5.h5")

## group name otype dclass dim## 0 / baa H5I_GROUP ## 1 / foo H5I_GROUP ## 2 /foo foobaa H5I_GROUP

A = 1:7; B = 1:18; D = seq(0,1,by=0.1)h5save(A, B, D, file="newfile2.h5")h5dump("newfile2.h5")

## $A## [1] 1 2 3 4 5 6 7## ## $B## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18## 34/40

Page 35: Data handling in r

Reading from other resourcesReading from other resourcesforeign package

Loads data from Minitab, S, SAS, SPSS, Stata,Systat

Basic functions read.foo

See the help page for more details http://cran.r-project.org/web/packages/foreign/foreign.pdf

·

·

read.arff (Weka)

readline() read from console

read.dta (Stata)

read.clipboard()

read.mtp (Minitab)

read.octave (Octave)

read.spss (SPSS)

read.xport (SAS)

-

-

-

-

-

-

-

-

·

35/40

Page 36: Data handling in r

Reading imagesReading imagesjpeg - http://cran.r-project.org/web/packages/jpeg/index.html

readbitmap - http://cran.r-project.org/web/packages/readbitmap/index.html

png - http://cran.r-project.org/web/packages/png/index.html

EBImage (Bioconductor) - http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html

·

·

·

·

36/40

Page 37: Data handling in r

Reading GIS dataReading GIS datargdal - http://cran.r-project.org/web/packages/rgdal/index.html

rgeos - http://cran.r-project.org/web/packages/rgeos/index.html

raster - http://cran.r-project.org/web/packages/raster/index.html

·

·

·

37/40

Page 38: Data handling in r

Reading music dataReading music datatuneR - http://cran.r-project.org/web/packages/tuneR/

seewave - http://rug.mnhn.fr/seewave/

·

·

38/40

Page 39: Data handling in r

AcknowledgemntAcknowledgemntJeff Leek University of Washington and Coursera Getting and Cleaning data

R For Natural Resources Course

R Data import comprehensive guide

·

·

·

39/40

Page 40: Data handling in r

40/40