introduction to data mining with r and data import/export in r

29
Introduction to Data Mining with R and Data Import/Export in R Yanchang Zhao http://www.RDataMining.com 30 September 2014 1 / 25

Upload: yanchang-zhao

Post on 04-Jul-2015

914 views

Category:

Software


0 download

DESCRIPTION

RDataMining Slides Series: Introduction to Data Mining with R and Data Import/Export in R

TRANSCRIPT

Page 1: Introduction to Data Mining with R and Data Import/Export in R

Introduction to Data Mining with Rand Data Import/Export in R

Yanchang Zhao

http://www.RDataMining.com

30 September 2014

1 / 25

Page 2: Introduction to Data Mining with R and Data Import/Export in R

Questions

I Do you know data mining and its algorithms and techniques?

I Have you heard of R?

I Have you used R in your research or projects?

2 / 25

Page 3: Introduction to Data Mining with R and Data Import/Export in R

Questions

I Do you know data mining and its algorithms and techniques?

I Have you heard of R?

I Have you used R in your research or projects?

2 / 25

Page 4: Introduction to Data Mining with R and Data Import/Export in R

Questions

I Do you know data mining and its algorithms and techniques?

I Have you heard of R?

I Have you used R in your research or projects?

2 / 25

Page 5: Introduction to Data Mining with R and Data Import/Export in R

Outline

Introduction to R

R Packages and Functions for Data Mining

Data Import and Export

Online Resources

3 / 25

Page 6: Introduction to Data Mining with R and Data Import/Export in R

What is R?

I R 1 is a free software environment for statistical computingand graphics.

I R can be easily extended with 5,800+ packages available onCRAN2 (as of 13 Sept 2014).

I Many other packages provided on Bioconductor3, R-Forge4,GitHub5, etc.

I R manuals on CRAN6

I An Introduction to RI The R Language DefinitionI R Data Import/ExportI . . .

1http://www.r-project.org/2http://cran.r-project.org/3http://www.bioconductor.org/4http://r-forge.r-project.org/5https://github.com/6http://cran.r-project.org/manuals.html

4 / 25

Page 7: Introduction to Data Mining with R and Data Import/Export in R

Why R?

I R is widely used in both academia and industry.

I R was ranked no. 1 in the KDnuggets 2014 poll on TopLanguages for analytics, data mining, data science7 (actuallyR has been no. 1 in 2011, 2012 & 2013!).

I The CRAN Task Views 8 provide collections of packages fordifferent tasks.

I Machine learning & atatistical learningI Cluster analysis & finite mixture modelsI Time series analysisI Multivariate statisticsI Analysis of spatial dataI . . .

7http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html

8http://cran.r-project.org/web/views/

5 / 25

Page 8: Introduction to Data Mining with R and Data Import/Export in R

Outline

Introduction to R

R Packages and Functions for Data Mining

Data Import and Export

Online Resources

6 / 25

Page 9: Introduction to Data Mining with R and Data Import/Export in R

Classification with R

I Decision trees: rpart, party

I Random forest: randomForest, party

I SVM: e1071, kernlab

I Neural networks: nnet, neuralnet, RSNNS

I Performance evaluation: ROCR

7 / 25

Page 10: Introduction to Data Mining with R and Data Import/Export in R

Clustering with R

I k-means: kmeans(), kmeansruns()9

I k-medoids: pam(), pamk()

I Hierarchical clustering: hclust(), agnes(), diana()

I DBSCAN: fpc

I BIRCH: birch

9Functions are followed with “()”, and others are packages.8 / 25

Page 11: Introduction to Data Mining with R and Data Import/Export in R

Association Rule Mining with R

I Association rules: apriori(), eclat() in package arules

I Sequential patterns: arulesSequence

I Visualisation of associations: arulesViz

9 / 25

Page 12: Introduction to Data Mining with R and Data Import/Export in R

Text Mining with R

I Text mining: tm

I Topic modelling: topicmodels, lda

I Word cloud: wordcloud

I Twitter data access: twitteR

10 / 25

Page 13: Introduction to Data Mining with R and Data Import/Export in R

Time Series Analysis with R

I Time series decomposition: decomp(), decompose(), arima(),stl()

I Time series forecasting: forecast

I Time Series Clustering: TSclust

I Dynamic Time Warping (DTW): dtw

11 / 25

Page 14: Introduction to Data Mining with R and Data Import/Export in R

Social Network Analysis with R

I Packages: igraph, sna

I Centrality measures: degree(), betweenness(), closeness(),transitivity()

I Clusters: clusters(), no.clusters()

I Cliques: cliques(), largest.cliques(), maximal.cliques(),clique.number()

I Community detection: fastgreedy.community(),spinglass.community()

12 / 25

Page 15: Introduction to Data Mining with R and Data Import/Export in R

R and Big Data

I HadoopI Hadoop (or YARN) - a framework that allows for the

distributed processing of large data sets across clusters ofcomputers using simple programming models

I R Packages: RHadoop, RHIPE

I SparkI Spark - a fast and general engine for large-scale data

processing, which can be 100 times faster than HadoopI SparkR - R frontend for Spark

I H2OI H2O - an open source in-memory prediction engine for big

data scienceI R Package: h2o

I MongoDBI MongoDB - an open-source document databaseI R packages: rmongodb, RMongo

13 / 25

Page 16: Introduction to Data Mining with R and Data Import/Export in R

R and Hadoop

I Packages: RHadoop, RHive

I RHadoop10 is a collection of R packages:

I rmr2 - perform data analysis with R via MapReduce on aHadoop cluster

I rhdfs - connect to Hadoop Distributed File System (HDFS)I rhbase - connect to the NoSQL HBase databaseI . . .

I You can play with it on a single PC (in standalone orpseudo-distributed mode), and your code developed on thatwill be able to work on a cluster of PCs (in full-distributedmode)!

I Step-by-Step Guide to Setting Up an R-Hadoop Systemhttp://www.rdatamining.com/big-data/

r-hadoop-setup-guide

10https://github.com/RevolutionAnalytics/RHadoop/wiki14 / 25

Page 17: Introduction to Data Mining with R and Data Import/Export in R

Outline

Introduction to R

R Packages and Functions for Data Mining

Data Import and Export

Online Resources

15 / 25

Page 18: Introduction to Data Mining with R and Data Import/Export in R

Data Import and Export 11

Read data from and write data to

I R native formats (incl. Rdata and RDS)

I CSV files

I EXCEL files

I ODBC databases

I SAS databases

R Data Import/Export:I http://cran.r-project.org/doc/manuals/R-data.pdf

11Chapter 2: Data Import and Export, in book R and Data Mining: Examplesand Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf

16 / 25

Page 19: Introduction to Data Mining with R and Data Import/Export in R

Save and Load R Objects

I save(): save R objects into a .Rdata file

I load(): read R objects from a .Rdata file

I rm(): remove objects from R

a <- 1:10

save(a, file = "./data/dumData.Rdata")

rm(a)

a

## Error: object ’a’ not found

load("./data/dumData.Rdata")

a

## [1] 1 2 3 4 5 6 7 8 9 10

17 / 25

Page 20: Introduction to Data Mining with R and Data Import/Export in R

Save and Load R Objects - More Functions

I save.image():save current workspace to a fileIt saves everything!

I readRDS():read a single R object from a .rds file

I saveRDS():save a single R object to a file

I Advantage of readRDS() and saveRDS():You can restore the data under a different object name.

I Advantage of load() and save():You can save multiple R objects to one file.

18 / 25

Page 21: Introduction to Data Mining with R and Data Import/Export in R

Import from and Export to .CSV FilesI write.csv(): write an R object to a .CSV fileI read.csv(): read an R object from a .CSV file

# create a data frame

var1 <- 1:5

var2 <- (1:5)/10

var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")

df1 <- data.frame(var1, var2, var3)

names(df1) <- c("VarInt", "VarReal", "VarChar")

# save to a csv file

write.csv(df1, "./data/dummmyData.csv", row.names = FALSE)

# read from a csv file

df2 <- read.csv("./data/dummmyData.csv")

print(df2)

## VarInt VarReal VarChar

## 1 1 0.1 R

## 2 2 0.2 and

## 3 3 0.3 Data Mining

## 4 4 0.4 Examples

## 5 5 0.5 Case Studies19 / 25

Page 22: Introduction to Data Mining with R and Data Import/Export in R

Import from and Export to EXCEL Files

Package xlsx : read, write, format Excel 2007 and Excel97/2000/XP/2003 files

library(xlsx)

xlsx.file <- "./data/dummmyData.xlsx"

write.xlsx(df2, xlsx.file, sheetName = "sheet1", row.names = F)

df3 <- read.xlsx(xlsx.file, sheetName = "sheet1")

df3

## VarInt VarReal VarChar

## 1 1 0.1 R

## 2 2 0.2 and

## 3 3 0.3 Data Mining

## 4 4 0.4 Examples

## 5 5 0.5 Case Studies

20 / 25

Page 23: Introduction to Data Mining with R and Data Import/Export in R

Read from Databases

I Package RODBC : provides connection to ODBC databases.

I Function odbcConnect(): sets up a connection to database

I sqlQuery(): sends an SQL query to the database

I odbcClose() closes the connection.

library(RODBC)

db <- odbcConnect(dsn = "servername", uid = "userid",

pwd = "******")

sql <- "SELECT * FROM lib.table WHERE ..."

# or read query from file

sql <- readChar("myQuery.sql", nchars=99999)

myData <- sqlQuery(db, sql, errors=TRUE)

odbcClose(db)

Functions sqlFetch(), sqlSave() and sqlUpdate(): read, writeor update a table in an ODBC database

21 / 25

Page 24: Introduction to Data Mining with R and Data Import/Export in R

Read from Databases

I Package RODBC : provides connection to ODBC databases.

I Function odbcConnect(): sets up a connection to database

I sqlQuery(): sends an SQL query to the database

I odbcClose() closes the connection.

library(RODBC)

db <- odbcConnect(dsn = "servername", uid = "userid",

pwd = "******")

sql <- "SELECT * FROM lib.table WHERE ..."

# or read query from file

sql <- readChar("myQuery.sql", nchars=99999)

myData <- sqlQuery(db, sql, errors=TRUE)

odbcClose(db)

Functions sqlFetch(), sqlSave() and sqlUpdate(): read, writeor update a table in an ODBC database

21 / 25

Page 25: Introduction to Data Mining with R and Data Import/Export in R

Import Data from SAS

Package foreign provides function read.ssd() for importing SASdatasets (.sas7bdat files) into R.

library(foreign) # for importing SAS data

# the path of SAS on your computer

sashome <- "C:/Program Files/SAS/SASFoundation/9.2"

filepath <- "./data"

# filename should be no more than 8 characters, without extension

fileName <- "dumData"

# read data from a SAS dataset

a <- read.ssd(file.path(filepath), fileName,

sascmd=file.path(sashome, "sas.exe"))

Another way: using function read.xport() to read a file in SASTransport (XPORT) format

22 / 25

Page 26: Introduction to Data Mining with R and Data Import/Export in R

Import Data from SAS

Package foreign provides function read.ssd() for importing SASdatasets (.sas7bdat files) into R.

library(foreign) # for importing SAS data

# the path of SAS on your computer

sashome <- "C:/Program Files/SAS/SASFoundation/9.2"

filepath <- "./data"

# filename should be no more than 8 characters, without extension

fileName <- "dumData"

# read data from a SAS dataset

a <- read.ssd(file.path(filepath), fileName,

sascmd=file.path(sashome, "sas.exe"))

Another way: using function read.xport() to read a file in SASTransport (XPORT) format

22 / 25

Page 27: Introduction to Data Mining with R and Data Import/Export in R

Outline

Introduction to R

R Packages and Functions for Data Mining

Data Import and Export

Online Resources

23 / 25

Page 28: Introduction to Data Mining with R and Data Import/Export in R

Online Resources

I RDataMining websitehttp://www.rdatamining.com

I R Reference Card for Data MiningI R and Data Mining: Examples and Case Studies

I RDataMining Group on LinkedIn (7,000+ members)http://group.rdatamining.com

I RDataMining on Twitter (1,700+ followers)@RDataMining

I Free online courseshttp://www.rdatamining.com/resources/courses

I Online documentshttp://www.rdatamining.com/resources/onlinedocs

24 / 25

Page 29: Introduction to Data Mining with R and Data Import/Export in R

The End

Thanks!

Email: yanchang(at)rdatamining.com

25 / 25