03/10/2018

Working with Data

L5 - 1 – R and Databases

R

• Open source statistical computing and graphics language
• Started in 1993 as an alternative to SAS, SPSS and other proprietary statistical packages
• Modelled on the earlier S language (R is a separate implementation of S, not a renamed S); the paper introducing R appeared in 1996
• R is a client and server bundled together as one executable
• It is a single user tool
• It is not multi-threaded
• Constrained to a single CPU
• Millions of R users worldwide
• Thousands of libraries available at http://cran.r-project.org
• Free

Milestones:
2018-10-03: 13122 packages
2017-06-10: 10793 packages
2017-01-09: 9870 packages
2016-06-01: 8492 packages
2015-03-13: 6400 packages
2015-02-15: 6325 packages
2014-10-29: 6000 packages
2013-11-08: 5000 packages
2012-08-23: 4000 packages
2011-05-12: 3000 packages
2009-10-04: 2000 packages
2007-04-12: 1000 packages
2004-10-01: 500 packages
2003-04-01: 250 packages

Using RJDBC

> library(RJDBC)
> # Create connection driver and open connection
> jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="c:/ojdbc6.jar")
> jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//localhost:1521/orcl", "dmuser", "dmuser")
> # list the tables in the schema
> #dbListTables(jdbcConnection)
> # get the DB connection details - it gets LOTS of info - do not run unless it is really needed
> dbGetInfo(jdbcConnection)
> # Query on the Oracle instance name.
> #instanceName <- dbGetQuery(jdbcConnection, "SELECT instance_name FROM v$instance")
> #print(instanceName)
> tableNames <- dbGetQuery(jdbcConnection, "SELECT table_name from user_tables where table_name not like 'DM$%' and table_name not like 'ODMR$%'")
> print(tableNames)
  TABLE_NAME
1 INSUR_CUST_LTV_SAMPLE
2 OUTPUT_1_2
> viewNames <- dbGetQuery(jdbcConnection, "SELECT view_name from user_views")
> print(viewNames)
  VIEW_NAME
1 MINING_DATA_APPLY_V
2 MINING_DATA_BUILD_V
3 MINING_DATA_TEST_V
4 MINING_DATA_TEXT_APPLY_V
5 MINING_DATA_TEXT_BUILD_V
6 MINING_DATA_TEXT_TEST_V
> dbDisconnect(jdbcConnection)

Different database drivers

• RODBC

• RJDBC

• ROracle

• RMySQL

• blueR

The RJDBC package is based on the database interface (DBI) established in the R community. The DBI package contains virtual classes; it is the responsibility of the underlying driver to implement the classes. RJDBC uses a combination of a JDBC compliant database driver and Java Runtime Environment (JRE) to exchange data between R and the database server.

Using RODBC for Oracle is like using an ODBC connection for any database; so long as your platform provides an ODBC manager and drivers, you are OK. On Linux, this means unixODBC, and on Windows, this means the Oracle Data Access Components package.

ODBC can sometimes be difficult to configure, which is why JDBC is used here.

If possible, always use the driver written for your database. It will be specifically optimised for that database.

Step 1 – Download the ODBC/JDBC driver for your DB

• First, download the ODBC/JDBC driver for your database

• Make sure it is the correct version

• Save it somewhere on your search path

http://www.oracle.com/technetwork/apps-tech/jdbc-112010-090769.html

Also make sure you have the JRE installed and on your search path

http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html

You may have this already for SAS OnDemand

Step 2 – Create your Connection

##########################################################################
# Create the connection to the Oracle schema
##########################################################################
library(RJDBC)

# Create connection driver and open connection to the database
jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="C:/Users/oracle/Downloads/ojdbc6.jar")

# dbConn <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//<hostname_or_ip>:<port_number>/<service_name>", "<username>", "<password>")

# Here is an example
dbConn <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//redwood.ict.ad.dit.ie:1521/pdb12c.ict.ad.dit.ie", "demo_student", "D1234567890")

Use the same connection details you used for SQL Developer

Next, make sure you know how to connect to your source database. You’ll need the following information for your database listener:

· Hostname or IP, e.g., database.company.com

· Port, e.g., 1521

· Service name or SID,

· Username

· Password
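The pieces listed above can be put together into the connection URL in code. The following is a minimal sketch only: build_oracle_url() is a hypothetical helper name, and the host, port and service name are placeholder values, not a real server. It uses the "@//host:port/service_name" form shown in Step 2, which expects a service name; connecting by SID uses a different URL form.

```r
# Assemble an Oracle thin-driver JDBC URL from its parts.
# build_oracle_url() is a hypothetical helper; all values below are placeholders.
build_oracle_url <- function(host, port, service_name) {
  sprintf("jdbc:oracle:thin:@//%s:%d/%s", host, as.integer(port), service_name)
}

jdbc_url <- build_oracle_url("database.company.com", 1521, "my_service")
print(jdbc_url)

# With the driver object from Step 2, the connection would then be opened as:
# dbConn <- dbConnect(jdbcDriver, jdbc_url, "<username>", "<password>")
```

Keeping the URL in one helper means the host or service name only has to be changed in one place when you move between databases.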

Step 3 – Explore the meta-data

• Don’t use the meta-data functions in these packages: they are very slow

– They also bring back far more information than you would expect

– Much of that information from the DB is of no use to you

• Use dbGetQuery to query the data dictionary of the database

– Runs the query in the DB and returns the results as a dataframe

– Try it for

• user_tables

• user_tab_columns

• user_indexes

• user_ind_columns

myTables <- dbGetQuery(dbConn, "select table_name from user_tables")
myTables

dbListFields(dbConn, "CARS")
 [1] "MPG" "CYL" "DISP" "HP" "DRAT" "WT" "QSEC" "VS" "AM" "GEAR" "CARB"

dbExistsTable(dbConn, "CARS", "ORE_USER")
[1] TRUE
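As an illustration of querying the data dictionary directly, here is a sketch that lists the columns of one table from user_tab_columns. It assumes library(RJDBC) is loaded and dbConn is an open connection as in Step 2. list_columns() is a hypothetical helper name, and the "?" placeholder relies on RJDBC binding extra arguments as prepared-statement parameters.

```r
# Sketch only: assumes library(RJDBC) is loaded and a connection is open (Step 2).
# list_columns() is a hypothetical helper, not part of RJDBC.
list_columns <- function(conn, table_name) {
  # The "?" is bound to the table name as a prepared-statement parameter.
  # Oracle stores unquoted table names in upper case, hence toupper().
  dbGetQuery(conn,
             "select column_name, data_type
                from user_tab_columns
               where table_name = ?
               order by column_id",
             toupper(table_name))
}

# Example call (needs a live connection):
# list_columns(dbConn, "CARS")
```

The query runs inside the database and only the requested columns come back, which is the reason for preferring it over the generic meta-data functions.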

Step 4 – Query the data

• Return data as an R dataframe

empData <- dbGetQuery(dbConn, "select * from emp")
empData

empData2 <- dbGetQuery(dbConn, "select * from emp where salary > 2000")

• Read the entire table

tableData <- dbReadTable(dbConn, "WHITE_WINE")
tableData

• Prepared Statements

dbSendUpdate(dbConn, "UPDATE test1 set salary=? where id=?", teachersalary, teacherid)

dbSendUpdate(dbConn, "INSERT INTO test1 VALUES (?,?)", teacherid, teachersalary)

Step 5 – Updating data

• Updating the data in a table

• dbSendUpdate

dbSendUpdate(dbConn, "UPDATE CUSTOMERS_USA SET cust_gender = 'X' WHERE cust_last_name = 'Everett'")

customers <- dbReadTable(dbConn, "CUSTOMERS_USA")
head(customers)

# you should now see the updated rows

Step 6 – Inserting data

• Inserting data into a table

cars <- mtcars
dbWriteTable(dbConn, "CARS", cars)

carsData <- dbReadTable(dbConn, "CARS")
dim(carsData)
[1] 32 11

# append=TRUE adds the same 32 rows again, so the table now holds 64 rows
dbWriteTable(dbConn, "CARS", cars, append=TRUE, overwrite=FALSE)
carsData <- dbReadTable(dbConn, "CARS")
dim(carsData)
[1] 64 11

Step 7 – Deleting data

• Deleting records in a table

dbSendUpdate(dbConn, "DELETE FROM CUSTOMERS_USA WHERE cust_last_name = 'Everett'")

Step 8 – Creating a table

• Creating a new table based on an R dataframe

dbWriteTable(dbConn, "CARS", cars)

dbWriteTable(dbConn, "CARS", cars, overwrite=FALSE)

• Removing tables

dbRemoveTable(dbConn, "CARS")

Step 8 – Creating a table

• Creating a new table based on an R dataframe

# Write the Results table to the Database
Sys.time()

if (dbExistsTable(dbConn, "MY_TABLE_NAME", "DEMO_STUDENT")) {
  warning("WARNING: Delete existing table")
  dbRemoveTable(dbConn, "MY_TABLE_NAME")
  warning("re-Creating table : Starting")
  dbWriteTable(dbConn, "MY_TABLE_NAME", df_table_results)
  dbCommit(dbConn)
  warning("re-Creating table : Finished")
} else {
  warning("INFO: Creating New table : Starting")
  dbWriteTable(dbConn, "MY_TABLE_NAME", df_table_results)
  dbCommit(dbConn)
  warning("INFO: Creating New table : Finished")
}

Sys.time()

Step 9 – Saving changes

• Saving changes to the database (persisting changes)

dbCommit(dbConn)

• Rollback statements or transactions

dbRollback(dbConn)

if (dbGetInfo(rs, what="rowsAffected") > 200) {
  warning("Something has gone wrong")
  dbRollback(dbConn)
}
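The commit and rollback calls above can be combined into one pattern: commit if every statement succeeds, roll back if any of them fails. The following is a sketch under assumptions. run_in_transaction() is a hypothetical helper name, library(RJDBC) and an open connection are assumed, and it only makes sense if auto-commit has been switched off on the connection (JDBC connections normally start with auto-commit on, in which case there is nothing left to roll back).

```r
# Sketch only: run_in_transaction() is a hypothetical helper.
# `work` is a function that receives the connection and runs the statements.
run_in_transaction <- function(conn, work) {
  tryCatch({
    result <- work(conn)   # run the caller's statements
    dbCommit(conn)         # persist only if everything succeeded
    result
  }, error = function(e) {
    dbRollback(conn)       # undo the partial changes
    stop(e)                # re-raise so the caller still sees the error
  })
}

# Example call (needs a live connection):
# run_in_transaction(dbConn, function(conn) {
#   dbSendUpdate(conn, "UPDATE CUSTOMERS_USA SET cust_gender = 'X' WHERE cust_last_name = 'Everett'")
# })
```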

Step 10 – Disconnecting etc

• What to do when you have finished your R session

# free resources occupied by the result set, before closing the connection
dbClearResult(res)

dbDisconnect(dbConn)

dbUnloadDriver(drv)

Tip 1 – When you get an error downloading the data

• Sometimes the number of records in a table can cause the creation of a dataframe to fail.

• Instead, process the data set in chunks

• Then merge into one dataframe.

res <- dbSendQuery(dbConn, "select * from sales")

result <- list()
i <- 1
result[[i]] <- dbFetch(res, n=2000)
while (nrow(chunk <- dbFetch(res, n=2000)) > 0) {
  i <- i + 1
  result[[i]] <- chunk
}

dataExtracted <- do.call(rbind, result)
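The chunk-then-merge idea can be tried out without a database. In the sketch below, make_fetcher() is a hypothetical stand-in that hands out the built-in mtcars data a few rows at a time; with a real connection, dbFetch(res, n) plays that role. The loop and the final do.call(rbind, ...) follow the same pattern as the code above.

```r
# Stand-in for a database result set: returns the next n rows of df on each call,
# and a zero-row dataframe once everything has been read.
# make_fetcher() is hypothetical and used for illustration only.
make_fetcher <- function(df) {
  pos <- 0
  function(n) {
    if (pos >= nrow(df)) return(df[0, , drop = FALSE])
    rows <- seq(pos + 1, min(pos + n, nrow(df)))
    pos <<- pos + length(rows)
    df[rows, , drop = FALSE]
  }
}

fetch_chunk <- make_fetcher(mtcars)

# Collect the chunks in a list until an empty chunk signals the end
chunks <- list()
repeat {
  chunk <- fetch_chunk(10)
  if (nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
}

# Merge all chunks into one dataframe in a single step
dataExtracted <- do.call(rbind, chunks)
dim(dataExtracted)
```

Collecting the chunks in a list and binding them once at the end avoids copying the growing dataframe on every pass through the loop.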

Using ROracle

> library(ROracle)
> drv <- dbDriver("Oracle")
> # Create the connection string
> host <- "localhost"
> port <- 1521
> sid <- "orcl"
> connect.string <- paste("(DESCRIPTION=",
+   "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
+   "(CONNECT_DATA=(SID=", sid, ")))", sep = "")
> con <- dbConnect(drv, username = "dmuser", password = "dmuser", dbname = connect.string)
> rs <- dbSendQuery(con, "select view_name from user_views")
> # fetch records from the resultSet into a data.frame
> data <- fetch(rs)
> # extract all rows
> dim(data)
[1] 6 1
> data
  VIEW_NAME
1 MINING_DATA_APPLY_V
2 MINING_DATA_BUILD_V
3 MINING_DATA_TEST_V
4 MINING_DATA_TEXT_APPLY_V
5 MINING_DATA_TEXT_BUILD_V
6 MINING_DATA_TEXT_TEST_V
> dbCommit(con)
> dbClearResult(rs)
> dbDisconnect(con)

• Needs Oracle Client in the search path

• Pulls the data to the Client

• Has a set of R functions tuned for the Oracle DB

The Challenges

• Scalability

– Regardless of the number of cores on your CPU, R will only use 1 on a default build

• Performance

– R reads data into memory by default. It is easy to exhaust RAM by storing unnecessary data. Typically R will throw an exception at 2GB.

– Parallelization can be a challenge. It is not the default, but packages are available.

• Production Deployment

– Difficulties deploying R in production

– Typically need to re-code in …..
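On the parallelization point above, the parallel package that ships with R is one of the available options. The sketch below is illustrative only: slow_square() is a made-up task, and two worker processes is an arbitrary choice. It spreads the calls over separate R processes instead of running them one after another.

```r
library(parallel)  # ships with R

# Hypothetical slow task, used only for illustration
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

cl <- makeCluster(2)                        # start two worker processes
results <- parLapply(cl, 1:8, slow_square)  # run the calls across the workers
stopCluster(cl)                             # always release the workers

unlist(results)
```

Each worker is a separate R process with its own memory, so the data passed to the workers is copied; this helps with CPU-bound work but does not remove the memory limits described above.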