Post on 18-Jan-2017

Page 1: Data Analysis and Programming in R

Data Analysis and Programming in R

Eswar Sai Santosh Bandaru

Page 2: Data Analysis and Programming in R

R

• What is R?
  • A programming language meant for statistical analysis and data mining
  • https://en.wikipedia.org/wiki/R_(programming_language)

• Why R?
  • Effective data manipulation, storage, and graphical display
  • Free of cost, open source
  • Many packages contributed by experienced programmers and statisticians
    • https://cran.r-project.org/web/packages/available_packages_by_name.html
  • Simple and elegant code, easy to learn
  • Microsoft is integrating R into SQL Server

• Problems:
  • Memory management: data is held in RAM
  • Speed
  • Many developments are under way to address these problems.

Page 3: Data Analysis and Programming in R

Page 4: Data Analysis and Programming in R

RStudio Interface: Console

Console: run your code here

Page 5: Data Analysis and Programming in R

RStudio Interface: Editor

Editor: save and edit your code here

Page 6: Data Analysis and Programming in R

RStudio Interface: Output

Output: plots and help

Page 7: Data Analysis and Programming in R

General Things:

• R is case sensitive

• Shortcuts:
  • CTRL+ENTER (important): send code from the editor to the console and execute it
  • CTRL+2: move the cursor from the editor to the console
  • CTRL+1: move the cursor from the console to the editor
  • CTRL+UP in the console: retrieve previous commands

• The hash sign (#) is used for commenting code
  • CTRL+SHIFT+C: comment/uncomment a block of code

Page 8: Data Analysis and Programming in R

R as a calculator

• + : Addition -- 2+3 output: 5

• - : Subtraction -- 4-5 output: -1

• * : Multiplication -- 2*3 output: 6

• ^ or ** : Exponentiation -- 2^3 or 2**3 output: 8

• / : Division -- 17/3 output: 5.666667

• %% : Modulo division (remainder) -- 17%%3 output: 2

• %/% : Integer division -- 17%/%3 output: 5
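
The operators on this slide can be tried directly at the console. A minimal sketch; the variable names are illustrative and do not come from the slides.

```r
# Each result is stored so it can be printed or reused later
sum_result  <- 2 + 3      # addition: 5
diff_result <- 4 - 5      # subtraction: -1
prod_result <- 2 * 3      # multiplication: 6
pow_result  <- 2^3        # exponentiation (2**3 is equivalent): 8
quot_result <- 17 / 3     # division: 5.666667
mod_result  <- 17 %% 3    # remainder after division: 2
int_result  <- 17 %/% 3   # integer division: 5

# The integer quotient and the remainder reconstruct the dividend
17 == 3 * int_result + mod_result   # TRUE
```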

Page 9: Data Analysis and Programming in R

Assignments and Expressions

• "<-" is the assignment operator in R
  • a <- 3 assigns the value 3 to the variable a

• Expressions
  • A combination of numbers, variables, and operators
  • E.g., 2+3*a/14

• Order of evaluation:
  • Brackets -> exponentiation -> unary minus -> multiplication and division (equal precedence, evaluated left to right) -> addition and subtraction (left to right)
  • E.g., 7*9/13 -- 4.846154
  • -2^0.5 -- -1.414214 (the exponent is applied before the minus sign)
  • (-2)^0.5 -- NaN

• Q1
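
The evaluation order can be checked with a few expressions. This is a small sketch; the variable names are hypothetical.

```r
a <- 3                        # assignment
expr <- 2 + 3 * a / 14        # 3 * a / 14 is evaluated first, then added to 2
left_to_right <- 7 * 9 / 13   # evaluated as (7 * 9) / 13 = 4.846154
neg_root <- -2^0.5            # parsed as -(2^0.5) = -1.414214
nan_root <- (-2)^0.5          # a negative base with a fractional power gives NaN
```

Adding brackets is the simplest way to make the intended order explicit.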

Page 10: Data Analysis and Programming in R

Data Types

• Numeric: real numbers. E.g., 1.24, -3.12, 1

• Integer: whole numbers written with the suffix L. E.g., 24L

• Character: text in single or double quotes. E.g., 'a', "a", "Hello World!", "2"

• Logical: Boolean type. TRUE (1), FALSE (0); T and F are shorthand

• Complex: a+bi, where a and b are real numbers

• class(): function used to check the class of a value
  • E.g., class(24) -- numeric
  • E.g., class(24L) -- integer
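
A short sketch of class() applied to one value of each type listed above; the example values are illustrative.

```r
class(1.24)     # "numeric"
class(24)       # "numeric": a whole number without the L suffix is still numeric
class(24L)      # "integer"
class("2")      # "character": quotes make it text, not a number
class(TRUE)     # "logical"
class(2 + 3i)   # "complex"

# Logical values behave as 1 and 0 in arithmetic
true_count <- TRUE + TRUE + FALSE   # 2
```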

Page 11: Data Analysis and Programming in R

Data structures

• 4 main types:
  • Vectors
  • Matrices
  • Lists
  • Data frames

• We will discuss vectors and data frames in today's session

Page 12: Data Analysis and Programming in R

Vectors:

• One-dimensional collection of elements of the same kind (same data type)

• Vectors in R are similar to arrays in other programming languages

• Notation: (1,2,3,4,5), where 1, 2, 3, 4, 5 are called elements

• (1,2,3,4,5): numeric vector

• ('a','b','c','d'): character vector

• (T, F, T, T): logical vector

• (1L,2L,3L): integer vector

• (1,2,3,4,6): valid vector

• (1,'a',3,'t'): not valid as a mixed-type vector (R does not throw an error; it coerces every element to character)

Page 13: Data Analysis and Programming in R

Creating vectors

• Basic ways:
  • Using c()
  • Using ":"
  • Using seq()
  • Using rep()
  • Using vector()

Page 14: Data Analysis and Programming in R

c(): the combine function

• Syntax:
  • X <- c(1,2,4,78,90) creates a numeric vector X with elements 1, 2, 4, 78, 90
  • Y <- c('a','b','c','d') creates a character vector Y with elements 'a', 'b', 'c', 'd'

• Printing:
  • X # auto printing
  • print(X) # explicit printing
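
A minimal runnable version of the c() examples. Note that the function and variable names must be typed with the exact case shown, since R is case sensitive.

```r
X <- c(1, 2, 4, 78, 90)       # numeric vector
Y <- c("a", "b", "c", "d")    # character vector

X            # auto printing
print(X)     # explicit printing

length(X)    # 5
class(X)     # "numeric"
class(Y)     # "character"
```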

Page 15: Data Analysis and Programming in R

Using ":"

• x <- 20:50
  • Creates a vector x of whole numbers from 20 to 50 in increments of 1
  • Ending value > starting value: the default increment is +1

• y <- 50:20
  • Creates a vector y of whole numbers from 50 down to 20 in increments of -1
  • Ending value < starting value: the default increment is -1

Page 16: Data Analysis and Programming in R

seq()

• X <- seq(2,50)
  • Creates a numeric vector from 2 to 50 in increments of +1

• X <- seq(50,2)
  • Creates a numeric vector from 50 down to 2 in increments of -1

• X <- seq(2,50,2)
  • Creates a numeric vector from 2 to 50 in increments of +2
  • (2, 4, 6, 8, 10, ..., 50)
  • The increment can also be negative if the starting element > the ending element

• X <- seq('a','b',2) throws an error
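
The sequence examples can be sketched as follows, with the ":" operator included for comparison. The variable names are illustrative, and the tryCatch() wrapper is only there so the failing call does not stop the script.

```r
up        <- seq(2, 50)            # 2, 3, ..., 50 (same values as 2:50)
down      <- seq(50, 2)            # 50, 49, ..., 2 (same values as 50:2)
evens     <- seq(2, 50, by = 2)    # 2, 4, ..., 50
countdown <- seq(50, 2, by = -2)   # 50, 48, ..., 2

length(20:50)    # 31 elements
length(evens)    # 25 elements

# Character endpoints are not allowed; capture the error instead of stopping
char_attempt <- suppressWarnings(
  tryCatch(seq("a", "b", 2), error = function(e) "error")
)
char_attempt     # "error"
```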

Page 17: Data Analysis and Programming in R

rep()

• X <- rep(c(1,2,3), times = 2)
  • Creates a numeric vector X: 1,2,3,1,2,3
  • The vector gets repeated twice

• rep(1:3, each = 2)
  • Output: 1,1,2,2,3,3
  • Each element in the vector gets repeated twice

• rep(1:3, each = 2, times = 3)
  • Output: 1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3
  • 2 steps
    • 1: each element gets repeated twice
    • 2: the entire resulting vector gets repeated thrice

• Different variations of rep(): see ?rep
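
A runnable sketch of the three rep() calls; the variable names are hypothetical.

```r
whole_twice <- rep(c(1, 2, 3), times = 2)      # 1 2 3 1 2 3
each_twice  <- rep(1:3, each = 2)              # 1 1 2 2 3 3
both        <- rep(1:3, each = 2, times = 3)   # 1 1 2 2 3 3, repeated three times

length(both)   # 18 = 3 elements * 2 (each) * 3 (times)
```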

Page 18: Data Analysis and Programming in R

Combining vectors

• X <-c(1,2,3,4,5)

• Y<-c(1,6,7,8)

• Z<-c(X,Y)

• Combines vectors X,Y and assigns to Z, output: 1,2,3,4,5,1,6,7,8

• Q1 – Q8

Page 19: Data Analysis and Programming in R

vector()

• X <- vector(): creates an empty vector; the default data type is logical

• X<-vector (…)
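
As an illustration only (the arguments below are chosen for this sketch and are not necessarily what the slide had in mind), vector() accepts a mode and a length and fills the result with that mode's default value.

```r
empty  <- vector()                 # logical(0): empty, logical by default
zeros  <- vector("numeric", 3)     # 0 0 0
blanks <- vector("character", 2)   # "" ""
flags  <- vector("logical", 2)     # FALSE FALSE

class(empty)    # "logical"
length(empty)   # 0
```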

Page 20: Data Analysis and Programming in R

Subsetting vectors

X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6

X[1]: 'a'

• Unlike Python, Java, and similar languages, indexing starts from 1 in R

Page 21: Data Analysis and Programming in R

Subsetting vectors

X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6

X[5]: 'e'

Page 22: Data Analysis and Programming in R

Subsetting vectors

X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6

X[-1]: 'b' 'c' 'd' 'e' 'f'

Everything except the first element

Page 23: Data Analysis and Programming in R

Subsetting vectors

X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6

X[1:3]: 'a' 'b' 'c' (prints the first three elements)

Not the same as X[3:1], which returns 'c' 'b' 'a'

Page 24: Data Analysis and Programming in R

Subsetting vectors

X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6

X[-1:-2]: 'c' 'd' 'e' 'f'

or

X[-2:-1]: 'c' 'd' 'e' 'f'

Page 25: Data Analysis and Programming in R

Example

• X[1:(length(X)-1)]
  • Prints every element except the last element
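
The index-based subsetting shown on the preceding slides, collected into one runnable sketch:

```r
X <- c("a", "b", "c", "d", "e", "f")

X[1]                     # "a": indexing starts at 1
X[5]                     # "e"
X[-1]                    # "b" "c" "d" "e" "f": a negative index drops that position
X[1:3]                   # "a" "b" "c"
X[3:1]                   # "c" "b" "a": the order of the indices matters
X[-1:-2]                 # "c" "d" "e" "f": drops positions 1 and 2
all_but_last <- X[1:(length(X) - 1)]   # "a" "b" "c" "d" "e"
```

The brackets around length(X) - 1 matter: 1:length(X) - 1 would build 1:6 first and then subtract 1 from every index.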

Page 26: Data Analysis and Programming in R

Element wise operations

• (45, 20, 25, 3, 4)

+

• (2, 6, 10, 1, 3)

=

(47, 26, 35, 4, 7)

Page 27: Data Analysis and Programming in R

Example:

• x1 <- c(1,2,3), x2 <- c(6,7,8). What is x1+2*x2?

• x1 is (1,2,3)

• 2*(6,7,8) = (12,14,16) (the 2 is recycled across every element)

• (1,2,3) + (12,14,16) = (13,16,19)
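
Both element-wise examples can be reproduced as follows. The names v1, v2, and result are illustrative.

```r
v1 <- c(45, 20, 25, 3, 4)
v2 <- c(2, 6, 10, 1, 3)
v1 + v2                  # 47 26 35 4 7: elements are added position by position

x1 <- c(1, 2, 3)
x2 <- c(6, 7, 8)
result <- x1 + 2 * x2    # 2 * x2 is 12 14 16, so the result is 13 16 19
```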

Page 28: Data Analysis and Programming in R

Recycling

• 1:5 + 1
  • Internally 1,2,3,4,5 + 1,1,1,1,1 (1 gets recycled 5 times to match the length of the longer vector, then the element-wise operation occurs)

• 1:6 + c(1,2)
  • Internally 1,2,3,4,5,6 + 1,2,1,2,1,2 (c(1,2) gets recycled to match the length of the longer vector)

• c(1,2,3,4,5,6,7) + c(1,2,3,4) (gives a warning, because 7 is not a multiple of 4)
  • Internally 1,2,3,4,5,6,7 + 1,2,3,4,1,2,3
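
A sketch of the three recycling cases. The withCallingHandlers() wrapper is an addition for this example: it records the warning text so the script can show that a warning was raised while still returning the result.

```r
r1 <- 1:5 + 1          # 2 3 4 5 6
r2 <- 1:6 + c(1, 2)    # 2 4 4 6 6 8

# Lengths 7 and 4 do not divide evenly: R still recycles, but warns
warning_text <- NULL
r3 <- withCallingHandlers(
  c(1, 2, 3, 4, 5, 6, 7) + c(1, 2, 3, 4),
  warning = function(w) {
    warning_text <<- conditionMessage(w)   # keep the message for inspection
    invokeRestart("muffleWarning")         # continue without printing it
  }
)
r3                     # 2 4 6 8 6 8 10
warning_text           # the recorded warning message
```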

Page 29: Data Analysis and Programming in R

Q12: Create vector q using element wise operations

Page 30: Data Analysis and Programming in R

Subsetting a vector with logical vector

• Y <- c('a','b','c','d')

• Y[c(T,T,F,T)]
  • 'a' 'b' 'd' (selects an element where the index is TRUE, skips it where it is FALSE)

• Recycling
  • Y[c(T)]
  • The vector T gets recycled until it matches the length of Y
  • Every element gets printed

Page 31: Data Analysis and Programming in R

Comparison operators

• X <- c(1,2,3,4,5,6,7)

• X>4 (X greater than 4)
  • Outputs a logical vector with TRUE for values greater than 4 and FALSE for values less than or equal to 4
  • Output: logical vector: F,F,F,F,T,T,T

• X[X>4]
  • Selects the elements of X that are greater than 4
  • Output: 5,6,7
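
Logical subsetting and comparison fit together as in this sketch; the name is_large is hypothetical.

```r
Y <- c("a", "b", "c", "d")
Y[c(TRUE, TRUE, FALSE, TRUE)]   # "a" "b" "d"
Y[TRUE]                         # TRUE is recycled, so every element is returned

X <- c(1, 2, 3, 4, 5, 6, 7)
is_large <- X > 4               # FALSE FALSE FALSE FALSE TRUE TRUE TRUE
X[is_large]                     # 5 6 7
X[X > 4]                        # the same selection in one step
```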

Page 32: Data Analysis and Programming in R

Conditional operators in R

• Operators used to write conditions in R

• x == y: checks for equality, outputs TRUE if equal, else FALSE

• x != y: checks for inequality

• x >= y: greater than or equal to

• x <= y: less than or equal to

• x < y: less than

• x > y: greater than

• Conditions can be combined using the & (and) and | (or) operators

• Q13-Q16
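
A small sketch of combining conditions with & and |. The vector and the thresholds are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

x == 4                          # TRUE only at the fourth position
x != 4                          # TRUE everywhere except the fourth position

middle  <- x[x > 2 & x < 6]     # both conditions must hold: 3 4 5
extreme <- x[x < 2 | x > 6]     # either condition may hold: 1 7
```

The single-character forms & and | work element by element, which is what subsetting needs.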

Page 33: Data Analysis and Programming in R

Coercion

• x <- c(1,2,'a',3) -- does not throw an error
  • The other elements in the vector get coerced to character
  • Output: '1','2','a','3'

• Priority for coercion: character > numeric > logical
  • Logical values convert to 1 (TRUE) and 0 (FALSE)

• Explicit coercion:
  • as.* functions
  • as.character(1:20) # e.g., customer IDs

• X <- c('a','b','c','d')
  • as.numeric(X) -- R produces NAs (with a warning)
  • Output: NA, NA, NA, NA
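
Implicit and explicit coercion, sketched with illustrative variable names. suppressWarnings() is used here only to keep the output quiet; without it R prints a warning about NAs introduced by coercion.

```r
mixed <- c(1, 2, "a", 3)               # everything becomes character
mixed                                  # "1" "2" "a" "3"
class(mixed)                           # "character"

num_and_logical <- c(1.5, TRUE, FALSE) # logicals become 1 and 0: 1.5 1.0 0.0

ids <- as.character(1:20)              # numbers stored as text, e.g. customer IDs
two <- as.numeric("2")                 # text that looks like a number converts: 2

not_numbers <- suppressWarnings(as.numeric(c("a", "b", "c", "d")))
not_numbers                            # NA NA NA NA
```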

Page 34: Data Analysis and Programming in R

Some important functions

• which(): returns the indices of the elements for which the condition is satisfied
  • x <- c(10,2,4,5,0)
  • which(x>2)
  • Output: 1, 3, 4

• all(): returns TRUE if the condition is satisfied by all values in a vector
  • all(x>2): FALSE

• any(): returns TRUE if the condition is satisfied by any value in a vector
  • any(x>2): TRUE
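
The three functions applied to the slide's vector:

```r
x <- c(10, 2, 4, 5, 0)

positions <- which(x > 2)   # 1 3 4: the positions, not the values
x[positions]                # 10 4 5: the values at those positions

all(x > 2)                  # FALSE: 2 and 0 fail the condition
any(x > 2)                  # TRUE: at least one value passes
```

Note that all() and any() each return a single TRUE or FALSE rather than one value per element.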

Page 35: Data Analysis and Programming in R

Attributes

• Attributes: give additional information about a vector and its elements
  • E.g., names of elements, dimensions, levels

• attributes(x): shows all the available attributes of x

• If there are no attributes, R outputs NULL

• We can assign attributes to a created vector

• E.g., we can assign names to elements with the function names()
  • names(x) <- student_names
  • where student_names is a character vector containing the names of the students

Page 36: Data Analysis and Programming in R

Subsetting using the names attribute

• X['Cory'] -- prints the marks of Cory
  • Conceptually, R finds the index whose name is "Cory", much as which(names(X) == "Cory") would
  • It then subsets based on that index

• X[c('Cory','James')] -- prints the marks of Cory and James

• Q16

Page 37: Data Analysis and Programming in R

Updating a vector: What if Cory's marks get updated?

• X[1] <- 35
  • The element at index 1 gets updated to 35

• X[X<30 & X>25] <- 40
  • All the values greater than 25 and less than 30 get updated to 40

• X["Cory"] <- 67
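
Naming, name-based subsetting, and updating can be sketched together. The marks and the third student name are invented for this example; only Cory and James appear in the slides.

```r
marks <- c(28, 45, 19)                     # hypothetical marks
attributes(marks)                          # NULL: no attributes yet

student_names <- c("Cory", "James", "Maria")
names(marks) <- student_names
attributes(marks)                          # now lists the names

marks["Cory"]                              # Cory's marks: 28
marks[c("Cory", "James")]                  # marks of Cory and James: 28 45
which(names(marks) == "Cory")              # 1: the index behind the name lookup

marks[marks < 30 & marks > 25] <- 40       # only Cory's 28 lies between 25 and 30
after_range_update <- marks[["Cory"]]      # 40

marks["Cory"] <- 67                        # update by name
marks[2] <- 50                             # update by index (James)
```

The element-wise & is required inside the brackets; && is meant for single TRUE/FALSE values and is not suitable for selecting several elements.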

Page 38: Data Analysis and Programming in R

is.na() and mean imputation

• x <- c(1,2,4,NA,5,NA)
  • is.na(x): produces a logical vector, TRUE if the element is NA, else FALSE
  • Output: F F F T F T

• How do we replace the NAs with the mean value?
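
One possible answer to the question on this slide, offered as a sketch rather than the only approach: compute the mean of the non-missing values, then assign it to the positions where is.na() is TRUE. The names missing and x_mean are hypothetical.

```r
x <- c(1, 2, 4, NA, 5, NA)

missing <- is.na(x)              # FALSE FALSE FALSE TRUE FALSE TRUE
mean(x)                          # NA: any missing value makes the plain mean NA
x_mean <- mean(x, na.rm = TRUE)  # mean of 1, 2, 4, 5 = 3

x[missing] <- x_mean             # logical subsetting picks the NA positions
x                                # 1 2 4 3 5 3
```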

Page 39: Data Analysis and Programming in R

Factors

• factor() converts a vector of values into categorical data

• X <- c(1,1,1,2,2,2,3,3,3)

• sum(X): 18

• X <- factor(X)

• sum(X): error

• levels(X): the categories in X
  • Output: "1" "2" "3"

• class(X)
  • Output: factor
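
The factor example as a runnable sketch. The name X_factor and the tryCatch() wrapper are additions so the script keeps running past the error.

```r
X <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
sum(X)                        # 18

X_factor <- factor(X)         # the values are now categories, not numbers
levels(X_factor)              # "1" "2" "3"
class(X_factor)               # "factor"

# Arithmetic summaries are not defined for categories
sum_attempt <- tryCatch(sum(X_factor), error = function(e) "error")
sum_attempt                   # "error"

# Going back to numbers: convert through character first
recovered <- as.numeric(as.character(X_factor))
```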

Page 40: Data Analysis and Programming in R

Table function: frequency table

• Counts the number of times each element occurs in a vector

• X <- c('a','a','a','b','b','c','c')

• table(X):
  • a - 3
  • b - 2
  • c - 2

• Useful when plotting a barplot
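
A minimal sketch of building and reading a frequency table; counts is an illustrative name.

```r
X <- c("a", "a", "a", "b", "b", "c", "c")

counts <- table(X)     # one count per distinct value
counts

names(counts)          # "a" "b" "c"
counts[["a"]]          # 3
as.vector(counts)      # 3 2 2

# The table can be passed straight to barplot(counts) to draw the frequencies
```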

Page 41: Data Analysis and Programming in R

ls() and rm()

• ls(): lists all the objects in the current R session (environment)

• rm("d"): removes the object d

• rm(list = ls()): removes all objects from the environment

Page 42: Data Analysis and Programming in R

Data frames:

• Data frames are simply "tables" (rows and columns)

• All values within a column should be of the same data type (hence all the vector operations are valid for each column); different columns can have different types

• Creation
  • X <- data.frame(data for column 1, data for column 2, ...)
  • The columns get bound together side by side

• 2-dimensional
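
A sketch of creating a small data frame. The column names follow the later slides (student_name, marks), but the rows are invented for illustration.

```r
# Each argument becomes one column; all columns must have the same length
test <- data.frame(
  student_name = c("Cory", "James", "Maria", "Asha"),   # hypothetical students
  marks = c(35, 28, 42, 19),                            # hypothetical marks
  stringsAsFactors = FALSE                              # keep names as plain text
)

test                 # auto printing shows the table
dim(test)            # 4 2: rows, then columns
names(test)          # "student_name" "marks"
class(test$marks)    # "numeric": each column is a vector
```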

Page 43: Data Analysis and Programming in R

Subsetting data frames…why?

• Very useful for analyzing the data

• As it is 2-dimensional, it has 2 indices: [row, column]

• test[3,2]: refers to the element in the 3rd row, 2nd column

• test[1:3,1:2]: first three rows, first two columns

• Using column names
  • test$student_name: refers to the column student_name
    • It is a vector, so we can perform all vector operations on it
  • test["student_name"]: refers to the column student_name (returned as a one-column data frame)
  • test["marks"]
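
The subsetting forms above, applied to a small made-up data frame (the rows are hypothetical; only the column names come from the slides).

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maria", "Asha"),
  marks = c(35, 28, 42, 19),
  stringsAsFactors = FALSE
)

test[3, 2]            # 42: 3rd row, 2nd column
test[1:3, 1:2]        # first three rows, first two columns

test$marks            # a plain numeric vector
mean(test$marks)      # vector operations work on it: 31

class(test$marks)     # "numeric"
class(test["marks"])  # "data.frame": single brackets keep the table structure
```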

Page 44: Data Analysis and Programming in R

Students with higher than average marks??

• above_average <- (test$marks > mean(test$marks))

• test$student_name[above_average]

• Two steps:
  • above_average is a logical vector
  • test$student_name[above_average] selects the students where the vector is TRUE
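
The two steps, run on an invented data frame (the rows are hypothetical):

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maria", "Asha"),
  marks = c(35, 28, 42, 19),
  stringsAsFactors = FALSE
)

# Step 1: compare every mark with the average (31 for this data)
above_average <- test$marks > mean(test$marks)   # TRUE FALSE TRUE FALSE

# Step 2: use the logical vector to pick the matching names
top_students <- test$student_name[above_average] # "Cory" "Maria"

# The same logical vector can select whole rows
test[above_average, ]
```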

Page 45: Data Analysis and Programming in R

Writing into csv

• write.csv(test, "test.csv")

• The file gets saved to the default directory (folder) R is pointing to

• To know the default directory:
  • Use getwd()

Page 46: Data Analysis and Programming in R

Reading a csv file

• setwd("directory path")

• read.csv("file name")

• Different functions are used to read different file types

• dir(): lists all files in the current directory
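
A round trip through a CSV file, sketched with a temporary folder so that nothing is written to the working directory. The data, the csv_path name, and the row.names = FALSE choice are assumptions made for this example.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maria"),   # hypothetical rows
  marks = c(35, 28, 42),
  stringsAsFactors = FALSE
)

getwd()                                         # the folder R currently points to

csv_path <- file.path(tempdir(), "test.csv")    # temporary location for the file
write.csv(test, csv_path, row.names = FALSE)    # row.names = FALSE skips the index column

dir(tempdir())                                  # the listing includes "test.csv"

loaded <- read.csv(csv_path)                    # read the file back into a data frame
loaded
```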

Page 47: Data Analysis and Programming in R

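Data inspection

• str(): shows the structure of an object (column names, types, and first values)

• head(): shows the first rows (6 by default)

• tail(): shows the last rows (6 by default)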
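
The inspection functions tried on mtcars, a data set that ships with R. The choice of data set is an assumption for this sketch; any data frame works.

```r
str(mtcars)                        # one line per column: name, type, first values

first_rows <- head(mtcars)         # first 6 rows by default
last_three <- tail(mtcars, 3)      # the second argument sets the number of rows

dim(mtcars)                        # 32 11: rows and columns in the full data set
```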

Page 48: Data Analysis and Programming in R

Dates and Times in R

• Dates (class Date) are stored internally as the number of days since 1970-01-01, while date-times (class POSIXct) are stored internally as the number of seconds since 1970-01-01 (UTC)
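
The internal representation can be inspected with as.numeric(). A small sketch; the example dates are chosen for illustration.

```r
ten_jan <- as.Date("1970-01-10")
as.numeric(ten_jan)                  # 9: days after 1970-01-01

one_minute <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(one_minute)               # 60: seconds after 1970-01-01 00:00:00 UTC

# Because dates are numbers underneath, they can be subtracted
gap <- as.Date("2017-01-18") - as.Date("2017-01-01")
as.numeric(gap)                      # 17 days
```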

Page 49: Data Analysis and Programming in R

Data Visualization in R: Using R base graphics

• 3 main plotting systems:
  • base graphics
  • ggplot2
  • lattice

• Plot types:
  • Boxplots
  • Barplots
  • Histograms
  • Scatter plots
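
One base-graphics call for each plot type listed above, sketched on the built-in mtcars data set. The data set, the chosen columns, and the titles are assumptions for illustration.

```r
# Boxplot: distribution of miles per gallon for each cylinder count
boxplot(mpg ~ cyl, data = mtcars,
        main = "MPG by cylinders", xlab = "Cylinders", ylab = "Miles per gallon")

# Barplot: frequency of each cylinder count, built from a table
cyl_counts <- table(mtcars$cyl)
bar_mids <- barplot(cyl_counts, main = "Cars per cylinder count",
                    xlab = "Cylinders", ylab = "Number of cars")

# Histogram: distribution of a single numeric column
mpg_hist <- hist(mtcars$mpg, main = "Distribution of MPG", xlab = "Miles per gallon")

# Scatter plot: relationship between two numeric columns
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs weight", xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

In RStudio each plot appears in the output pane; when run as a script outside RStudio, R writes the plots to a file instead.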