Data Analysis and Programming in R
Eswar Sai Santosh Bandaru
R
• What is R?
• A programming language meant for statistical analysis and data mining
• https://en.wikipedia.org/wiki/R_(programming_language)
• Why R?
• Effective data manipulation, storage, and graphical display
• Free of cost, open source
• Many packages contributed by experienced programmers and statisticians
• https://cran.r-project.org/web/packages/available_packages_by_name.html
• Simple and elegant code, easy to learn
• Microsoft is integrating R into SQL Server
• Problems:
• Memory management: data sits in RAM
• Speed
• Many developments are under way to address these problems.
RStudio Interface: Console
Console: Run your code here
RStudio Interface: Editor
Editor: Save and edit your code here
RStudio Interface: Output
Output: plots and help
General Things:
• R is case sensitive
• Shortcuts (RStudio):
• CTRL+ENTER (important): send code from the editor to the console and execute it
• CTRL+2: move the cursor from the editor to the console
• CTRL+1: move the cursor from the console to the editor
• CTRL+UP in the console: retrieve previous commands
• # (hash) is used for commenting the code
• CTRL+SHIFT+C: comment/uncomment a block of code
R as a calculator
• + : Addition -- 2+3 output: 5
• - : Subtraction -- 4-5 output: -1
• * : Multiplication -- 2*3 output: 6
• ^ or ** : Exponentiation -- 2^3 or 2**3 output: 8
• / : Division -- 17/3 output: 5.666667
• %% : Modulo -- 17%%3 output: 2
• %/% : Integer division -- 17%/%3 output: 5
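The operators above can be tried directly at the console. This is a minimal sketch; the name results is only illustrative and does not come from the slides.

```r
# Each arithmetic operator from the list, stored in a named vector
results <- c(
  addition         = 2 + 3,    # 5
  subtraction      = 4 - 5,    # -1
  multiplication   = 2 * 3,    # 6
  exponentiation   = 2 ^ 3,    # 8 (2 ** 3 gives the same result)
  division         = 17 / 3,   # 5.666667
  modulo           = 17 %% 3,  # 2
  integer_division = 17 %/% 3  # 5
)
print(results)
```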
Assignments and Expression
• "<-" is the assignment operator in R
• a <- 3: the value 3 gets assigned to the variable a
• Expressions
• A combination of numbers, variables, and operators
• E.g., 2+3*a/14
• Order of evaluation: brackets -> exponentiation -> multiplication/division (equal precedence, evaluated left to right) -> addition/subtraction
• E.g., 7*9/13 -- 4.846154
• -2^0.5 -- -1.414214 (exponentiation is applied before the minus sign)
• (-2)^0.5 -- NaN
• Q1
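A short sketch of assignment and order of evaluation, using the expressions from this slide. The variable names are illustrative only.

```r
a <- 3                   # assignment with <-
expr <- 2 + 3 * a / 14   # 3 * a / 14 is evaluated first, then 2 is added
chained <- 7 * 9 / 13    # * and / run left to right: (7 * 9) / 13
neg_root <- -2 ^ 0.5     # same as -(2 ^ 0.5)
nan_root <- (-2) ^ 0.5   # NaN: negative base with a fractional power

print(c(expr, chained, neg_root, nan_root))
```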
Data Types
• Numeric: real numbers. E.g., 1.24, -3.12, 1
• Integer: integer values, written with the suffix L. E.g., 24L
• Character: E.g., 'a', "a", "Hello World!", "2"
• Logical: Boolean type. TRUE (1), FALSE (0); T and F are shorthand
• Complex: a+bi, where a and b are real numbers
• class(): function used to check the class of an object
• E.g., class(24) -- numeric
• E.g., class(24L) -- integer
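As a quick check of the types listed above, class() can be called on one literal of each kind. The name true_plus_true is illustrative.

```r
class(1.24)            # "numeric"
class(24L)             # "integer"
class("Hello World!")  # "character"
class(TRUE)            # "logical"
class(2 + 3i)          # "complex"

# TRUE and FALSE behave as 1 and 0 in arithmetic
true_plus_true <- TRUE + TRUE   # 2
```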
Data structures
• 4 main types:
• Vectors
• Matrices
• Lists
• Data frames
• We will discuss vectors and data frames in today's session
Vectors:
• A one-dimensional collection of objects of the same kind (same data type)
• Vectors in R are similar to arrays in other programming languages
• Notation: (1,2,3,4,5), where 1,2,3,4,5 are called elements
• (1,2,3,4,5): numeric vector
• ('a','b','c','d'): character vector
• (T, F, T, T): logical vector
• (1L,2L,3L): integer vector
• (1,2,3,4,6) -- valid vector
• (1,'a',3,'t') -- invalid vector (mixed types), but R doesn't throw an error because of coercion
Creating vectors
• Basic ways:
• Using c()
• Using “:”
• Using seq()
• Using rep()
• Using vector()
c() combine function
• Syntax:
• X <- c(1,2,4,78,90) creates a numeric vector X with elements 1,2,4,78,90
• Y <- c('a','b','c','d') creates a character vector Y with elements 'a','b','c','d'
• Printing:
• X # auto printing
• print(X) # explicit printing
Using ":"
• x <- 20:50
• Creates a numeric (integer) vector x with values from 20 to 50 in increments of 1
• Ending value > starting value: default increment is +1
• y <- 50:20
• Creates a numeric (integer) vector y with values from 50 down to 20 in increments of -1
• Ending value < starting value: default increment is -1
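A small sketch of the colon operator in both directions, using the same values as above.

```r
x <- 20:50   # ascending: 20, 21, ..., 50
y <- 50:20   # descending: 50, 49, ..., 20

length(x)    # 31 elements
class(x)     # "integer"
head(y, 3)   # 50 49 48
```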
seq()
• X <- seq(2,50)
• Creates a numeric vector from 2 to 50 with an increment of +1
• X <- seq(50,2)
• Creates a numeric vector from 50 down to 2 with an increment of -1
• X <- seq(2,50,2)
• Creates a numeric vector from 2 to 50 with an increment of +2
• (2,4,6,8,10,...,50)
• The increment can also be negative if the starting element > ending element
• X <- seq('a','b',2) throws an error
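The seq() calls above, plus one with a negative increment, might look like this. The variable names are illustrative; by is the name of the third argument.

```r
up    <- seq(2, 50)           # 2, 3, ..., 50
down  <- seq(50, 2)           # 50, 49, ..., 2
evens <- seq(2, 50, by = 2)   # 2, 4, ..., 50
desc  <- seq(50, 2, by = -2)  # 50, 48, ..., 2 (negative increment)

length(evens)                 # 25
```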
rep()
• X <- rep(c(1,2,3), times = 2)
• Creates a numeric vector X: 1,2,3,1,2,3
• The whole vector gets repeated twice
• rep(1:3, each = 2)
• Output: 1,1,2,2,3,3
• Each element in the vector gets repeated twice
• rep(1:3, each = 2, times = 3)
• Output: 1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3
• Two steps:
• 1: Each element gets repeated twice
• 2: The resulting vector gets repeated three times
• For other variations of rep, see ?rep
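The three rep() variants above, written out as a sketch with illustrative variable names:

```r
whole <- rep(c(1, 2, 3), times = 2)      # 1 2 3 1 2 3
each2 <- rep(1:3, each = 2)              # 1 1 2 2 3 3
both  <- rep(1:3, each = 2, times = 3)   # 1 1 2 2 3 3, repeated three times

length(both)                             # 18
```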
Combining vectors
• X <- c(1,2,3,4,5)
• Y <- c(1,6,7,8)
• Z <- c(X,Y)
• Combines the vectors X and Y and assigns the result to Z; output: 1,2,3,4,5,1,6,7,8
• Q1 – Q8
vector()
• X <- vector() creates an empty vector with the default data type: logical
• X<-vector (…)
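For illustration, vector() also accepts a mode and a length. The particular modes and lengths below are assumptions chosen for the example, not values from the slide.

```r
empty  <- vector()                 # logical(0): empty, default type logical
zeros  <- vector("numeric", 3)     # 0 0 0
blanks <- vector("character", 2)   # "" ""

class(empty)                       # "logical"
```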
Subsetting vectors
X <- c('a', 'b', 'c', 'd', 'e', 'f')
Index: 1 2 3 4 5 6
• X[1]: 'a'
• Unlike Python or Java, indexing starts from 1 in R
• X[5]: 'e'
• X[-1]: 'b' 'c' 'd' 'e' 'f' (every element except the first)
• X[1:3]: 'a' 'b' 'c' (the first three elements); not the same as X[3:1], which returns them in reverse order
• X[-1:-2]: 'c' 'd' 'e' 'f', or equivalently X[-2:-1]: 'c' 'd' 'e' 'f'
Example
• X[1:(length(X)-1)]
• Prints every element except for the last element
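The subsetting cases from the slides above, collected into one runnable sketch:

```r
X <- c("a", "b", "c", "d", "e", "f")

X[1]                   # "a" (indexing starts at 1)
X[5]                   # "e"
X[-1]                  # everything except the first element
X[1:3]                 # "a" "b" "c"
X[3:1]                 # "c" "b" "a" (same elements, reversed)
X[-1:-2]               # drops the first two elements
X[1:(length(X) - 1)]   # everything except the last element
```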
Element wise operations
• (45, 20, 25, 3, 4)
+
• (2, 6, 10, 1, 3)
=
(47, 26, 35, 4, 7)
Example:
• x1 <- c(1,2,3), x2 <- c(6,7,8). What is x1+2*x2?
• x1 is (1,2,3)
• 2*(6,7,8) -- (12, 14, 16) ... recycling: the 2 is reused for every element
• (1,2,3) + (12,14,16) -- (13, 16, 19)
Recycling
• 1:5 + 1
• Internally 1,2,3,4,5 + 1,1,1,1,1 (1 gets recycled 5 times to match the length of the longer vector, then the element-wise operation occurs)
• 1:6 + c(1,2)
• Internally 1,2,3,4,5,6 + 1,2,1,2,1,2 (c(1,2) gets recycled to match the length of the longer vector)
• c(1,2,3,4,5,6,7) + c(1,2,3,4) (gives a warning, because 7 is not a multiple of 4)
• Internally 1,2,3,4,5,6,7 + 1,2,3,4,1,2,3
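A sketch of element-wise arithmetic and recycling with the numbers used above. The variable names are illustrative, and suppressWarnings() is used only to keep the last example quiet.

```r
elementwise <- c(45, 20, 25, 3, 4) + c(2, 6, 10, 1, 3)   # 47 26 35 4 7

x1 <- c(1, 2, 3)
x2 <- c(6, 7, 8)
combined <- x1 + 2 * x2          # 13 16 19

r1 <- 1:5 + 1                    # 2 3 4 5 6
r2 <- 1:6 + c(1, 2)              # 2 4 4 6 6 8

# Lengths 7 and 4: R still recycles, but normally issues a warning
r3 <- suppressWarnings(c(1, 2, 3, 4, 5, 6, 7) + c(1, 2, 3, 4))   # 2 4 6 8 6 8 10
```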
Q12: Create vector q using element wise operations
Subsetting a vector with logical vector
• Y <- c('a','b','c','d')
• Y[c(T,T,F,T)]
• 'a' 'b' 'd' (an element is selected where the index is TRUE and skipped otherwise)
• Recycling
• Y[c(T)]
• The vector c(T) gets recycled until it matches the length of Y
• Every element gets selected
Comparison operators
• X <- c(1,2,3,4,5,6,7)
• X > 4 (X greater than 4)
• Outputs a logical vector with TRUE for values greater than 4 and FALSE for values less than or equal to 4
• Output: logical vector: F,F,F,F,T,T,T
• X[X > 4]
• Selects elements from X which are greater than 4
• Output: 5,6,7
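Logical subsetting and comparisons combine naturally. This sketch reuses X and Y from the slides; is_big is an illustrative name.

```r
Y <- c("a", "b", "c", "d")
Y[c(TRUE, TRUE, FALSE, TRUE)]   # "a" "b" "d"
Y[TRUE]                         # TRUE is recycled, so every element is selected

X <- c(1, 2, 3, 4, 5, 6, 7)
is_big <- X > 4                 # FALSE FALSE FALSE FALSE TRUE TRUE TRUE
X[is_big]                       # 5 6 7
```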
Conditional operators in R
• Comparison operators are used to build conditions in R
• x == y: checks for equality, outputs TRUE if equal, else FALSE
• x != y: checks for inequality
• x >= y: greater than or equal to
• x <= y: less than or equal to
• x < y: less than
• x > y: greater than
• Conditions can be combined using the & (and) and | (or) operators
• Q13-Q16
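A minimal sketch of combining conditions element-wise. The thresholds are arbitrary example values.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

both   <- x > 2 & x < 6    # TRUE only where both conditions hold
either <- x < 2 | x > 6    # TRUE where at least one condition holds

x[both]                    # 3 4 5
x[either]                  # 1 7
x[x != 4]                  # every element except 4
```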
Coercion
• x <- c(1,2,'a',3) -- does not throw an error
• The other elements in the vector get coerced to character
• Output: '1','2','a','3'
• Priority for coercion: character > numeric > logical
• Logical values convert to 1 (TRUE) and 0 (FALSE)
• Explicit coercion:
• as.* functions
• as.character(1:20) # e.g., customer IDs
• x <- c('a','b','c','d')
• as.numeric(x) -- R produces NAs (with a warning)
• Output: NA, NA, NA, NA
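Implicit and explicit coercion, sketched with illustrative variable names. suppressWarnings() only hides the "NAs introduced by coercion" warning.

```r
mixed       <- c(1, 2, "a", 3)      # "1" "2" "a" "3" (numbers become character)
num_logical <- c(1, TRUE, FALSE)    # 1 1 0 (logicals become numeric)

ids <- as.character(1:3)            # "1" "2" "3"
bad <- suppressWarnings(as.numeric(c("a", "b", "c", "d")))   # NA NA NA NA
```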
Some important functions
• which(): returns the indices of a vector at which a condition is satisfied
• x <- c(10,2,4,5,0)
• which(x > 2)
• Output: 1, 3, 4
• all(): returns TRUE if the condition is satisfied by all values in a vector, else FALSE
• all(x > 2): FALSE
• any(): returns TRUE if the condition is satisfied by at least one value in a vector, else FALSE
• any(x > 2): TRUE
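The three functions applied to the vector from this slide:

```r
x <- c(10, 2, 4, 5, 0)

which(x > 2)   # 1 3 4 (positions, not values)
all(x > 2)     # FALSE: 2 and 0 fail the condition
any(x > 2)     # TRUE: at least one value passes
```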
Attributes
• Attributes give additional information about a vector and its elements
• E.g., names of elements, dimensions, levels
• attributes(x): shows all the available attributes of x
• If there are no attributes, R outputs NULL
• We can assign attributes to an existing vector
• E.g., we can assign names to elements with the function names()
• names(x) <- student_names
• where student_names is a character vector containing the names of the students
Subsetting using names attribute
• X['Cory'] -- prints the marks of Cory
• Conceptually, R finds the index whose name is "Cory" (much like which(names(X) == "Cory"))
• Then it subsets based on that index
• X[c('Cory','James')] -- prints the marks of Cory and James
• Q16
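A sketch of the names attribute and name-based subsetting. The marks and the third student name are made-up example data, not values from the slides.

```r
marks <- c(28, 45, 31)                       # hypothetical marks
attributes(marks)                            # NULL: no attributes yet

student_names <- c("Cory", "James", "Maya")  # hypothetical names
names(marks) <- student_names
attributes(marks)                            # now contains $names

marks["Cory"]                                # Cory's marks
marks[c("Cory", "James")]                    # marks of Cory and James
marks[which(names(marks) == "Cory")]         # same element, looked up by index
```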
Updating a vector: what if Cory's marks get updated?
• X[1] <- 35
• The element at index 1 gets updated to 35
• X[X < 30 & X > 25] <- 40
• All the values greater than 25 and less than 30 get updated to 40 (use the element-wise &, not &&)
• X["Cory"] <- 67
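The three kinds of update, sketched on a small named vector. The names and marks are made-up example data.

```r
marks <- c(Cory = 28, James = 45, Maya = 27)   # hypothetical data

marks[1] <- 35                          # update by position: Cory becomes 35
marks[marks < 30 & marks > 25] <- 40    # update by condition: Maya (27) becomes 40
marks["Cory"] <- 67                     # update by name

print(marks)                            # Cory 67, James 45, Maya 40
```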
is.na() and mean imputation
• x <- c(1,2,4,NA,5,NA)
• is.na(x): produces a logical vector, TRUE if the element is NA, else FALSE
• Output: F F F T F T
• How do we replace the NAs with the mean value?
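One possible way to do mean imputation, combining is.na() with logical subsetting. This is a sketch rather than the only approach; x_mean is an illustrative name.

```r
x <- c(1, 2, 4, NA, 5, NA)

is.na(x)                          # FALSE FALSE FALSE TRUE FALSE TRUE
x_mean <- mean(x, na.rm = TRUE)   # mean of the non-missing values: 3
x[is.na(x)] <- x_mean             # overwrite only the NA positions

print(x)                          # 1 2 4 3 5 3
```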
Factors
• factor() converts a vector into categorical data
• X <- c(1,1,1,2,2,2,3,3,3)
• sum(X): 18
• X <- factor(X)
• sum(X): error
• levels(X): the categories in X
• Output: "1" "2" "3"
• class(X)
• Output: factor
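The factor example as a runnable sketch. try() is used here only so that the failing sum() call does not stop the script; the variable names are illustrative.

```r
x <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
total <- sum(x)          # 18

f <- factor(x)           # now categorical, no longer a number
levels(f)                # "1" "2" "3"
class(f)                 # "factor"

# sum() is not meaningful for factors, so this call raises an error
sum_fails <- inherits(try(sum(f), silent = TRUE), "try-error")   # TRUE
```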
Table function: frequency table
• Counts the number of times each element occurs in a vector
• X <- c('a','a','a','b','b','c','c')
• table(X):
• a - 3
• b - 2
• c - 2
• Useful when plotting a barplot
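A short sketch of building the frequency table above; counts is an illustrative name.

```r
x <- c("a", "a", "a", "b", "b", "c", "c")

counts <- table(x)   # frequency of each distinct value
print(counts)
names(counts)        # "a" "b" "c"
```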
ls() and rm()
• ls(): lists all the objects in the current R session (environment)
• rm("d"): removes the object d
• rm(list = ls()): removes all objects from the environment
Data frames:
• Data frames are simply "tables" (rows and columns)
• All values within a column must be of the same data type (hence all the vector operations are valid for each column); different columns can have different types
• Creation
• X <- data.frame(data for column 1, data for column 2, ...)
• The columns get bound together side by side
• Two-dimensional
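A minimal sketch of creating a data frame from two vectors. The column names follow the later slides; the rows are made-up example data.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya", "Lena"),   # hypothetical students
  marks        = c(28, 45, 31, 40),                    # hypothetical marks
  stringsAsFactors = FALSE                             # keep names as character
)

dim(test)              # 4 rows, 2 columns
class(test$marks)      # "numeric": each column is a vector of one type
```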
Subsetting data frames…why?
• Very useful for analyzing the data
• As it is two-dimensional, it has two indices: [row, column]
• test[3,2]: refers to the element in the 3rd row, 2nd column
• test[1:3,1:2]: first three rows, first two columns
• Using column names
• test$student_name: refers to the column student_name
• It is a vector, so we can perform all vector operations on it
• test["student_name"]: refers to the column student_name (returned as a one-column data frame)
• test["marks"]
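The subsetting forms above, sketched on a small made-up data frame (the rows are example data, not from the slides):

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya", "Lena"),   # hypothetical data
  marks        = c(28, 45, 31, 40),
  stringsAsFactors = FALSE
)

test[3, 2]             # 31: row 3, column 2
test[1:3, 1:2]         # first three rows, first two columns
test$student_name      # a character vector
test["marks"]          # still a data frame, with one column
mean(test$marks)       # 36: vector operations work on a column
```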
Students with higher than average marks?
• above_average <- (test$marks > mean(test$marks))
• test$student_name[above_average]
• Two steps:
• above_average is a logical vector
• test$student_name[above_average] selects the students for which the vector is TRUE
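The two steps, run on made-up example data (the students and marks are hypothetical):

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya", "Lena"),   # hypothetical data
  marks        = c(28, 45, 31, 40),
  stringsAsFactors = FALSE
)

# Step 1: logical vector, TRUE where marks exceed the mean (36 here)
above_average <- test$marks > mean(test$marks)

# Step 2: keep the names at the TRUE positions
top_students <- test$student_name[above_average]
print(top_students)    # "James" "Lena"
```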
Writing to a csv file
• write.csv(test, "test.csv")
• The file gets saved to the default directory (folder) R is pointing to
• To find the default directory:
• Use getwd()
Reading a csv file
• setwd("directory path")
• read.csv("file name")
• Different functions read different file types
• dir(): lists all files in the current directory
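A sketch of a write/read round trip. To avoid touching the working directory it writes to a temporary folder, which is an assumption made for this example; the data and the names csv_path and test_back are illustrative.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya", "Lena"),   # hypothetical data
  marks        = c(28, 45, 31, 40),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "test.csv")    # temporary location
write.csv(test, csv_path, row.names = FALSE)    # row.names = FALSE skips the index column

test_back <- read.csv(csv_path)                 # read the file back in
dir(tempdir())                                  # the listing includes test.csv
```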
Data inspection
• str()
• head()
• tail()
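For illustration, the three inspection functions applied to a small made-up data frame (the rows are example data):

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya", "Lena"),   # hypothetical data
  marks        = c(28, 45, 31, 40),
  stringsAsFactors = FALSE
)

str(test)                       # structure: dimensions, column names, column types
first_rows <- head(test, 2)     # first two rows
last_rows  <- tail(test, 2)     # last two rows
```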
Dates and Times in R
• Dates are stored internally as the number of days since 1970-01-01 while times are stored internally as the number of seconds since 1970-01-01
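The internal representation can be seen by converting a date or time to a number. The example dates are arbitrary, and the variable names are illustrative.

```r
d <- as.Date("1970-01-10")
days_since_epoch <- as.numeric(d)        # 9 days after 1970-01-01

tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
seconds_since_epoch <- as.numeric(tm)    # 60 seconds after the epoch

# Because dates are numbers internally, they can be subtracted
date_gap <- as.numeric(as.Date("2024-03-01") - as.Date("2024-02-01"))   # 29 days
```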
Data Visualization in R: Using R base graphics
• 3 main graphics systems:
• base graphics
• ggplot2
• lattice
• Boxplots
• Barplots
• Histograms
• Scatter plots
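A sketch of the four plot types using base graphics only. The data are made up for the example; in RStudio the plots appear in the output pane.

```r
marks <- c(28, 45, 31, 40, 36, 52, 47, 33)           # hypothetical marks
hours <- c(2, 6, 3, 5, 4, 8, 7, 3)                   # hypothetical study hours
grade <- c("B", "A", "B", "A", "B", "A", "A", "B")   # hypothetical grades

boxplot(marks, main = "Marks")                          # boxplot
barplot(table(grade), main = "Students per grade")      # barplot from a frequency table
h <- hist(marks, main = "Distribution of marks")        # histogram
plot(hours, marks, xlab = "Hours", ylab = "Marks",
     main = "Marks vs. hours")                          # scatter plot
```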