2 data structure in r

Dr Nisha Arora Data Structure in R

Upload: naroranisha

Post on 15-Feb-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: 2 data structure in R

Dr Nisha Arora

Data Structure in R

Page 2: 2 data structure in R

Contents

Variable assignment in R

Numerical Operators in R

Inbuilt functions in R

Infinity, NA and NaN values in R

Atomic data types in R

Objects in R

Subsetting in R

References & Resources

Page 4: 2 data structure in R

Variable Names

Variable names in R are case-sensitive

Variable names should not begin with numbers (e.g. 1x) or

symbols (e.g. %x).

Variable names should not contain blank spaces: use

monthly_salary or monthly.salary (not monthly salary ).
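A minimal sketch of these naming rules. The variable names are illustrative only; make.names() is a base R helper that shows how R would repair an invalid name.

```r
# Names are case-sensitive: these are two different variables
salary <- 1000
Salary <- 2000
salary == Salary              # FALSE

# Underscores and dots are both allowed inside a name
monthly_salary <- 5000
monthly.salary <- 5500

# make.names() converts invalid names into syntactically valid ones
make.names("1x")              # "X1x"
make.names("monthly salary")  # "monthly.salary"
```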

Page 5: 2 data structure in R

Numerical Operators in R

Operator Description

+ Addition

- Subtraction

* Multiplication

/ Division

%/% Integer division

%% Modulo (returns the remainder of a division)

^ or ** Exponentiation
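A short sketch of the operators in the table, using arbitrary numbers chosen for illustration. Note that integer division rounds down (towards minus infinity), so the results for negative operands may be unexpected.

```r
17 / 5        # 3.4  ordinary division
17 %/% 5      # 3    integer division
17 %% 5       # 2    remainder
2 ^ 3         # 8
2 ** 3        # 8    same as ^

# With a negative operand, %/% rounds down and %% takes the sign of the divisor
(-17) %/% 5   # -4
(-17) %% 5    # 3
```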

Page 6: 2 data structure in R

Logical Operators in R

Operator Description

< Less than

<= Less than or equal to

> Greater than

>= Greater than or equal to

== Exactly equal to

!= Not equal to

! x Not x

x | y x OR y

x & y x AND y
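These operators work element by element on vectors. A small sketch, with an arbitrary vector x chosen for illustration:

```r
x <- c(2, 5, 8)

x > 4            # FALSE  TRUE  TRUE
x == 5           # FALSE  TRUE FALSE
!(x > 4)         #  TRUE FALSE FALSE
x > 4 & x < 8    # FALSE  TRUE FALSE
x < 3 | x > 7    #  TRUE FALSE  TRUE
```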

Page 7: 2 data structure in R

Inbuilt Mathematical Functions

pi; exp(1)

log(x) # log to base e of x

log10(x) # log to base 10 of x

log(x,n) # log to base n of x

floor(x) # greatest integer <= x

ceiling(x) # smallest integer >= x

lgamma(x) # natural log of gamma (x)

choose(n,x) # Binomial coefficient nCx

sqrt(x); factorial(x); gamma(x)

Page 8: 2 data structure in R

Inbuilt Mathematical Functions

trunc(x) # closest integer to x between x and 0

E.g., trunc(1.5) =1, trunc(-1.5) =-1

NOTE: trunc is like floor for positive values and like ceiling for

negative values

round(x, digits=0) # round the value of x to an integer

signif(x, digits=6) # round x to 6 significant digits

runif(n) # generates n random numbers

between 0 and 1 from a uniform distribution
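The rounding functions are easiest to compare side by side. A small sketch with illustrative values; set.seed() is used only to make the random draw reproducible.

```r
x <- c(2.7, -2.7)

floor(x)      #  2 -3
ceiling(x)    #  3 -2
trunc(x)      #  2 -2   (towards zero)

round(3.14159, digits = 2)       # 3.14
signif(123456.789, digits = 3)   # 123000

set.seed(1)            # reproducible random numbers
u <- runif(5)          # five draws from the uniform distribution on (0, 1)
all(u > 0 & u < 1)     # TRUE
```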

Page 9: 2 data structure in R

Inbuilt Trigonometric Functions

cos(x) # cosine of x in radians

sin(x) # sine of x in radians

tan(x) # tangent of x in radians

acos(x), asin(x), atan(x) # inverse trigonometric

transformations of real or complex numbers

acosh(x), asinh(x), atanh(x) # inverse hyperbolic

trigonometric transformations of real or complex numbers

abs(x) # the absolute value of x,

ignoring the minus sign if there is one

Page 10: 2 data structure in R

NAs and NaNs in R

Inf

Infinity

NA

Not available, generally interpreted as a missing value

The default type of NA is logical, unless coerced to some other type,

so the appearance of a missing value may trigger logical rather than

numeric indexing. Numeric and logical calculations with NA generally

return NA.

NaN

Not a number, e.g., 0/0

Page 11: 2 data structure in R

NAs and NaNs in R

is.nan() is used to test for NaNs

is.na() is used to test whether objects are NAs

A NaN value is also treated as NA, but not conversely.

That is, is.na() also returns TRUE for NaNs, whereas is.nan() returns FALSE for NAs
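A small sketch of the difference between the two tests, using an illustrative vector that mixes an ordinary number, NA, NaN and Inf:

```r
v <- c(1, NA, 0/0, 1/0)   # 1, NA, NaN, Inf

is.na(v)       # FALSE  TRUE  TRUE FALSE   (TRUE for both NA and NaN)
is.nan(v)      # FALSE FALSE  TRUE FALSE   (TRUE for NaN only)
is.finite(v)   #  TRUE FALSE FALSE FALSE

NA + 1         # NA: calculations with NA generally return NA
class(NA)      # "logical": the default type of NA
```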

Page 12: 2 data structure in R

Data types in R

Logical, for example, TRUE, FALSE

Numeric (sometimes called double, usually treated as floating

point number/real number), for example, 11.7, -3, 99.0, 1000

Integer, for example, 25L, 0L, -33L

Specify L suffix to get integer (i.e. 1L gives integer 1)

Complex, for example, 3 - 4i, 4+5i

Character, for example, "abc", "34", "TRUE", "3-4i", '3L'

Page 13: 2 data structure in R

Data types in R

To check the class of a variable, the class() function can be

used

For example:

class(7); class(7L); class(T); class('T'); class(3+0i)

Special numbers such as Inf and NaN are of numeric

class

For example: class(8/0); class(0/0)

Page 14: 2 data structure in R

Coercion

All elements of a vector must be the same type, so when we

attempt to combine different types they will be coerced to the

most flexible type.

Types from least to most flexible are:

Logical

Integer

Double/ Numeric

Character
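The hierarchy can be seen by combining values of different types with c() and checking the class of the result. A minimal sketch with illustrative values:

```r
class(c(TRUE, 2L))              # "integer"
class(c(TRUE, 2L, 3.5))         # "numeric"
class(c(TRUE, 2L, 3.5, "a"))    # "character"

# Every element is converted to the most flexible type present
c(TRUE, 2L, 3.5, "a")           # "TRUE" "2" "3.5" "a"
```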

Page 15: 2 data structure in R

Coercion

When a logical vector is coerced to an integer or double, TRUE

becomes 1 and FALSE becomes 0

x <- c(FALSE, FALSE, TRUE); as.numeric(x)

Total number of TRUEs

sum(x)

Proportion that are TRUE

mean(x)

Page 16: 2 data structure in R

Coercion in R

To explicitly coerce an object from one class to another, the following

functions are used

as.numeric(), as.logical(), etc.
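A short sketch of explicit coercion with illustrative inputs. When a value cannot be converted, the result is NA (as.numeric() also issues a warning, suppressed here to keep the output clean).

```r
as.numeric("3.14")                       # 3.14
as.integer(7.9)                          # 7 (the fractional part is dropped)
as.character(25L)                        # "25"
as.logical(c("TRUE", "false", "yes"))    # TRUE FALSE NA

# A string that is not a number cannot be converted and becomes NA
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                               # TRUE
```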

Page 17: 2 data structure in R

Objects in R

Vector

The basic one dimensional data structure in R is the vector

List

Lists are different from atomic vectors because their

elements can be of any type, including lists

Matrix

The basic two dimensional data structure in R is the matrix

Note: A variable with a single value is known as scalar. In R a

scalar is a vector of length 1

Page 19: 2 data structure in R

Vectors in R

To create vectors in R, use the concatenation function c()

num_var <- c(1, 2, 4.5)

Use the L suffix to get an integer rather than a double

int_var <- c(13L, 0L, 10L)

Use TRUE and FALSE (or T and F) to create logical vectors

log_var <- c(TRUE, FALSE, T, F)

Use double or single quotation marks to create a character vector

chr_var <- c("abc", "123")

Vectors can also be created by using a sequence function such as seq(), or scan()
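A brief sketch of the sequence and scan approaches, assuming seq() is the sequence function meant here. scan() normally reads values typed at the console; its text argument is used below so that the example runs non-interactively.

```r
1:5                                 # 1 2 3 4 5
seq(from = 0, to = 1, by = 0.25)    # 0.00 0.25 0.50 0.75 1.00
seq(2, 10, length.out = 5)          # 2 4 6 8 10

# scan() parses whitespace-separated numbers into a numeric vector
s <- scan(text = "3 5 7", quiet = TRUE)
s                                   # 3 5 7
```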

Page 20: 2 data structure in R

Vectors in R

To name a vector

# Assigning names directly

x <- c(Mon = 37, Tue = 41.4, Wed = 43.2)

# Using names() function

x <- c(78, 86, 89); names(x) <- c("chem", "phy", "math")

# Using setNames() function

x <- setNames(1:3, c("a", "b", "c"))

Page 21: 2 data structure in R

Vector Subsetting

x = c(11,42,23,14,55);

names(x) = c('ajay', 'ravi', 'john', 'anjali', 'namrata'); x

x[2]; x[1:3]; x[5]; x[7]

# x[n] gives the nth element of vector x; there are only 5 elements,

so x[7] is NA

x['ajay']; x[c('ravi', 'namrata')] # To select elements by

names
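Besides positions and names, a vector can also be subset with negative indices (to drop elements) and with logical conditions. These two forms are not on the slide; the sketch below reuses the same vector to show them.

```r
x <- c(ajay = 11, ravi = 42, john = 23, anjali = 14, namrata = 55)

length(x)                  # 5
is.na(x[7])                # TRUE: position 7 does not exist

x[-1]                      # everything except the first element
x[x > 20]                  # elements greater than 20: ravi, john, namrata
x[c("ravi", "namrata")]    # selection by name
```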

Page 22: 2 data structure in R

List in R

Lists are different from vectors because their elements can be of

any type, including lists.

We can construct lists by using list() instead of c()

x <- list(1:4, "abc", c(T, T, F), c(2.3, 5.9))
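A short sketch of how the elements of this list can be inspected. Single brackets return a smaller list, while double brackets extract the element itself.

```r
x <- list(1:4, "abc", c(TRUE, TRUE, FALSE), c(2.3, 5.9))

length(x)           # 4
sapply(x, class)    # "integer" "character" "logical" "numeric"

class(x[1])         # "list"     (a list holding the first element)
class(x[[1]])       # "integer"  (the first element itself)
x[[4]][2]           # 5.9        (second value of the fourth element)
```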

Page 23: 2 data structure in R

Matrix in R

To create matrix in R

x = matrix(1:9, nrow = 3, ncol = 3)

x = matrix (1:9, 3, 3) # Alternate way

To create a matrix filled by row

z = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# By default byrow is FALSE, so matrix is created by column

a <- matrix(1:9, byrow=TRUE, nrow=3) # Alternate way
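The effect of byrow is easiest to see by comparing the first row of two otherwise identical matrices. A minimal sketch; the names by_col and by_row are illustrative.

```r
by_col <- matrix(1:6, nrow = 2)                  # filled column by column (default)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)    # filled row by row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```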

Page 24: 2 data structure in R

Matrix in R

To create matrix by using cbind() command

one <- c(1,0,0)

two <- c(0,1,0)

three <- c(0,0,1)

b <- cbind(one, two, three)

To create a matrix by using rbind() command

c <- rbind(one, two, three)

Page 25: 2 data structure in R

Matrix in R

To assign names to columns and rows of matrix

x = cbind(c(78, 85, 95), c(99, 91, 85), c(67, 62, 63))

colnames(x) = c("Jan", 'Feb', "Mar")

rownames(x) = c("product1", 'product2', 'product3')

Other useful commands

dim(x); head(x); nrow(x); ncol(x); attributes(x)

rowSums(x); colSums(x)
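Once rows and columns are named, the names can be used for indexing and they carry over to the summaries. A sketch using the same matrix:

```r
x <- cbind(c(78, 85, 95), c(99, 91, 85), c(67, 62, 63))
colnames(x) <- c("Jan", "Feb", "Mar")
rownames(x) <- c("product1", "product2", "product3")

dim(x)                  # 3 3
x["product2", "Feb"]    # 91

rowSums(x)              # product1 244, product2 238, product3 243
colSums(x)              # Jan 258, Feb 275, Mar 192
```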

Page 26: 2 data structure in R

Matrix Subsetting

To find sub matrices of a given matrix

x <- matrix(1:6, 2, 3)

x[1, 2] # Element of first row, second column [single element]

x[2, 1] # Element of second row, first column [single element]

x[2, ] # All the elements of second row [returned as a vector]

x[, 1] # All the elements of first column [returned as a vector]

x[1:2, 3] # Elements of first & second row for third column only

Page 27: 2 data structure in R

Matrix Subsetting

To find sub matrices of a given matrix

x <- matrix(1:6, 2, 3)

By default, when a single element of a matrix is retrieved, it is returned

as a vector of length 1 rather than a 1 × 1 matrix.

This behaviour can be turned off by setting drop = FALSE.

x[1, 2] # Single element

x[1, 2, drop = FALSE] # Matrix of one row & one column

Page 28: 2 data structure in R

Matrix Subsetting

To find sub matrices of a given matrix

x <- matrix(1:6, 2, 3)

Similarly, sub-setting a single column or a single row results in a

vector, not a matrix (by default).

This behaviour can be turned off by setting drop = FALSE.

x[1, ] # Single row

x[1, , drop = FALSE] # Matrix of one row & three columns
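The effect of drop can be checked with dim(), which returns NULL for a plain vector and the row and column counts for a matrix. A minimal sketch:

```r
x <- matrix(1:6, 2, 3)

dim(x[1, ])                         # NULL: the row was dropped to a vector
dim(x[1, , drop = FALSE])           # 1 3
dim(x[, 2, drop = FALSE])           # 2 1
is.matrix(x[1, 2, drop = FALSE])    # TRUE: a 1 x 1 matrix
```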

Page 30: 2 data structure in R

Factors in R

Factors are used for handling categorical variables, both nominal

and ordinal.

For example,

Male, Female: nominal categorical

Low, Medium, High: ordinal categorical

Page 31: 2 data structure in R

Factors in R

To create a factor in R using factor()

gender_vector <- c("Male", "Female", "Female", "Male", "Male")

factor_gender_vector <- factor(gender_vector)

Also, try levels(factor_gender_vector)

To change the levels of factor

# Levels are sorted alphabetically ("Female", "Male"), so new labels must follow that order

levels(factor_gender_vector) = c("F", "M")

Other useful commands

summary(factor_gender_vector); table(factor_gender_vector)
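For ordinal variables such as Low, Medium, High, the order of the levels can be stated explicitly. A minimal sketch, assuming a hypothetical vector size; the levels and ordered arguments of factor() are part of base R.

```r
size <- c("Low", "High", "Medium", "Low")

# Without 'levels', the levels would be sorted alphabetically (High, Low, Medium)
f <- factor(size, levels = c("Low", "Medium", "High"), ordered = TRUE)

levels(f)        # "Low" "Medium" "High"
f[1] < f[2]      # TRUE: Low < High
as.integer(f)    # 1 3 2 1 (underlying integer codes)
table(f)         # Low 2, Medium 1, High 1
```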

Page 32: 2 data structure in R

Data frames in R

A data frame is the most common way of storing data in R, and if used

systematically makes data analysis easier.

Similar to tables (databases), dataset (SAS/SPSS) etc.

Consists of columns of different types; More general than a matrix

Columns – Variables; Rows – Observations

Convenient to hold all the data required for a data analysis

They are represented as a special type of list where every element of

the list has to have the same length

Data frames also have a special attribute called row.names

Page 33: 2 data structure in R

Data frames in R

Data frames are, well, tables (like in any spreadsheet program).

In data frames variables are typically in the columns, and cases in

the rows.

Columns can have mixed types of data; some can contain

numeric, yet others text.

If all columns contain data of a single type (all character or all

numeric), then the data can also be stored in a matrix (those are faster to

operate on).

Page 34: 2 data structure in R

Data frames in R

To create a data frame in R

Example_1:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

Example_2:

length <- c(180,175,190)

weight <- c(75,82,88)

name <- c("Anil","Ankit","Sunil")

data <- data.frame(name,length,weight)

Page 35: 2 data structure in R

Data frames in R

To combine data frames in R

Example_1: using cbind()

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

cbind(df, data.frame(z = 3:1))

Example_2: using rbind()

rbind(df, data.frame(x = 10, y = "z"))
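A sketch of what the two calls produce: cbind() adds columns and rbind() adds rows. The names wide and long are illustrative, and stringsAsFactors = FALSE is set explicitly so that the character column behaves the same in older and newer versions of R.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

wide <- cbind(df, data.frame(z = 3:1))    # adds column z
dim(wide)                                 # 3 3
names(wide)                               # "x" "y" "z"

long <- rbind(df, data.frame(x = 10, y = "z", stringsAsFactors = FALSE))  # adds a row
dim(long)                                 # 4 2
long$x                                    # 1 2 3 10
```

Note that rbind() requires the column names of both data frames to match.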

Page 37: 2 data structure in R

Data Type Conversions

Use is.foo to test for data type foo. Returns TRUE or FALSE

Use as.foo to explicitly convert it. For example,

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()

as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()

http://www.statmethods.net/management/typeconversion.html
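A short sketch of the is.foo / as.foo pattern applied to whole data structures, using a small illustrative matrix m:

```r
m <- matrix(1:4, nrow = 2)

is.matrix(m)        # TRUE
is.data.frame(m)    # FALSE

d <- as.data.frame(m)    # matrix -> data frame
is.data.frame(d)         # TRUE
names(d)                 # "V1" "V2" (default column names)

as.vector(m)             # 1 2 3 4 (matrix flattened column by column)

is.numeric("7")              # FALSE: "7" is a character string
is.numeric(as.numeric("7"))  # TRUE
```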

Page 38: 2 data structure in R

Handling of missing values

X <- c(1:8,NA)

Removing missing values

mean(X, na.rm = T) or mean(X, na.rm = TRUE)

To check for the location of missing values within a vector

which(is.na(X))

To replace the missing values with a placeholder number, say, 999

X[which(is.na(X))] = 999

Read more at: http://www.statmethods.net/input/missingdata.html
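A worked sketch of these steps on the same vector. The logical vector returned by is.na() can be used for indexing directly, so which() is optional when replacing values.

```r
X <- c(1:8, NA)

mean(X)                   # NA: the missing value propagates
mean(X, na.rm = TRUE)     # 4.5

which(is.na(X))           # 9: position of the missing value

X[is.na(X)] <- 999        # same effect as X[which(is.na(X))] = 999
X[9]                      # 999
anyNA(X)                  # FALSE
```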

Page 39: 2 data structure in R

Handling of missing values

x <- c(1, 2, NA, 4, NA, 5)

Identify missing values

bad <- is.na(x)

To remove missing values

x[!bad]

Page 40: 2 data structure in R

Handling of missing values

x <- c(1, 2, NA, 4, NA, 5); y <- c("a", "b", NA, "d", "e", NA)

df = data.frame(x,y)

To identify the rows with no missing value in either x or y

good = complete.cases(x,y); good

To take the subset of vector x with no missing value

x[good]

To take the subset of vector y with no missing value

y[good]
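The same logical vector can be used to subset the data frame itself. A minimal sketch; na.omit() is a base R shortcut that gives the same rows.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", "e", NA)
df <- data.frame(x, y)

good <- complete.cases(df)   # TRUE TRUE FALSE TRUE FALSE FALSE
df[good, ]                   # rows 1, 2 and 4
nrow(df[good, ])             # 3

na.omit(df)                  # same rows, dropped in one step
```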

Page 45: 2 data structure in R

Thank You