modelling with r

Modelling with R: part 2

October 2, 2011

By MK

(This article was first published on We think therefore we R, and kindly contributed to R-bloggers)

I apologize for the delay in the second post (just in case anybody was waiting); I have been very
involved with work the past week. I shall try to be more regular. Well, in the previous post, we
successfully imported data into R and got a basic feel for it by looking at the various variables
present and their types as well. Now we will try to process the data to make sense of it. Data by
themselves are just space-hogging particles, aesthetically challenged and practically worthless
unless, well, unless we can get some information out of them. And to get that information,
they need to be processed, transformed, and at times coerced. This post will describe how we can
start to do that. So, let's grab the data by the throat and make it spit out the ugly truth (excuse
me for the histrionics). Also, that last sentence bears no relation to this horrible Gerard Butler movie.

##########-1.2: Processing the data###############

There are three main steps involved here:

1. Preliminary visualization

2. Data transformation and/or variable creation

3. Development-validation division of data set

Let's start with 1.

Suppose we want to check the distribution of amount in the given data. We can use a simple plot

command.

plot(amount, type = "l", col = "royalblue")

plot(age, type = "l", col = "brown")

Now, the plots we have created present the variation in these variables in a manner that is neither
easily discernible nor clean.

It would be better if we created histograms to check the frequency distribution of these variables.

hist(amount)

hist(age)

# The hist command has a lot of options that help extend the features of the plot.
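To give a feel for those options, here is a minimal sketch. The amount vector below is simulated and only stands in for the real column from the previous post; the number of breaks, the colour and the labels are arbitrary choices of mine.

```r
# Hypothetical stand-in for the 'amount' column (simulated, NOT the real data).
set.seed(1)
amount <- round(rlnorm(1000, meanlog = 7.8, sdlog = 0.8))

# breaks is a suggestion for the number of bins; main and xlab set the labels.
h <- hist(amount,
          breaks = 20,
          col    = "grey",
          main   = "Distribution of amount",
          xlab   = "Amount")

# hist() invisibly returns the bin boundaries and counts for inspection.
head(h$breaks)
head(h$counts)
```

Keeping the returned object around is handy when you want the bin counts themselves and not just the picture.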

Now, with the hist command, we are able to see the picture clearly (literally). But it is not always
appropriate to plot a histogram. The best way depends largely on the problem at hand.
Suppose we had multiple time series and we wanted to check their behaviour; then it would be better
to use the plot command, which will not only present the data in a much neater way but also enable
us to compare different time series in a single plot. For a simple example, you can check
Shreyes' post.
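As a rough illustration of comparing several series in one plot, the sketch below uses three simulated random walks. The data, the series names and the colours are all made up for this example and have nothing to do with the credit data.

```r
# Three simulated monthly series (hypothetical data, for illustration only).
set.seed(42)
series <- ts(cbind(a = cumsum(rnorm(48)),
                   b = cumsum(rnorm(48)),
                   c = cumsum(rnorm(48))),
             start = c(2008, 1), frequency = 12)

series.cols <- c("royalblue", "brown", "darkgreen")

# plot.type = "single" draws all the series on one set of axes.
plot(series, plot.type = "single", col = series.cols,
     ylab = "Value", main = "Three series in one plot")
legend("topleft", legend = colnames(series), col = series.cols, lty = 1)
```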

Coming back, we can similarly visualize the pattern, frequency and distributions of other
variables as well.

Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it
is better to create categories out of continuous variables. This helps a lot while implementing the model
that we build, because it is more convenient to come up with strategies for individuals belonging to a
particular income group than for all individuals with specific incomes.

For this we need to bin some variables like amount and age. One approach to do this is to run the
following code.

DO NOT run this chunk of code. I will explain why later.

# g.data$amount


There is an important point to note here. Above, while creating the category variable, we overwrote
the original variable amount in the R object g.data. This is generally not advisable,
because if we later find an error in our code or a flaw in the logic and need to change it,
we will have to redo all the steps we have done up to this point. But there is
another side to this as well. R stores all the data and the objects that we create in
RAM, so if the data set is considerably large, creating additional variables by
transformation is not a very wise idea either. This trade-off needs to be balanced.
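One way to get a feel for the memory side of this trade-off is object.size(). The sketch below uses a simulated data frame called toy (a hypothetical name and hypothetical values, not the data from this series) and measures it before and after adding a transformed column.

```r
# Hypothetical data frame standing in for a larger data set.
set.seed(7)
toy <- data.frame(amount = runif(1e5, min = 250, max = 20000))
size.before <- object.size(toy)

# Adding a derived column keeps the original intact but costs extra memory.
toy$log.amount <- log(toy$amount)
size.after <- object.size(toy)

print(size.before, units = "Kb")
print(size.after, units = "Kb")
```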

In this case, the data set is quite small and hence it would be better if we create an additional

object instead of overwriting the original one.

g.data$amt.fac
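To make the idea concrete, here is a minimal sketch of binning into a new column with cut(). It is not the code from the original post: the g.data below is simulated, and the cut points and labels are illustrative assumptions of mine.

```r
# Hypothetical stand-in for g.data (simulated, NOT the data from the previous post).
set.seed(123)
g.data <- data.frame(amount = round(runif(1000, min = 250, max = 18000)),
                     age    = sample(19:75, 1000, replace = TRUE))

# Illustrative cut points only; these are assumptions, not the original ones.
amt.breaks <- c(0, 2500, 5000, 10000, Inf)
amt.labels <- c("0-2500", "2500-5000", "5000-10000", "10000+")

# Store the bins in a new column so that g.data$amount stays untouched.
g.data$amt.fac <- cut(g.data$amount, breaks = amt.breaks, labels = amt.labels)

table(g.data$amt.fac)        # frequency of each bin
is.numeric(g.data$amount)    # the original variable is still numeric
```

Since amount is still there, changing the cut points later only means re-running the cut() line.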


library(lattice)  # histogram() is provided by the lattice package

histogram(employment, col = "grey")

histogram(sav_acct, col = "grey")

As a last step in this stage, we need to create a development sample and a validation sample. We
take about 70% of the data as the development sample and 30% as the validation sample.

d
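A minimal sketch of one way to do such a split is to sample row indices at random. The g.data below is simulated, and dev.rows, dev.sample and val.sample are hypothetical names chosen for this illustration.

```r
# Hypothetical stand-in for g.data (simulated, NOT the data from the previous post).
set.seed(2011)
g.data <- data.frame(id = 1:1000,
                     amount = round(runif(1000, min = 250, max = 18000)))

# Draw about 70% of the row indices at random, without replacement.
dev.rows <- sample(nrow(g.data), size = round(0.7 * nrow(g.data)))

dev.sample <- g.data[dev.rows, ]     # development sample (about 70%)
val.sample <- g.data[-dev.rows, ]    # validation sample (the remaining rows)

nrow(dev.sample)
nrow(val.sample)
```

Setting the seed makes the split reproducible, which helps if the model has to be rebuilt later on the same samples.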
