modelling with r

Upload: jeiel-franca

Post on 02-Mar-2018


  • 7/26/2019 Modelling With R


    Modelling with R: part 2

    October 2, 2011

    By MK


    (This article was first published on We think therefore we R, and kindly contributed to R-bloggers)

    I apologize for the delay in the second post (just in case anybody was waiting); I had been very

    involved with work the past week. I shall try to be more regular. Well, in the previous post, we

    successfully imported data into R and got a basic feel of it by looking at the various variables

    present and their types as well. Now we will try to process the data to make sense of it. Data by

    themselves are just space-hogging particles, aesthetically challenged, and practically worthless

    unless, well, unless we can get some information out of them. And to get that information,

    they need to be processed, transformed, and at times coerced. This post will describe how we can

    start to do that. So, let's grab its throat and make it spit out the ugly truth (excuse me for the

    histrionics). Also, this last sentence bore no relation to this horrible Gerard Butler movie.

    ##########-1.2: Processing the data###############

    There are three main steps involved here:

    1. Preliminary visualization

    2. Data transformation and/or variable creation

    3. Development-validation division of data set

    Let's start with 1.

    Suppose we want to check the distribution of amount in the given data. We can use a simple plot

    command.

    plot(amount, type = "l", col = "royalblue")

    plot(age, type = "l", col = "brown")

    Now the plots we have created present the variation in these variables in a manner that is neither

    easily discernible nor clean.

    It would be better if we created histograms to check the frequency distribution of these variables.

    hist(amount)

    hist(age)

    # The hist command has a lot of options that help extend the features of the plot.
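To give a feel for a few of those options, here is a minimal sketch. The `amount` and `age` vectors below are simulated stand-ins (an assumption, since the real values come from the data imported in the previous post), and the bin choices are purely illustrative.

```r
# Simulated stand-ins for the amount and age variables (assumption:
# the real values come from the credit data set used in this series).
set.seed(1)
amount <- round(rlnorm(1000, meanlog = 8, sdlog = 0.8))
age    <- sample(19:75, 1000, replace = TRUE)

# breaks suggests the number of bins; main and xlab label the plot.
h.amount <- hist(amount, breaks = 30, col = "royalblue", border = "white",
                 main = "Distribution of amount", xlab = "Amount")

# Explicit break points give fixed 5-year bins. freq = FALSE plots
# densities instead of counts, so a density curve can be overlaid.
h.age <- hist(age, breaks = seq(15, 75, by = 5), freq = FALSE, col = "grey",
              main = "Distribution of age", xlab = "Age (years)")
lines(density(age), col = "brown", lwd = 2)
```

hist() also returns the bin counts invisibly, so h.amount$counts and h.age$counts can be inspected without looking at the plot.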

    Now, with the hist command, we are able to see the picture clearly (literally). But it is not always

    appropriate to plot a histogram. The best approach depends largely on the problem at hand.

    Suppose we had multiple time series and wanted to check their behaviour; then it would be better

    to use the plot command, which will not only present the data in a much neater way but also enable

    us to compare different time series in a single plot. For a simple example, you can check

    Shreyes' post.
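As a rough illustration of the time series case, the sketch below draws three series on one set of axes. The series are simulated random walks (hypothetical data, not part of the credit data set), and the names a, b and c are made up for the example.

```r
set.seed(2)
# Three simulated random-walk series (hypothetical data, for illustration only).
n <- 120
series <- ts(cbind(a = cumsum(rnorm(n)),
                   b = cumsum(rnorm(n)),
                   c = cumsum(rnorm(n))),
             start = c(2001, 1), frequency = 12)

# plot.type = "single" draws all series on one set of axes
# instead of one panel per series.
series.cols <- c("royalblue", "brown", "darkgreen")
plot(series, plot.type = "single", col = series.cols, lty = 1,
     ylab = "Value", main = "Three series in one plot")
legend("topleft", legend = colnames(series), col = series.cols,
       lty = 1, bty = "n")
```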

    Coming back, we can similarly visualize the pattern, frequency and distributions of other

    variables as well.

    Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it

    is better to create categories out of continuous variables. This helps a lot while implementing the model

    that we build because it is more convenient to come up with strategies for individuals belonging to a

    particular income group rather than for all individuals with specific incomes.

    For this we need to bin some variables like amount and age. One approach to do this is to run the

    following code

    DO NOT run this chunk of code. I will explain why later.

    # g.data$amount


    There is an important point to note here. Above, while creating the category variable, we overwrote

    the original variable amount in the R object g.data. Ideally, overwriting like this is not advisable,

    because if we later find that there was an error in our code or some flaw in the logic and

    we need to change it, we will have to redo all the steps that we have done up to this point. But there is

    another side here as well. R, while working, stores all the data and the objects that we create in

    RAM, and hence if the data set is considerably large, then creating additional variables by

    transformation is not a very wise idea either. This trade-off needs to be balanced.

    In this case, the data set is quite small and hence it would be better if we create an additional

    object instead of overwriting the original one.

    g.data$amt.fac
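The snippet above is cut off in this copy. One possible way to bin amount into a new variable is with cut(), sketched below. This is not necessarily the code the post used: the simulated g.data, the break points and the labels are all assumptions made for illustration.

```r
set.seed(3)
# Hypothetical stand-in for g.data; only the amount column is simulated here.
g.data <- data.frame(amount = round(runif(1000, min = 250, max = 18000)))

# Bin amount into a NEW factor column, leaving the original untouched.
# The break points and labels are illustrative, not taken from the post.
g.data$amt.fac <- cut(g.data$amount,
                      breaks = c(0, 2500, 5000, 10000, Inf),
                      labels = c("0-2500", "2500-5000", "5000-10000", "10000+"))

table(g.data$amt.fac)  # frequency of each bin
str(g.data)            # amount is still numeric; amt.fac is a factor
```

Because amount is left as it was, a mistake in the break points can be fixed by re-running only the cut() line.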


    library(lattice)  # histogram() is provided by the lattice package

    histogram(employment, col = "grey")

    histogram(sav_acct, col = "grey")

    As a last step in this stage, we need to create a development sample and a validation sample. We

    take about 70% of the data as the development sample and 30% as the validation sample.

    d
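The code for this step is cut off in this copy. A minimal sketch of a random 70/30 split, assuming a data frame called g.data, could look like the following. The simulated data and the names dev.idx, dev.sample and val.sample are hypothetical, chosen only for this example.

```r
set.seed(4)  # makes the random split reproducible
# Hypothetical stand-in for g.data (1000 rows).
g.data <- data.frame(id = 1:1000,
                     amount = round(runif(1000, min = 250, max = 18000)))

n <- nrow(g.data)
# Draw 70% of the row indices at random, without replacement.
dev.idx <- sample(seq_len(n), size = round(0.7 * n))

dev.sample <- g.data[dev.idx, ]   # development sample
val.sample <- g.data[-dev.idx, ]  # validation sample: the remaining rows

dim(dev.sample)
dim(val.sample)
```

Setting the seed before sampling means the same rows end up in each sample every time the code is run, which makes later results easier to reproduce.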