2-d graphs & simple linear regression with r

Exploring R ProMat Consultants 2D Graphs and Simple Linear Regression with R For Absolute R Beginners

Upload: christopher-breach

Post on 25-Jan-2017


  • Exploring R

    ProMat Consultants

    2D Graphs and Simple Linear Regression with R For Absolute R Beginners

  • 2

    Stuff (p. 4)

    Preface (p. 5)

    Some (Reasonable?) Assumptions (p. 6)

    Chapter 1: Download & Install R & RStudio (p. 7)
        1.0 Introduction (p. 7)
        2.0 Downloading R (p. 8)
        2.0 Downloading RStudio (p. 11)

    Chapter 2: The RStudio Environment (p. 13)
        1.0 Introduction (p. 13)
        2.0 The RStudio Layout (p. 13)

    Chapter 3: The Working Directory (p. 18)
        1.0 Introduction (p. 18)
        2.0 Create Your Own RStudio Folder (p. 18)
        3.0 Find the Current Working Directory (p. 19)
        4.0 Change the Working Directory (p. 19)

    Chapter 4: R Packages (p. 21)
        1.0 Introduction (p. 21)
        2.0 The Packages Window (p. 22)
        3.0 Default Packages (p. 23)
        4.0 Installing New Packages (p. 24)

    Chapter 5: Preparing Your Data & Reading it into RStudio (p. 27)
        1.0 Introduction (p. 27)
        2.0 Single X-Y Pairs (p. 27)
        3.0 Data with Multiple Y for Each X (p. 30)

    Chapter 6: Basic Plotting in RStudio (p. 31)
        1.0 Introduction (p. 31)
        2.0 Basic Interactive Scatterplots (p. 31)
        3.0 Basic Interactive Histograms (p. 39)

    Chapter 7: Simple Linear Regression: Plot, Fit & Check (p. 42)
        1.0 Introduction (p. 42)
        2.0 Checking the Assumptions: Rubber Abrasion Data (p. 44)

    Chapter 8: Further Analysis 1: Regression ANOVA & Hypothesis Tests (p. 61)
        1.0 Introduction (p. 61)
        2.0 Simple Linear Regression ANOVA in R: Rubber Abrasion Data (p. 62)

    Chapter 9: Further Analysis 2: Leverage & Cook's Distance (p. 65)
        1.0 Introduction (p. 65)
        2.0 Leverage (p. 66)

  • 3

        3.0 Cook's Distance (p. 69)
        4.0 R's Built-In Diagnostic Analyses (p. 72)

    Chapter 10: Making Non-Linear Data Linear: Transformations (p. 76)
        1.0 Introduction (p. 76)
        2.0 How Do We Transform? (p. 77)
        3.0 Transformation Guidelines & Techniques (p. 78)
        4.0 Transforming & Fitting the Cement Strength Data (p. 79)

    Chapter 11: Saving it All as an RStudio Script File & Plot Aesthetics (p. 87)
        1.0 Introduction (p. 87)
        2.0 The R Script Window (p. 88)
        3.0 Reproducing the Basic Plot in an RStudio Script (p. 88)
        4.0 Modifying Titles & Legends (p. 95)
        5.0 Linear Fitting of Data (p. 108)
        6.0 Scatterplot Sample Script for Rubber Abrasion Data (p. 110)

    Chapter 12: Plotting Multiple Sets of Data (p. 116)
        1.0 Introduction (p. 116)
        2.0 Cu Wire Data Set (p. 116)
        3.0 Multiple Sets of Data with Different Explanatory Variable Values (p. 129)

    Chapter 13: The R Graphics Environment & Graph Customization (p. 131)
        1.0 Introduction (p. 131)
        2.0 How R Divides Plots into Regions (p. 131)
        3.0 High & Low Level Plot Functions and Commands (p. 133)
        4.0 Plot Customization (p. 136)

    Chapter 14: R Markdown: A Very Simple Primer (p. 147)
        1.0 Introduction (p. 147)
        2.0 Installing & Using R Markdown for the First Time (p. 147)
        3.0 A Simple Example: Rubber Abrasion (p. 160)
        4.0 Knit Word & Knit PDF (p. 164)

    Appendix 1: Rubber Abrasion Data (p. 165)

    Appendix 2: Cement Strength Data (p. 166)

    Appendix 3: Cu Wire Data (p. 167)

  • 4

    Stuff

    Copyright 2015 by ProMat Consultants. All Rights Reserved. Diving Into R and the Diving Into R logo © ProMat Consultants 2015.

    The information provided within this eBook is for general informational purposes only. While we try to keep the information up-to-date and correct, there are no representations or warranties, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or related graphics contained in this eBook for any purpose. Any use of this information is at your own risk.

  • 5

    Preface

    R is a very powerful statistical programming language used by academics and professional statistical scientists, and it's free! But like many programming languages it's perceived to be more difficult than it actually is. There can be a steep learning curve with R, but it can also get you useful results quickly if you know how. One of the things R does really well is simple linear regression analysis and graphing. When you know how to get what you want with R, it will motivate you to go deeper into its capabilities. You'll get better at it by using it. This book has a very clear purpose, which is to help complete R novices to:

    1. Plot a graph and control the appearance to get publication quality output.

    2. Fit the data using simple linear regression (SLR).

    3. Get the statistical summary of the data.

    4. Do some basic interpretation.

    5. Integrate R graphs and analyses into reports using R Markdown.

    This is an ebook for R beginners and people who are busy (working) and want to get a grip on how to make scatterplots and do SLR quickly. This book will give you:

    - The basic skills to prepare, read and plot your data, fit the data, do some basic interpretation and produce publication quality graphics using R.

    - Further interest in learning R (we hope).

    - The motivation to further understand the statistical science behind linear regression (if you don't already know it).

  • 6

    Some Reasonable Assumptions

    You know how to use a computer, a word processor and a spreadsheet.

    You are familiar with the very basics of the simple linear regression model and the concept of fitting a straight line to data.

  • 7

    Chapter 1 Download & Install R & RStudio

    1.0 Introduction

    R can be used as a standalone program and doesn't require any special interface. However, as powerful as R may be, its basic user interface is not very user friendly.

    Instead of using R as is, we're going to use R via a user interface called RStudio, which is a more convenient way to input commands, write simple programs and get results.

    This chapter explains how to download and install R and RStudio.

  • 8

    2.0 Downloading R

    R is freely available for Windows, Mac and Linux and can be downloaded at the following site: http://www.r-project.org/

    R for Windows can be downloaded as follows:

    1. Go to http://www.r-project.org/

    2. The screenshot below shows the website. Click on CRAN.

    3. Choose a download site. They're listed by country. Scroll down to find more locations. Choose your preferred location and click on the link.

  • 9

    4. For Windows users click on Download R for Windows:

    And you'll get to the screen below:

    5. Click on base and you'll be taken here:

  • 10

    6. Click on Download R 3.X.X for Windows and your download will start. When the file has been downloaded, open it and install R.

    For Mac, at step 4 choose Download R for Mac OSX; the rest of the procedure is similar.

  • 11

    2.0 Downloading RStudio

    RStudio is a free GUI for use with R. Before you can use RStudio you must download and install R. RStudio can be found here: http://www.rstudio.com/

    Click on Download RStudio and you'll be taken here:

  • 12

    Then click on Desktop and you'll get to this screen.

    Now click on Download RStudio Desktop and you'll come to this screen:

    Choose the appropriate version (Windows, Mac etc.) and RStudio will download. Open it and follow the installation instructions.

  • 13

    Chapter 2 The RStudio Environment

    1.0 Introduction

    This short chapter explains how RStudio is laid out.

    2.0 The RStudio Layout

    When you installed RStudio it placed an icon on your desktop. Double click on that icon and you should see something like the screenshot shown below.

  • 14

    There is an additional window that we'll use later called the script window. It can be opened by going to File > New File > R Script (shown below).

    Then you'll see this:

    The script window can be closed by clicking on the x.

  • 15

    Individual windows (script, console) can be resized by placing the cursor between the windows to show a crosshair and dragging up/down. The whole window can be resized by placing the cursor at the bottom right corner and dragging diagonally OR placing the cursor at the right side of the window and dragging left/right (see next figure).

  • 16

    2.1. The Script Window

    When you want R to do a lot of things in one go, like read in data, calculate an average and plot a graph, you can put all of these commands as one body of text in the script window. We'll get into this later. In the meantime, the picture below shows an example of a script that imports a specific set of data and plots a graph. The data set that is analysed is shown right.
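    To give a rough idea of what such a script might contain, here is a minimal sketch. It is not the script from the screenshot: the file, the column names x and y, and the values are all made up for illustration, and the data is written to a temporary file so that the example runs on its own.

```r
# Hypothetical example script: read data, calculate an average, plot a graph.
# The file and its values are invented for illustration only.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2.1", "2,3.9", "3,6.2", "4,7.8"), csv_file)

dat <- read.csv(csv_file, header = TRUE)   # read in the data
y_mean <- mean(dat$y)                      # calculate an average
plot(dat$y ~ dat$x)                        # plot a graph
print(y_mean)
```

    Running all of these lines together from the script window does the same job as typing them one at a time in the console.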

    3.2 The Console/Command Window

    The console window is where you work with RStudio interactively, one command at a time. Interactive use of RStudio is useful when you have some data that you want to look at but you're not sure exactly what you want to do with it. So you might read in the data, plot specific variables and see how they look etc. The console window is also where you can get help on R commands and where you can check your working directory (we'll go into that later).

    3.3 The Plot/Utilities Window

    The plot/utilities window has multiple functions. One of the most important for you is the Plots sub-window where your graphs will appear.

  • 17

    Another important sub-window is Packages because it contains R packages that can execute specific statistical tasks. This is something we'll look at later.

    3.4 The Environment/History Window

    As the name implies, it shows a history of what you've done in R, like files loaded etc. We won't deal much with this window.

  • 18

    Chapter 3 The Working Directory

    1.0 Introduction

    The working directory in R is the folder where R looks for your data and R or RStudio files. When R is installed it produces a default working directory that you can change whenever you want. You'll probably want to keep your RStudio files and data in specific folders that belong to an R working directory that you create. This chapter shows you how to find and specify the working directory.

    2.0 Create Your Own RStudio Folder

    You need a separate folder and sub-folders for your RStudio files and data. I created a folder called R_Main_WD which has a bunch of sub-folders that contain various RStudio files and data. You can see that on the right. Create your own folder for RStudio data now. Depending on the data you're

  • 19

    looking at, one of these sub-folders will be your working directory, and you'll change working directories as you change the data you're looking at. But first, what is your current working directory?

    3.0 Find the Current Working Directory

    This is your first interactive use of RStudio. In the console window type the following and hit the enter key:

    getwd()

    This is what you'll see, except it'll be specific to your computer. It tells you the current working directory.

    Your working directory will be the default used when R and RStudio were installed because you haven't specified it yet.

    4.0 Change the Working Directory

    You should have made your own R folder on your computer, which will be the working directory. To find and select your working directory proceed as shown right.

  • 20

    Clicking on Choose Directory will let you choose your working directory. After you've changed the working directory, check it to make sure it has changed (section 3.0 in this chapter). In the next chapter we'll look at data preparation and you'll want to save the data in your working directory folder.
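    The same thing can also be done by typing commands in the console. The sketch below uses R's setwd() function together with getwd(). The folder used here is just R's temporary folder, chosen so that the example runs on any computer; the path in the comment is a hypothetical example, and you would put your own folder's path there instead.

```r
# Show the current working directory.
old_wd <- getwd()
print(old_wd)

# Change to another folder. tempdir() is only a stand-in here; replace it
# with the path to your own folder, e.g. "C:/R_Main_WD/rubber" (hypothetical).
setwd(tempdir())
print(getwd())    # check that the working directory has changed

# Go back to where we started.
setwd(old_wd)
```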

  • 21

    Chapter 4 R Packages

    1.0 Introduction

    The basic version of R can perform many useful functions but there are many add-on packages designed to perform specific statistical analyses. These packages can be installed for free. This chapter shows how to install packages.

  • 22

    2.0 The Packages Window

    The plot and utilities window contains a sub-window called Packages. Figure 1 below shows an example. Yours may not look like this unless you've installed some of the same packages. A package can be selected by ticking the check box.

    Figure 1. Packages window showing some of the available add-on R packages. These packages are listed because they are available, but many of them are not default packages; they were installed after the initial download and installation of R.

  • 23

    3.0 Default Packages

    The default packages are:

    - datasets
    - utils
    - grDevices
    - graphics
    - stats
    - methods

    In Figure 1 you'll notice that the titles of each package are in blue text. Clicking on these titles takes you to the help page where you can find documentation about each specific package. If we click on graphics you'll see the page shown in Figure 2. This page contains a list of commands or functions that can be used in the graphics package.

    Figure 2. Documentation on the graphics package

  • 24

    You'll be plotting scatterplots and histograms later on using the plot and hist commands from the graphics package, which you can see in the help window when you scroll further down. All of the examples of simple linear regression, histograms and data fitting in this text can be performed using only the default packages.

    4.0 Installing New Packages

    This text does not require that you install any additional packages, but if you're interested and you want to explore the wide variety of available packages and perhaps try downloading some of them, you can find a complete list at: http://cran.r-project.org/web/packages/available_packages_by_name.html

    The page is shown below:

    If there happens to be a package that you want to install, it's easy using RStudio. Click on the Packages sub-window in the Plots/Utilities window and then click on Install:

  • 25

    Let's say we want to install the abc package. Simply type in abc:

  • 26

    The download progress is shown in the console window:

    And the newly installed package (unchecked) is listed in the packages window:

    To activate a package for use you need to check the box.
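    The point-and-click steps above have console equivalents: install.packages() installs a package and library() activates it. Below is a minimal sketch. The helper install_if_missing is our own hypothetical name, not a function that comes with R, and it is called here with the stats package (which ships with R) so that the example runs without downloading anything.

```r
# Hypothetical helper (not part of R): install a package only if it is missing,
# then load it so that its functions can be used.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs an internet connection
  }
  library(pkg, character.only = TRUE)   # same effect as ticking the check box
}

# "stats" ships with R, so this call installs nothing and simply loads it.
install_if_missing("stats")

# For the example in the text you would type instead:
# install_if_missing("abc")
```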

  • 27

    Chapter 5 Preparing Your Data & Reading it into RStudio

    1.0 Introduction

    There are several ways to get data into R. I'm just going to show one way because I'm assuming you just want to get the data into R and plotted. The way used here is to put your data into a spreadsheet and save it as a csv file.

    2.0 Single X-Y Pairs

    2.1. The Spreadsheet

    The graph in Chapter 2 was plotted using data from O.L. Davies and P.L. Goldsmith [REF]. The data is the mass of rubber lost when rubber of specific hardness is abraded. It's shown right in an Excel spreadsheet (and in Appendix 1). When you get data from measuring instruments it often comes in the form of a text file or a csv file. If it comes as a text file save it as a csv, but if it comes as a csv file you obviously don't need to do anything with it. But if you are making separate measurements of many samples you may not have the option of saving the data in one file. Then you have to take

  • 28

    the individual values and cut and paste them into one spreadsheet that you save as a csv file in Excel. Otherwise you need to type each value into the spreadsheet, particularly if you're taking that data from an older textbook. One thing to notice about the data in the picture is that there are no duplicate values of x, so there are unique x-y pairs. It is important that each column has a name. We call the name of each column a header. This is important in R. Header names should be short because when we explore a set of data in R we have to type commands. Type the data in Appendix 1 into a spreadsheet and save it as a csv file in YOUR WORKING DIRECTORY now. You must give it a unique name, for example rubber.csv.

    2.2. Read the csv file into R

    Make sure you select the correct working directory, the one where your file is located (section 4.0, Chapter 3). The command to get a csv file into R is simple:

    read.csv("filename.csv", header=TRUE)

    So in the RStudio console window (the interactive window), type the command and you should see the result on the right. What's wrong with this? When you read in the csv file it displays all of the data, and that's messy. A better way is to assign a name to the result of the command that reads in the data. That name should be unique and will always be associated with the data read from the csv file unless you reassign the name. We need to modify the command a little.

  • 29

    TIP: rather than retyping commands you want to modify, press the up arrow (^) key on your keyboard. This will bring up the last command you entered (if you keep pressing ^ it will cycle through previous commands too). After pressing ^ you should see

    This command can be edited. To modify the command press ^ and give the result a unique name. I chose rub1:

    rub1 <- read.csv("rubber.csv", header=TRUE)
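    Here is a small self-contained sketch of the whole step. Because your own rubber.csv may not exist yet, the sketch first writes a stand-in file to R's temporary folder. The column names hard and loss are the ones used later in this text, but the numbers are invented for illustration and are not the Appendix 1 data.

```r
# Create a small stand-in for rubber.csv in a temporary folder.
# The column names follow the text (hard, loss); the values are invented.
csv_file <- file.path(tempdir(), "rubber.csv")
writeLines(c("hard,loss", "45,370", "55,210", "61,175", "66,155", "71,140"),
           csv_file)

# Assign the data to a name so that nothing is printed on reading.
rub1 <- read.csv(csv_file, header = TRUE)

print(head(rub1))   # show the first few rows
str(rub1)           # show the column names and types
```

    With your own file saved in the working directory, the file name alone (in quotes) is enough and the temporary folder is not needed.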

  • 30

    3.0 DatawithMultipleYforEachX

    3.1. The Spreadsheet

    What if you have multiple y data for each x? Then the data presentation is a little different. The data is STACKED. The spreadsheet on the right is an example of cement strength at different times. There are three measurements for each time. The way we deal with this is to repeat the same time value down the column, each with its corresponding y value. Simple. Note that as a reminder of the units of each variable, the notes in column C were added. The csv file is read into R in exactly the same way as shown earlier.
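    To make the stacked layout concrete, the sketch below builds a small stacked table directly in R. The column names time and strength and all of the numbers are hypothetical; they only illustrate the shape of the data and are not the cement strength data from Appendix 2.

```r
# Stacked layout: each time value is repeated once per measurement.
# Column names and strength values are hypothetical.
cement <- data.frame(
  time     = rep(c(1, 3, 7), each = 3),
  strength = c(12.1, 13.0, 12.6, 20.4, 21.1, 19.8, 27.5, 28.3, 26.9)
)
print(cement)

# Number of measurements at each time
print(table(cement$time))
```

    Each row is still one x-y pair; the only difference from the rubber data is that x values repeat.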

  • 31

    Chapter 6 Basic Plotting in RStudio

    1.0 Introduction

    We make scatterplots because we want to see whether changes in the response variable (plotted on the y-axis) are related to controlled changes in the explanatory variable (the one that is fixed or controlled, plotted on the x-axis). Scatterplots allow us to visualise the relationship, and linear relationships are very common. But first we have to plot, and making a scatterplot of data with R is what this chapter is about. R has several graphics packages. For this introduction to plotting we'll be using the base graphics, i.e. the basic graphical capabilities that come with R when it's installed.

    2.0 Basic Interactive Scatterplots

    When we start analysing new data it's usual in R to work interactively using the console. The reason for this is that we don't yet know exactly what we're going to do with the data, so at this stage we just want basic scatterplots that show the relationship (if any) between the variables. We don't want to waste time with elaborate labels and legends or any other frills. We only want to make the plots easy to view. Modifying the axis labels, adding legends etc. is covered in Chapter 11.

  • 32

    The most basic command for making a scatterplot in RStudio is very simple:

    plot(y~x)

    Here y represents the name of the column of data that you want on the y-axis and x the column of data you want on the x-axis. Note the order: it's y first, then x. Let's plot some data.

    2.1. Basic Interactive Plotting: Rubber Abrasion Data

    Let's look at the rubber data from the last chapter. Let's read in the data via the console and check it using the head command:

    Columns in R are referred to by their header names using the $ operator. To choose a specific column to plot we have to specify the name of our data (rub1) followed by the y data ($loss) and the x data ($hard). In the console we can type

    plot(rub1$loss~rub1$hard)

    And the plot window will display this (make sure you've clicked on the Plots sub-window):

  • 33

    Just like with the reading of the data, we want to assign the plot a unique name so that we can do things to it with other R commands. So let's call the plot rub1plot:

    rub1plot <- plot(rub1$loss~rub1$hard)

    The plot is exactly the same as shown above.

    TIP: you can reduce the amount of typing by assigning a unique name to each column. Here I've assigned l1 to loss and h1 to hard.
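    The sketch below strings these steps together so you can try them in one go. The data frame is a stand-in with invented values (only the column names hard and loss come from the text), and the short names l and h are an arbitrary choice that matches the commands used later in this chapter. The last line shows an alternative form, using the data argument of plot(), which is not used in the text but avoids typing rub1$ twice.

```r
# Stand-in data; values invented, column names as in the text.
rub1 <- data.frame(hard = c(45, 55, 61, 66, 71, 81, 86),
                   loss = c(370, 210, 175, 155, 140, 60, 55))

print(head(rub1))                # check the data
plot(rub1$loss ~ rub1$hard)      # y first, then x

# Shorter names save typing (l and h are arbitrary choices).
l <- rub1$loss
h <- rub1$hard
plot(l ~ h)

# Equivalent form: name the columns and tell plot() where to find them.
plot(loss ~ hard, data = rub1)
```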

  • 34

    2.2. Changing the X and Y Scales

    The x and y scales can be changed quite easily using the following commands:

    xlim=c(a,b) ylim=c(a,b)

    Interactive R commands are shown next together with the rescaled graph.

  • 35

    You've probably noticed that there's always a little extra space before and after each axis, commonly called padding. If you don't want that, there is a way to set the axis limits exactly using:

    xaxs="i", yaxs="i"

    This results in the exact x and y limits. Below we set the limits of x from 30-100 and y from 50-400.
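    As a sketch of how these options fit inside the plot command, the example below uses stand-in data with invented values and the limits quoted above. The call par("usr") asks R for the actual extent of the plotting region, which is a handy way to confirm that the padding has gone.

```r
# Stand-in data; values invented, column names as in the text.
rub1 <- data.frame(hard = c(45, 55, 61, 66, 71, 81, 86),
                   loss = c(370, 210, 175, 155, 140, 60, 55))

# Set the axis ranges, and use style "i" so the axes stop exactly at the limits.
plot(loss ~ hard, data = rub1,
     xlim = c(30, 100), ylim = c(50, 400),
     xaxs = "i", yaxs = "i")

# Extent of the plotting region: x min, x max, y min, y max.
usr <- par("usr")
print(usr)
```

    Leaving out xaxs and yaxs gives the default behaviour, where R extends each axis a little beyond the limits you asked for.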

  • 36

    2.3. Modifying the Plot Character and Colour

    The default open circles in the last plot are not very easy to see. We can change the plot symbol and the colour of the character within the plot command. The way the plot() command works is that you first state what you want to plot, and the plot can then be modified by other commands that follow. To change symbol and symbol colour, specific commands are typed after the basic plot command:

    plot(l~h, pch=number 0-25, col=colour)

    pch stands for plot character, col stands for colour, and they are separated by a comma. There are other things that can be specified too, including the x and y-axis titles, the symbol size etc., which we'll get to soon. But as mentioned earlier, we don't want to waste time on elaborate details when we're exploring data. For now let's just change the plot character (symbol) and the colour.

  • 37

    The command that controls the plot character, i.e. open circles, filled circles, etc., is

    pch=some number 0 to 25

    So for example we can write

    pch=19

    So to plot our abrasion loss graph with black circles:

    plot(l~h, pch=19)

    Symbol colour is specified either by name (in quotes) or by a number. To get red circles:

    plot(l~h, pch=19, col="red")

    OR

    plot(l~h, pch=19, col=34)

    The graphs on the next page show examples.

  • 38

    A list of symbol numbers and the corresponding symbol type is shown below:

    Colours can be specified by name or by number. The figure below shows a list of numbers, a chart created by Earl F. Glynn at the Stowers Institute for Medical Research.
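    If you don't have the charts to hand, R can draw its own. The sketch below plots each symbol number from 0 to 25 with its number underneath, then prints a few of the colour names R knows. The grid layout, symbol size and colours are arbitrary choices for illustration. As far as we can tell, a number passed directly to col is looked up in the current palette (shown by palette()), whereas colour names come from the list returned by colors(), so names are the safer choice while exploring.

```r
# Draw each plot character 0 to 25 with its number underneath.
pch_numbers <- 0:25
x <- (pch_numbers %% 9) + 1      # column position in a 9-wide grid
y <- 3 - (pch_numbers %/% 9)     # row position, top row first
plot(x, y, pch = pch_numbers, col = "red", bg = "yellow", cex = 2,
     xlim = c(0.5, 9.5), ylim = c(0.5, 3.5),
     xlab = "", ylab = "", axes = FALSE)
text(x, y - 0.3, labels = pch_numbers)

# Colour names known to R; any name in this list can be used in col=.
print(head(colors()))
print("red" %in% colors())

# Numbers given directly to col= index the current palette instead.
print(palette())
```

    The bg colour only affects symbols 21 to 25, which have a separate fill.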

  • 39

    3.0 Basic Interactive Histograms

    In simple linear regression we'll sometimes need to check the distribution of certain data, so we need to be able to plot histograms. We'll plot a histogram of the loss data from our rubber abrasion data set. You need to have read in the rubber data. The most basic command for making a histogram in RStudio is very simple:

    hist(data)

    where data is the name of the column of data you want to plot. The loss data is in the column rub1$loss so we can type

    hist(rub1$loss)

    Doing so results in the histogram shown below. But remember this assumes you have already read in the data and assigned it to rub1.

  • 40

    It is better to assign a name to the creation of a histogram of a particular set of data, so we can call it something like:

    losshist <- hist(rub1$loss)
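    One benefit of naming the histogram is that R stores the numbers behind the picture, which you can then look at. The sketch below uses stand-in data with invented values (only the column names come from the text). The breaks and counts components are part of what hist() returns.

```r
# Stand-in data; values invented, column names as in the text.
rub1 <- data.frame(hard = c(45, 55, 61, 66, 71, 81, 86),
                   loss = c(370, 210, 175, 155, 140, 60, 55))

# Draw the histogram and keep what hist() returns.
losshist <- hist(rub1$loss)

print(losshist$breaks)   # the edges of the bars
print(losshist$counts)   # how many values fall in each bar
```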

  • 41

  • 42

    Chapter 7 Simple Linear Regression: Plot, Fit & Check

    1.0 Introduction

    This chapter considers testing the assumptions about the simple linear regression model. If you're not familiar with them you should consult a good textbook like Montgomery [1]. In simple linear regression analysis a fitted line has an equation, or regression function, as follows:

    $\hat{y}_i = \alpha + \beta x_i$    (1)

    In this equation $\alpha$ and $\beta$ are constants: $\alpha$ is the value of $y$ when $x = 0$ and $\beta$ is the slope of the line. As shown in the figure below, $\hat{y}_i$ is the fitted value corresponding to each data point, and the difference between the fitted and observed values is $r_i = y_i - \hat{y}_i$.

    [Figure: fitted line $\hat{y}_i = \alpha + \beta x_i$, showing an observed value $y_i$ and the corresponding fitted value $\hat{y}_i$]

    [1] Introduction to Linear Regression Analysis (5th Edn.). D. C. Montgomery, E. A. Peck and G. G. Vining. Wiley (2012).

  • 43

    The difference between an observed and a fitted value is called a residual, and is given for each observed data point by:

    $r_i = y_i - \hat{y}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)$    (2)

    There is of course a residual for each data point, with some larger than others. The basic assumptions about the simple linear regression model are stated below and we're going to test how well the rubber abrasion data meets these conditions:

    Simple Linear Regression Conditions

    1. Linear relationship between yi and xi: The mean of the response variable Yi depends on the value of xi in a linear manner.

    2. Normal Distribution of the Random Error Variance: The variation of the response variable Yi (for a given xi) about its mean value is represented by a random variable Wi, which is normally distributed about a mean value of 0 with variance $\sigma^2$.

    3. Constant Random Error Variance: The variance of the random error is constant and is $\sigma^2$, which is approximated by the residual variance $s^2$.

  • 44

    4. Independence of the Random Error: The deviations of the response from the mean are supposed to be purely random. So the random errors for Y1 should not be dependent on the random errors for Y2.

    2.0 Checking the Assumptions: Rubber Abrasion Data

    2.1. Linear Relationship of Yi against xi

    Let's go back to the rubber abrasion data. First plot the data (don't forget to choose the working directory and read in the data first).

    Do the data look like they might lie along a straight line? It seems so and it appears justifiable that as rubber hardness increases abrasion loss decreases. The command to perform linear regression on the data is this:

    lm(y~x)

    We have to specify y and x data (as we did with plotting). We can simply type the command in the console but, just as we did with reading in the data and plotting, it's good to call the linear regression of this data

  • 45

    something, so that we can do things to it later and generally keep track of what we've done. So let's call it rub1fit and type the following into the console and press enter:

    rub1fit

  • 46

    This is the most basic of results. What it shows is this:

    1. The (Intercept) is $\hat{\alpha}$, the value of loss (y) when hardness (x) is 0.

    2. h is $\hat{\beta}$, the slope of the line, which is negative as we'd expect.

    Now to get the fitted line type the following command:

    abline(rub1fit)

    So we have:

    And the plot of l versus h with the fitted line is shown below.
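    The fit-then-draw sequence can be sketched as follows. The data frame rub1 here is simulated (an assumption, so that the example is self-contained); with the real data you would read the file in first and the numbers would differ.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    h <- rub1$hard
    l <- rub1$loss

    rub1fit <- lm(l ~ h)   # fit loss (y) against hardness (x)
    rub1fit                # prints the intercept and the slope
    coef(rub1fit)          # the same two numbers as a named vector

    plot(l ~ h, pch = 19)  # draw the scatterplot first ...
    abline(rub1fit)        # ... then add the fitted line to it
    ```

    abline() adds to an existing graph, so the plot() call has to come before it.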

  • 47

  • 48

    2.2. Normal Distribution of the Random Error Variance

    Residuals approximate the random error in linear regression models and because the random error is assumed to be normally distributed with mean zero and variance $\sigma^2$, the residuals should have the same properties. We assess this by plotting a histogram of the residuals and by producing a normal probability plot of the residuals. There are several types of residuals. The two most well known are:

    Raw Residuals:

    $r_i = y_i - \hat{y}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)$

    These residuals have the same units as the response variable. So for example if the response variable is in grams per hour (rubber abrasion loss) then those are also the units of the residual. Raw residuals are fine, but you can't use them to compare abrasion loss residuals from different experiments with different units.

    Standardised Residuals: A standardised residual is the individual residual divided by the standard deviation of all the residuals.

    $r_i^s = \dfrac{y_i - \hat{y}_i}{\sqrt{RSS/(n-2)}} = \dfrac{y_i - (\hat{\alpha} + \hat{\beta} x_i)}{\sqrt{RSS/(n-2)}}$

    Here RSS is the residual sum of squares and n is the number of data points.

    There are other residual types and they are important, but we are considering the essential basics of simple linear regression and we won't get into those types of residual here.
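    To make the two definitions concrete, the sketch below computes raw residuals and rescales them by the residual standard deviation, again on simulated stand-in data (an assumption). It also compares the result with R's rstandard(), which goes one step further than the formula above and divides each residual by sqrt(1 - leverage) as well, so the two sets of numbers are close but not identical.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    raw.resid <- resid(rub1fit)        # raw residuals, in the units of loss
    n   <- length(raw.resid)
    RSS <- sum(raw.resid^2)            # residual sum of squares
    s   <- sqrt(RSS / (n - 2))         # residual standard deviation

    scaled    <- raw.resid / s         # raw residual divided by s
    res.stand <- rstandard(rub1fit)    # R's standardised residuals

    # rstandard() also divides by sqrt(1 - leverage); check that relationship
    all.equal(unname(scaled / sqrt(1 - hatvalues(rub1fit))), unname(res.stand))
    ```

    For points with small leverage the extra factor is close to 1, which is why histograms of the two versions look almost the same.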

    2.2.1. Histogram of Raw Residuals

    Obviously if the residuals are normally distributed the histogram should be approximately normal in shape, i.e. symmetrical with a single peak. The procedure for plotting a histogram was shown in Chapter 6. The figure below is a plot of the abrasion loss residuals. Remember it is the raw residuals of the simple linear regression fit that we're plotting.

  • 49

    Is the plot normal? Well, not really and in fact it appears to be slightly skewed to the right. However, the number of observations is small and we should not expect perfect normality.

    2.2.2. Histogram of Standardised Residuals

    We can also plot the standardised residuals and now the scale of the histogram is in terms of standard deviations from the mean of zero. This is more useful in the sense that we know that for a normal distribution about 95% of the data should lie within 2 standard deviations of the mean and about 99.7% of values should lie within 3 standard deviations of the mean. Data that lie outside of 3 standard deviations may be outliers. Outliers are data that may arise because of measurement errors, incorrect recording of data or some other cause. We may discard outliers but not without a good reason. The standardised residuals for the rubber abrasion data are shown below. The shape of the histogram is the same as the raw residual histogram and we can see that all of the data lies within 2 standard deviations of the mean.

  • 50

    Outliers are characterised by having unusual values of the response variable and can often be identified because they have unusually large residuals.

    2.2.3. Normal Probability Plot

    Another way to check the normality of the residual data is to plot a probability plot. This plots the quantiles of the observed data against the theoretical quantiles of the normal distribution. If the data is normal it will lie along a straight line drawn through the 1st and 3rd quartiles. Deviations from the straight line indicate deviation from normality. The R command for plotting a normal probability plot is:

    qqnorm(data)

    The data we want to plot is the residual data, either the raw or standardised residuals.

  • 51

    The standardised residuals for the rubber abrasion loss are stored under the name res.stand (see the section on standardised residuals versus fitted values). If we type

    qqnorm(res.stand)

    we get a plot of the observed (sample) quantiles against the theoretical quantiles but without a line and with the standard open circle symbol. The qqnorm() command is a plotting command and we can, just as with plot(), tell R to change the symbol type and colour within the brackets like so to get the graph shown below:

    qqnorm(res.stand, pch=19, col="red")

    What about the line? We need a separate command to get the line for probability plots, which is:

    qqline(data)

    where of course in this case, data is res.stand. The probability plot with the line is shown below:
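    The histogram and probability plot checks can be run together as sketched here. The data are simulated stand-ins (an assumption), so the plots will not reproduce the figures in this chapter.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    res.stand <- rstandard(rub1fit)                  # standardised residuals

    hist(res.stand)                                  # shape check
    qq <- qqnorm(res.stand, pch = 19, col = "red")   # points only, no line
    qqline(res.stand)                                # line through the 1st and 3rd quartiles
    ```

    qqline() must be called after qqnorm() because it adds the line to the existing plot.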

  • 52

    Now look at the data and the line. The lower and upper quantiles of the data do not lie along the line while the data in between is closer to the line. So the smaller and larger data values deviate from normality quite a lot. The distribution of residuals is not normal but then the deviation from normality is not drastic. The normal probability plot is also a way to detect outliers. While the histogram of standardised residuals showed that the data was within 2 standard deviations either side of zero the probability plot shows the actual values of the most extreme points to be a little over 2, but not much.

  • 53

    2.3. Constant Random Error Variance

    Remember that we said earlier the residuals estimate the random errors. So some type of analysis of the residuals could indicate if the residual variance, and therefore the random error variance, is constant. The way to check that residual variance is constant is to plot each residual $r_i$ against the corresponding explanatory variable $x_i$ or against the fitted value $\hat{y}_i$.

    2.3.1. Ideal Residual Pattern

    How should the residuals appear when they satisfy the constant variance condition? They should be evenly distributed around a mean value of zero, with a spread that does not change across the plot.

    2.3.2. Unacceptable Residual Patterns

    There are many residual patterns that indicate problems with the data but we'll concentrate on just two. The first pattern is the outward funnel pattern shown below.

  • 54

    The characteristic of this residual plot is that the variance of the residuals increases as the response variable increases. Another residual pattern that may be observed is a non-linear pattern like the one shown below.

    When these patterns occur the data violate the condition of constant variance. The fundamental way to deal with these patterns is to transform the explanatory variable (preferably) and/or, only if absolutely necessary, the response variable. We'll deal with an example of transforming data later.

  • 55

    2.3.3. Raw Residuals versus Fitted Values in R

    Let's plot these for the abrasion loss data. You need to have read in the data, plotted the data and done the fitting. So you should be starting at this position:

    Now we need to access the residuals and the fitted values of the response variable, which can be easily done with R. The command for the raw residuals is simple:

    resid(name of fit)

    By name of fit we just mean the name assigned to the data fitting command, which was rub1fit. If we type

    resid(rub1fit)

    we'll see the residuals as a list in the console:

  • 56

    But as usual we want to assign a name to the list of residuals so that it's stored and we can recall it when we want, so let's call it:

    raw.resid

  • 57

    Overall the raw residuals appear evenly distributed about zero, which satisfies the requirements of the simple linear regression model.
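    A sketch of the residuals versus fitted values plot follows, using simulated stand-in data (an assumption). The names raw.resid and rub.fitted follow the text; fitted() is the R function that returns the fitted values.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    raw.resid  <- resid(rub1fit)    # raw residuals
    rub.fitted <- fitted(rub1fit)   # fitted values of the response

    plot(raw.resid ~ rub.fitted, pch = 19)
    abline(h = 0, lty = 2)          # dashed reference line at zero

    sum(raw.resid)                  # essentially zero, apart from rounding error
    ```

    The dashed line at zero makes it easier to judge whether the points are spread evenly above and below it.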

    2.3.4. Standardised Residuals versus Fitted Values in R

    Assuming now that you have read in the rubber abrasion data, plotted the data and fitted it using the same names, let's plot the standardised residuals. The R command for standardised residuals is

    rstandard(name of fit)

    So, as before, let's assign a name to the standardised residuals and obtain a list of them:

    res.stand

  • 58

    A plot of res.stand against the fitted values rub.fitted gives the plot below, which is the same as the raw residuals plot earlier except that the residuals have now been rescaled.

  • 59

    2.4. Independence of the Random Error

    The residuals are reasonable estimates of the random errors but the truth is that the residuals are not truly independent. This is because if you calculate all the residuals except one, you can find the uncalculated residual from the simple fact that the sum of the residuals is zero. But still the residuals are reasonable estimates of the random errors.

    While the residuals are not truly independent for the reason mentioned above, there is another kind of independence that can be checked. Residuals when plotted against time or the fitted values should not show regular patterns. For example, if the residuals versus fits results in a straight line, that implies a linear relationship between the residuals that in turn suggests they are not independent.

    It's not known if the abrasion loss data were obtained in the time order in which they are listed in the data file. Even so, if we assume the data was obtained in a sequence we can just plot it and see if there is a trend. This is called an index plot and it just shows the residuals for each point in the order in which they appear in the data file. So the index plot is actually a plot of residuals versus the number of the data point (1, 2, ... to n). To plot the abrasion loss residual data as an index plot simply type:

    plot(rstandard(rub1fit), pch=19, col="blue")

    The graph is shown in the next figure and it shows that from the 15th data point, in ascending order (presumably time), the size of the standardised residuals is increasing. This is a clear trend that is cause for concern. The conclusion from this plot is that there is reason to believe that the residuals (random error estimates) are not independent.

  • 60

  • 61

    Chapter 8 Further Analysis 1:

    Regression ANOVA & Hypothesis Tests

    1.0 Introduction

    The fitting procedure for simple linear regression in R gives much more information than the intercept and slope of the straight line. But the additional information is not shown unless the user asks for it. In this chapter we show how to get the additional information and how to use it. That information is Analysis of Variance or ANOVA for short. Contrary to what many people believe, analysis of variance is not about analysing variance, it's about analysing mean values. In fact, one of the conditions necessary to use the methods of ANOVA when analysing experiments is that while the mean value of a response variable may vary when an explanatory variable is changed, the variance of the response variable should be more or less constant.

  • 62

    2.0 Simple Linear Regression ANOVA in R: Rubber Abrasion Data

    Now let's return to the rubber abrasion data, fit the data to a linear regression model and obtain the ANOVA results and the fitting results including the fitting constants and the hypothesis test results. The R command for performing ANOVA is simple:

    anova(name of fit)

    The interactive R input below calls the fit to the data rub1fit and so to get the anova for rub1fit we simply type:

    anova(rub1fit)

    Now look at the ANOVA data below. The F value is large and Pr(>F) is very small, which, when interpreted using Table 6, leads us to reject the null hypothesis that the slope is zero. In other words we conclude that the slope is significant and that there is a relationship between the response and explanatory variables.
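    The ANOVA table can also be stored and its entries picked out by name, as in this sketch. The data are simulated stand-ins (an assumption), so the F value and probability will differ from those for the real rubber data.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    rub1anova <- anova(rub1fit)   # rub1anova is a hypothetical name
    rub1anova                     # the full ANOVA table

    rub1anova[["F value"]][1]     # F statistic for the regression
    rub1anova[["Pr(>F)"]][1]      # its significance probability
    ```

    A small Pr(>F) is evidence against the null hypothesis of zero slope.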

  • 63

    The same type of information can be obtained using a different R command that summarises the results of hypothesis testing on the fitted data. That command is:

    summary(name of fit)

    For the particular name given to the fitting of the data we use:

    summary(rub1fit)

    Now look at the summary of the fit to the data below. The first output is what is being fitted:

    lm(formula=rub1$loss~rub1$hard)

    The lm stands for linear model and the bracketed term is self-explanatory: it tells us what is being fitted in the order y~x. Then comes information about the residuals: the minimum, 1st quartile, median, 3rd quartile and maximum. The next information is the coefficients, listed as (Intercept) and rub1$hard (the explanatory variable) respectively. Note that there is also a t value and a significance probability for $\alpha$ as well as $\beta$ because the same hypothesis test can be applied to both, although we're concerned here only with the slope. The t value is large in magnitude and Pr(>|t|) is very small, which again leads us to conclude that there is a relationship between the response and explanatory variables. There is also information that $R^2$=0.5442, which means that 54.42% of the variation is accounted for by the regression model.
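    Individual pieces of the summary can be extracted rather than read off the printout. The sketch below does this on simulated stand-in data (an assumption) and also checks a standard property of simple linear regression: the F statistic equals the square of the slope's t value, which is why the ANOVA and the t test lead to the same conclusion.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    fit.summary <- summary(rub1fit)   # fit.summary is a hypothetical name

    coef(fit.summary)                 # estimates, standard errors, t values, Pr(>|t|)
    fit.summary$r.squared             # proportion of variation explained

    # In simple linear regression F equals the slope's t value squared
    t.slope <- coef(fit.summary)["hard", "t value"]
    all.equal(t.slope^2, fit.summary$fstatistic[["value"]])
    ```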

  • 64

    The ANOVA and hypothesis test results together tell us the following:

    1. There is strong evidence of a relationship between loss and hardness, as reflected by the very small p value.

    2. Despite the strong evidence that $\beta \neq 0$, the amount of the variation explained by the model via $R^2$/PVE is disappointingly small at 0.54/54%.

    The relatively low value of $R^2$, despite there being strong evidence of a linear relationship between loss and hardness, may indicate that the abrasion loss depends on more than simply the hardness. Fitting the abrasion loss to a multiple regression model might improve on the PVE. However, multiple linear regression will not be covered here.

  • 65

    Chapter 9 Further Analysis 2: Leverage & Cook's Distance

    1.0 Introduction

    Earlier it was mentioned that outliers are characterised by having unusual values of the response variable and can often be identified because they have unusually large residuals. Outliers are important in regression analysis because they can have a strong effect on regression results. But outliers are not the only type of observation that can strongly affect the outcome of regression analysis. Leverage points are remote from the rest of the values of the explanatory variable but lie more or less along the regression line. An example is shown in Figure 1(a). Leverage points are characterised by having unusual x coordinates and may affect certain regression model properties. Influential points also have unusual values of the explanatory variable and tend to pull the regression model in their direction such that they exert abnormal influence on the regression coefficients. An example is shown in Figure 1(b).

  • 66

    Figure 1. Examples of (a) a leverage point (b) an influential point.

    We need to know which data, if any, are leverage points or influential points so that we can understand why they exert strong effects on the regression model. The presence of such points is not necessarily a problem because they may represent something that is real and important that occurs with a particular set of data, but we need to identify them so that we can better understand the linear regression model. We'll consider leverage points and influential points for simple linear regression models and show how to find them in R.

    2.0 Leverage

    Leverage may be of concern when it is too high, that is, when it exceeds a certain threshold. Points with high leverage have extreme values of the explanatory variable. In simple linear regression, where there is only one explanatory variable $x_i$, the leverage $h_i$ of each of the n data points is given by the equation:

    $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{(n-1)s_x^2}$    (1)

    where $\bar{x}$ and $s_x^2$ are the mean and variance of the x values.

  • 67

    How high is high leverage? There is a general rule of thumb for the threshold L that is given by:

    $L \geq \dfrac{c(k+1)}{n}$    (2)

    where k is the number of explanatory variables, c is a constant that is 2 or larger and n is the number of observed data. We are dealing with simple linear regression so clearly k=1 and in general c=2 so:

    $L_{SLR} \geq \dfrac{4}{n}$    (3)

    So for our rubber abrasion data with n=30, that means:

    $L_{SLR} \geq \dfrac{4}{30} = 0.133$

    A data point with leverage larger than about 0.133 may be considered as a point of high leverage for the rubber abrasion data. Leverage is a way to identify points that might exert too large an influence on the regression model by changing the regression coefficients of the model, which might also affect the predictions made by the model. For more on Leverage see Montgomery (footnote 1 page 40).

    2.1. Leverage Calculation & Plotting in R: Rubber Abrasion Data

    The leverage for each data point is calculated in R using the command:

    hatvalues(name of fit)

    The odd sounding name hatvalues is R terminology for leverage. When we type the following followed by enter:

    hatvalues(rub1fit)

    we get a list of the individual leverage values, point by point in the order in which they appear, as shown below:

  • 68

    We can plot the leverage values against the order of the explanatory variables using the following basic command:

    plot(hatvalues(rub1fit))

    We can change the appearance of the plot within the plot command. Let's use blue circle symbols:

    plot(hatvalues(rub1fit), pch=19, col="blue")

    Alternatively we can assign a name to the calculation of the leverage values, such as lev.rub1fit, and enter the following:

    Either way, the graph we get is shown in Figure 2.

  • 69

    Figure 2. Leverage of each point plotted against the index (the order in the data set) of the points, for the rubber abrasion data.

    There is clearly one point that exceeds 0.133, which is point 1, the first point in the data set.
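    The leverage formula and the 4/n threshold can be checked by hand, as sketched below. The data are simulated stand-ins (an assumption), deliberately built so that the first hardness value lies far from the others; only the x values matter for leverage.

    ```r
    # Hypothetical data: the first hardness value is deliberately extreme
    set.seed(1)
    rub1 <- data.frame(hard = c(20, seq(50, 89, length.out = 29)))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    lev.rub1fit <- hatvalues(rub1fit)          # leverage from R

    # Leverage from the formula: 1/n + (x - mean)^2 / ((n - 1) * variance)
    x <- rub1$hard
    n <- length(x)
    lev.by.hand <- 1 / n + (x - mean(x))^2 / ((n - 1) * var(x))
    all.equal(unname(lev.rub1fit), lev.by.hand)

    threshold <- 2 * (1 + 1) / n               # c = 2, k = 1
    which(lev.rub1fit > threshold)             # points above the threshold

    plot(lev.rub1fit, pch = 19, col = "blue")
    abline(h = threshold, lty = 2)             # dashed line at the threshold
    ```

    Drawing the threshold as a horizontal line makes the high leverage points easy to spot on the index plot.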

    3.0 Cook's Distance

    Cook's distance is a way to identify points that definitely exert too much influence on the regression model (whereas leverage identified points that might exert too much influence). Cook's distance $D_i^2$ depends on leverage and is given by:

    $D_i^2 = \dfrac{1}{k+1}\, r_i^2\, \dfrac{h_i}{1-h_i}$    (4)

    where $r_i$ is the standardised residual for point i and $h_i$ is its leverage. For simple linear regression k=1 and so we can write:

    $D_i^2 = \dfrac{1}{2}\, r_i^2\, \dfrac{h_i}{1-h_i}$    (5)

    For more on Cook's distance see Montgomery (footnote 1 page 40).
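    The formula can be compared with R's built-in cooks.distance() function, as in this sketch on simulated stand-in data (an assumption). Here the standardised residuals are taken from rstandard() and the leverages from hatvalues().

    ```r
    # Hypothetical data: the first hardness value is deliberately extreme
    set.seed(1)
    rub1 <- data.frame(hard = c(20, seq(50, 89, length.out = 29)))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    r    <- rstandard(rub1fit)    # standardised residuals
    hlev <- hatvalues(rub1fit)    # leverages
    k    <- 1                     # one explanatory variable

    cook.by.hand <- r^2 * hlev / ((k + 1) * (1 - hlev))

    # Compare with R's own calculation
    all.equal(unname(cook.by.hand), unname(cooks.distance(rub1fit)))
    ```

    A point needs both a sizeable residual and a sizeable leverage to produce a large Cook's distance.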

  • 70

    3.1. Cook's Distance Calculation & Plotting in R: Rubber Abrasion Data

    Cook's distance for each data point is calculated in R using the command:

    cooks.distance(rub1fit)

    We get a list of the individual Cook's distance values, point by point in the order in which they appear, as shown below:

    As with leverage, a plot of Cook's distance is easily obtained using:

    plot(cooks.distance(rub1fit), pch=19, col="red")

    The plot is shown in Figure 3. What we are looking for is:

    1. Large values of Cook's distance.

    2. Regular patterns in the Cook's statistic that may indicate that certain data points are not independent.

  • 71

    Figure 3. Cook's distance of each point plotted against the index (the order in the data set) of the points, for the rubber abrasion data.

    Looking at Figure 3 we see that point 1 has a large Cook's distance. It also had high leverage. The conclusion is that point 1 is influential in the regression fitting because both leverage and Cook's distance are large. But there may be a greater concern. From Figure 3 we can see that Cook's distance decreases for points 1-4, then increases between points 4 and 6, then shows a general trend to decrease from point 8 to 16, followed by a short plateau and then a general trend for Cook's distance to increase significantly from point 20 onwards. How can we interpret these downward and upward trends? It's possible that experiments were carried out at different times and possibly even under slightly different environmental conditions. With this kind of data we should if possible ask the experimenter(s) to confirm how the tests were run, the test order, laboratory conditions etc., which may help us understand why there are such trends.

  • 72

    4.0 R's Built-In Diagnostic Analyses

    The residual, leverage and Cook statistic plots that we just did can also be easily plotted in R after the data have been fitted. Simply type the command shown below in the interactive console and press enter:

    You'll be prompted to hit Return to see a series of 4 plots that include the residuals versus fitted values, Cook's distance etc. An example of one such plot is shown below and points 26 and 29 have been flagged as potential points of influence.

    If you want to see all 6 graphs in one plot the plot space needs to be divided into 3 rows and 2 columns OR 2 rows and 3 columns using par(). This is shown below in the script window, followed by the graphical output.
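    A sketch of the six-panel layout is given here, using simulated stand-in data (an assumption). It relies on the which argument of plot() for fitted lm objects, which selects the diagnostic graphs by number.

    ```r
    # Hypothetical stand-in for the rubber abrasion data (not the real values)
    set.seed(1)
    rub1 <- data.frame(hard = seq(45, 89, length.out = 30))
    rub1$loss <- 550 - 5 * rub1$hard + rnorm(30, sd = 40)
    rub1fit <- lm(loss ~ hard, data = rub1)

    old.par <- par(mfrow = c(3, 2))   # 3 rows, 2 columns; remember the old layout
    plot(rub1fit, which = 1:6)        # all six diagnostic graphs
    par(old.par)                      # restore the original layout
    ```

    Restoring the layout afterwards means the next plot fills the whole plot window again.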

  • 73

    If we don't specify that we want graphs 1 to 6 we get just four graphs:

  • 74

    Finally, we can also obtain specific graphs but remember to make the required number of panels in the plot window:

  • 75

  • 76

    Chapter 10 Making Non-Linear Data Linear: Transformations

    1.0 Introduction

    Some graphs are not linear, like the graph in Figure 1 that shows cement strength plotted against time. What can we do when the relationship between a response variable and an explanatory variable is not linear?

  • 77

    Figure 1. Graph of non-linear relationship between cement strength and time.

    The usual way to handle such things is to make the relationship linear by transforming the explanatory variable. The response variable or both variables can also be transformed. This chapter will explore the basics of transformations but since this text is elementary we'll keep things simple.

    2.0 How Do We Transform?

    Look at Figure 1 and ask yourself, how would I make the data more linear? We transform, which means we apply a mathematical function to the explanatory variable or the response variable (or both) to make their relationship linear. The preference is to transform the explanatory variable rather than the response variable. Taking our cement data as an example, Figure 2 suggests what we could do to the explanatory variable to make the data more linear. The concept is simple but finding a suitable transformation is not always easy. In Figure 2 the mathematical function cannot have the same effect on all values of the explanatory variable because the separation between them will not change. So in Figure 2(a) we would like a function that shifts larger values further apart from the smaller values. In Figure 2(b) we want a function that reduces the distance between larger values more than between smaller values.


    Figure 2. The solid line represents a possible fitted line. Transforming to (a) spread out the values of the explanatory variable (b) move the larger values of

  • 78

    the explanatory variable closer to the smaller values.

    If we transform the response variable the same principles apply. In Figure 3(a) smaller values of the response should be shifted closer to the larger values and therefore a function is required that shifts smaller values more than larger values. The case in Figure 3(b) is the opposite: we want a function that shifts larger values much more than smaller values.


    Figure 3. The solid line represents a possible fitted line. Transforming to (a) shift lower values of the response variable closer to higher values (b) move the larger values of the response variable further away from the smaller values.

    3.0 Transformation Guidelines & Techniques

    Mosteller and Tukey2 came up with their bulging rule guideline for how to transform graphs. Figure 4 illustrates the principle, with each quadrant of the circle showing a typical graph shape that needs to be transformed to make it linear. Now the idea is that the explanatory variable, the response variable OR BOTH can be transformed to make the relationship more linear. The transformation applied to the variables is a power transformation and that means the explanatory variable x is transformed to a power p and the response variable to a power q. Figure 4 shows what kind of powers should be used according to the graph shape.

    2 F. Mosteller & J. W. Tukey in Data Analysis & Regression (1977).

  • 79

    What transformation can we use as p (or q) decreases below 1 and towards zero and then to negative values? A power of zero transforms everything to 1, which is useless, so the transformation that sits between low positive powers and low negative powers is the log transformation. Let's try to apply these guidelines to the cement data.

    Figure 4. Shapes of graphs and guidelines for power transformations that can linearize the relationship between a response variable and an explanatory variable.

    4.0 Transforming & Fitting the Cement Strength Data

    4.1. Finding a Transformation

    We already know that the relationship is not linear from Figure 1, so Assumption 1 (Chapters 7 & 8) is not met and we need to try a transformation. Comparing Figure 1 with Figure 4 it seems we have the shape in the upper left quadrant of Figure 4 and we should consider transforming the explanatory variable using p

  • 80

    We can create a new variable that is some function of the explanatory variable time simply by applying the function to the values in the time column. This is easily done. We know that we want p

  • 81


    Figure 5. Effects of different p values for the explanatory variable on the linearity of the relationship. (a) p=0.1 (b) log (x) (c) p=-0.1 (d) p=-0.5.
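    Trying out several powers is quick in R because a transformed column is just a function applied to an existing column. The sketch below uses a small made-up data frame called cement with columns time and strength; the numbers and the new column names are hypothetical, not the real cement data.

    ```r
    # Hypothetical cement data: curing time and strength (illustration only)
    cement <- data.frame(time     = c(1, 2, 3, 7, 14, 28),
                         strength = c(12, 22, 26, 33, 36, 39))

    # New columns: the same four transformations as in Figure 5
    cement$t.p01 <- cement$time^0.1      # p = 0.1
    cement$t.log <- log(cement$time)     # log transformation
    cement$t.m01 <- cement$time^(-0.1)   # p = -0.1
    cement$t.m05 <- cement$time^(-0.5)   # p = -0.5

    # Plot strength against each transformed variable in a 2 x 2 grid
    old.par <- par(mfrow = c(2, 2))
    plot(strength ~ t.p01, data = cement, pch = 19, main = "p = 0.1")
    plot(strength ~ t.log, data = cement, pch = 19, main = "log(x)")
    plot(strength ~ t.m01, data = cement, pch = 19, main = "p = -0.1")
    plot(strength ~ t.m05, data = cement, pch = 19, main = "p = -0.5")
    par(old.par)
    ```

    Note that a negative power reverses the order of the x values, so a curve that rises against time falls against the transformed variable.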

    4.2. Checking the Regression Assumptions after Transformation

    We've satisfied one of the regression assumptions, linearity of the relationship between the response and explanatory variable, by transformation. Now we need to check the others.

    Normal Distribution of the Random Error Variance: Let's check that the residuals are distributed about a mean value of zero. A histogram of the standardised residuals is one way to check the shape, and hence the normality, of their distribution (but not the only one and it should not be used as a standalone test). The histogram in Figure 6 does not look particularly normal in shape but remember that the number of data is small and with small datasets histograms may not appear normal. There is no data outside of 2 standard deviations from the mean of zero. It is possible for the probability plot to be linear even when the histogram is not when

  • 82

    the data set is small. It's a good idea then to look at the probability plot next.

    Figure 6. Histogram of standardised residuals for the cement strength data.

    The probability plot in Figure 7 shows that most of the data are close to the theoretical line with one point at the lower end of the theoretical quantile and 2 or possibly 3 points at the upper end deviating from the straight line. However, all data is within 2 standard deviations of the mean. Considering the histogram and the probability plot, it seems reasonable that the data are approximately normally distributed without outliers.

  • 83

    Figure 7. Probability plot of the cement strength standardised residuals.

    Constant Random Error Variance: Now, for the variance of the residuals against the fitted values, look at Figure 8.

    Figure 8. R input and graph of standardised residuals vs. fitted data.

    Looking at the graph in Figure 8 the residuals seem to vary more or less symmetrically around zero, more so for the three larger fitted values. Overall, the data seem to meet the criterion for zero mean and although

  • 84

    the smaller fitted values seem to have a smaller spread the spread of the larger values is more or less the same. Not perfect agreement but OK.

    Independence of the Random Error: There doesn't appear to be any obvious regular pattern in the residuals versus fitted values. But what about the standardised residuals plotted against the order of the data i.e. the index plot? Figure 9 shows the index plot and overall there is no indication of a significant increase in the variance and no patterns in the data. Therefore the residuals appear to be independent.

    Figure 9. Index plot of standardised residuals (standardised residuals plotted against order in the original data file).

    Transforming the explanatory variable from x to $x^{-0.5}$ results in apparently good linearity and the other assumptions of simple linear regression are generally satisfied.

    4.3. Fitting the Response Variable to the Transformed Explanatory Variable

    For the cement strength data a transformation of the explanatory variable with p=-0.5 results in a linear relationship between cement

  • 85

    strength and $t^{-0.5}$. Figure 10 shows a plot of the transformed and fitted data and the regression summary. The regression fit is:

    $\hat{y} = 45.66 - 33\,t^{-0.5}$

    The regression analysis in Fig 10(b) shows:

    1. A very small p value for the slope $\beta$ that strongly suggests that there is a relationship between the variables.

    2. A high $R^2$ that implies the transformed model fits the data very well.
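    Fitting to a transformed variable uses exactly the same lm() command as before, with the transformed column as x. The sketch below uses a made-up cement data frame (hypothetical values and names), so its coefficients are not those of the real fit.

    ```r
    # Hypothetical cement data: curing time and strength (illustration only)
    cement <- data.frame(time     = c(1, 2, 3, 7, 14, 28),
                         strength = c(12, 22, 26, 33, 36, 39))
    cement$t.m05 <- cement$time^(-0.5)             # transformed explanatory variable

    cemfit <- lm(strength ~ t.m05, data = cement)  # cemfit is a hypothetical name
    summary(cemfit)                                # coefficients, p values, R-squared

    plot(strength ~ t.m05, data = cement, pch = 19)
    abline(cemfit)                                 # fitted line on the transformed scale
    ```

    The slope is negative here because the transformed variable gets smaller as time increases while strength gets larger.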

    (a)

  • 86

    (b)

    Figure 10. (a) Plot of cement strength against $t^{-0.5}$ with a linear regression line (b) regression analysis.

  • 87

    Chapter 11 Saving it All as an RStudio Script File & Plot Aesthetics

    1.0 Introduction

    Working interactively in R is great for exploring your data and trying out different ways of plotting and transforming data. But once you've settled on what you want to plot and how you want to plot it, instead of plotting everything interactively it's often easier to write an R script that contains instructions for what is plotted and how it looks.

    An R script contains all of the commands in one block of text that is executed all at once rather than command by command. The commands you've used interactively to plot graphs and change plot symbols and colours are still used, but now that you've decided what you want to plot and the titles, annotation etc., it's easier to write it all as one script that you can save, run and modify as and when you please.

    We'll work with the rubber abrasion data again to show how to change the appearance of the plots and get them publication or report ready.

    2.0 The R Script Window

    All the while you've been working with RStudio you should have seen a blank script window like the one in Figure 1. Now is the time to use that window to assemble an R script that contains plot commands and/or commands to perform some analysis of data.

    Writing scripts is like writing a program and, like programs, scripts can be simple or complex depending on what you're trying to do. For simple linear regression we'll simply move from plotting and getting basic statistical analyses interactively to putting all our plotting and data analysis requirements into a single script that we only have to execute once.

    Like programs, R scripts in RStudio permit the inclusion of comments so that you and the people you share the script with are clear about what the script is doing.

    3.0 Reproducing the Basic Plot in an RStudio Script

    3.1. Writing the Script

    The goal of this section is to reproduce the abrasion loss scatterplot made interactively, but this time using a script.

    The first thing to do is read in the data, isn't it? No, the first thing we're going to do is make a comment about the data we're going to read in. In R/RStudio comments start with a # as shown below:

    Note how a line number is assigned to the comment. We can be more explicit about what we're doing by placing comments on multiple lines. RStudio can wrap text but it's better to write short, concise comments on multiple lines rather than long rambling notes. We could write something like:

    At this point you might want to save the file in the same directory as the data, i.e. the working directory for this particular data. Whenever you open this particular RStudio script you can use the toolbar to set the working directory or you can set the working directory in the script. Setting it in the script is a good idea. If you don't know the path of the working directory, point RStudio to the directory that contains the data file using the toolbar and, in the console (where you did the interactive stuff), type getwd():

    Let's make a comment in the R script window that we're setting the working directory:

    Now type setwd() in the script window on the next line. When you type the first bracket RStudio automatically generates the closing bracket and places the cursor in between.

    Now copy the working directory (your own of course, not mine) and paste it between the brackets:

    Now insert an empty line, make a comment and let's type the statement you should know well by now that reads in the data:
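    Put together, the opening lines of such a script might look like the sketch below. Several things here are assumptions made only so that the sketch runs anywhere: the file name rubber.csv, the use of read.csv(), a temporary folder standing in for your own working directory, and the data values, which are invented. The name rub1 and its columns loss and hard follow the names used later in this chapter.

```r
# Rubber abrasion data: abrasion loss versus Shore hardness
# (lines starting with # are comments and are ignored when the script runs)

# Hypothetical working directory: a temporary folder stands in for your own path
work_dir <- tempdir()

# Illustration only: write a small made-up data file so that the sketch can run
write.csv(
  data.frame(hard = c(45, 55, 61, 66, 71, 75, 81, 86),
             loss = c(372, 206, 175, 154, 136, 112, 55, 45)),
  file.path(work_dir, "rubber.csv"),
  row.names = FALSE
)

# Set the working directory (setwd() returns the previous one)
old_dir <- setwd(work_dir)

# Read in the data
rub1 <- read.csv("rubber.csv", header = TRUE)

# Return to the previous working directory (housekeeping for this sketch only)
setwd(old_dir)
```

    In your own script the path between the brackets of setwd() would be the one reported by getwd().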

    Remember that this is a script and nothing is executed when you press enter. Executing a script requires a different action from the interactive mode. All we're doing here is typing in commands that we want to be executed later, commands we know should work.

    Now to the plotting. We know this data set very well from our interactive work on plotting and fitting it, so the next thing we probably want to do is plot the data. Here the plot is assigned the name plot1 and we use the plot command in the form:

    plot(y~x)

    But you've already seen that within the brackets of the plot command it's possible to specify the type of plot character and its colour. In fact many features of the plot can be specified within the brackets, including:

    Axis line width
    Axis labels
    Axis titles
    x and y axis limits
    Plot characters
    Line widths and colours

    Now let's specify the character and its colour, which you've done before.

    This looks OK and will work when we execute the script (which we'll get to soon), but we need to make the script readable and informative so that we remember the steps in it and anyone else reading it can understand what it's meant to do.

    We can do that by entering new lines and making comments within the plot() command. Firstly, press return after the first bracket and then use the tab key to indent the cursor so that it lies under the first bracket. Then type a comment.

    Press enter again and indent rub1$loss~rub1$hard so that it lies under the comment. Place a comma after rub1$loss~rub1$hard (because other instructions follow it within the brackets), press enter and lay out the script so that it is clear what we have done.
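    A plot() command laid out in this way might look like the sketch below. The data values are invented so that the sketch runs on its own, and the colour and plot character are arbitrary choices rather than the ones in the original script.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

plot1 <- plot(
  # Abrasion loss (response) against hardness (explanatory variable)
  rub1$loss ~ rub1$hard,
  # Plot character: filled circle
  pch = 19,
  # Plot character colour
  col = "red"
)
```

    R allows comments between the arguments of a function, so each instruction can carry its own note. The plot is drawn as a side effect: plot() itself returns NULL, so the name plot1 does not store the graph.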

    3.2. Running the Script

    Now that you've got this far let's run, or source, the script. There are two ways to do this.

    3.2.1 Sourcing the Script

    The usual way to do this is to click on the Source button (see below).

    Sourcing the script (as you've just done) causes RStudio to execute the script in the active document and generate the result shown below:

    3.2.2 Ctrl+Shift+Enter

    An alternative way to execute the script in the active document is to press Ctrl+Shift+Enter.

    4.0 Modifying Titles & Legends

    4.1. The Main Title

    By default there is no main title for scatterplots but you can add one if you wish. We want to give the title a name and specify the font size, the font type (bold or otherwise), the font colour etc.

    4.1.1 Main Title Name

    Like the x and y-axis titles, the main title is inserted within the brackets of the plot function. The command is:

    main="Insert title"

    Subscripts and superscripts can be included in the main title using exactly the same format as the axis labels. An example is shown below.

    The graph that we get when this script is sourced is shown below.

    4.1.2 Main Title Font

    By this we mean bold or italic. By default the main title is in bold. This can be changed using:

    font.main=integer 1 to 4

    1 corresponds to plain text, 2 to bold face (the default for the main title), 3 to italic and 4 to bold italic. We can change the title to bold italic using 4.

    4.1.3 Main Title Font Size

    The size of the text for the main title is controlled by the command

    cex.main=a number

    cex is an abbreviation of character expansion: it's a number that enlarges or contracts the text. cex.main sets the size of the main title text, just as cex.lab sets the size of the axis title text. The number specifies how much the text is expanded relative to a value of 1 and can be chosen to be 0.5, 0.75, 1.3, 1.5, 2 etc. If you specify nothing you get R's defaults, which are 1.2 for cex.main and 1 for cex.lab. An example is given below with the corresponding graphical output.

    4.1.4 Main Title Colour

    The colour of the main title font is changed using the following command:

    col.main=a number

    OR

    col.main="a name"

    Colour numbers were given in an earlier chapter. The next example changes the colour to blue. Note the comma after each command within the brackets except the last.
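    The four main title settings can be combined in one plot() command, as in the sketch below. The data values and the title text are invented for illustration; only the argument names main, font.main, cex.main and col.main come from the text.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

plot(
  rub1$loss ~ rub1$hard,
  pch = 19,
  col = "red",
  # Main title text (hypothetical wording)
  main = "Abrasion Loss versus Hardness",
  # Main title font: 4 = bold italic
  font.main = 4,
  # Main title size: 1.5 times the base text size
  cex.main = 1.5,
  # Main title colour (no comma after the last command)
  col.main = "blue"
)
```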

    4.2. Modifying Axis Titles

    Axis titles are easily modified in R. Commands for modifying title font, colour and size are similar to those used for the main title.

    4.2.1 Axis Name

    The R commands for assigning axis titles are:

    xlab="title goes here"
    ylab="title goes here"

    We can add a comment that says we're making axis titles and then type a title:

    To run this and generate a new plot, source the script as shown earlier. The result is shown below.

    The font, font size and font colour are controlled using similar commands to those used for the main title. They are:

    font.lab=integer 1 to 4 (bold, italic etc.)

    cex.lab=a number (font size)

    col.lab=a number or colour (font colour)

    When inserted into the script we have:
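    A sketch of axis titles with their font, size and colour set is given here. The data values, the title wording and the particular settings are invented for illustration.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

plot(
  rub1$loss ~ rub1$hard,
  pch = 19,
  # Axis titles (hypothetical wording)
  xlab = "Rubber Hardness (degree Shore)",
  ylab = "Abrasion Loss (grams per hour)",
  # Axis title font: 3 = italic
  font.lab = 3,
  # Axis title size relative to the base text size
  cex.lab = 1.25,
  # Axis title colour
  col.lab = "blue"
)
```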


    4.2.2 Superscripts & Subscripts

    Superscripts and subscripts require a modification to the axis title format. The command for a superscript takes the form:

    xlab="title1"~X^Y~"title2"

    The command for a subscript takes the form:

    xlab="title1"~X[Y]~"title2"

    In both cases X is the letter or number to which the superscript or subscript is attached. Y is the number, letter or symbol that is the superscript or subscript.

    4.2.2.1 Superscript

    Let's label the y axis with the text Abrasion Loss (grams h^-1). Using the given format:

    ylab="Abrasion Loss (grams"~h^-1~")"

    Note how we have to include one bracket in the first set of inverted commas and the second bracket in another set of inverted commas. This is shown in the script window below for the y label:

    The result is the graph shown below on the right but the grams and h^-1 aren't separated.

    To separate the two we need to place a space after grams and within the inverted commas:

    And then we get the required separation.

    The degree sign used for temperature is a special case and requires a different syntax.

    4.2.2.2 Subscript

    Let's call the Shore Hardness HS (H with a subscript S) and label the x axis accordingly. The command for the x label is:

    xlab="Rubber Hardness "~H[S]~"(degree Shore)"

    This is shown in the script.

    The graphical output is shown below.

    4.2.3 Greek Characters

    We'll demonstrate this for just the y label but of course it applies to the x label and the main title as well. The syntax for getting Greek letters is to type the name of the letter. Let's give abrasion loss the symbol λ. We can write the y label as:

    ylab="Abrasion Loss "~lambda~" (grams "~h^-1~")"

    Note the space after Loss and before the closing inverted comma: remember this is used to create a space between Abrasion Loss and λ in the graph. There is also a space between the opening inverted comma and (grams so that λ and (grams don't run together in the y label. The script and the graph are shown below.

    To get lower case Greek letters type alpha, beta, gamma etc. and for upper case letters type Alpha, Beta, Gamma etc.
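    The subscript, superscript and Greek letter formats can be tried together in the sketch below. Note one substitution: the labels are wrapped in expression(), the form documented in R's plotmath help, rather than written as bare quoted text joined by ~ as in the text. The data values are invented for illustration.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

# Subscript: H[S] draws H with a subscript S
x_label <- expression("Rubber Hardness " ~ H[S] ~ "(degree Shore)")

# Greek letter and superscript: lambda, then h raised to the power -1
y_label <- expression("Abrasion Loss " ~ lambda ~ " (grams " ~ h^-1 ~ ")")

plot(
  rub1$loss ~ rub1$hard,
  pch = 19,
  xlab = x_label,
  ylab = y_label
)
```

    Storing each label under its own name keeps the plot() command short when the labels become long.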

    5.0 Linear Fitting of Data

    Commands for fitting data and plotting the fitted line are inserted outside of the plot function.

    5.1. Fitting the Data & Inserting the Fitted Line

    You've already come across the command for linear fitting, which is:

    lm(y~x)

    And the fitted line can be inserted using:

    abline(name of fit)

    OR

    abline(lm(y~x))

    Within the abline() command the line width, line type and colour can be specified.

    lwd=number (line width)

    The default is 1 and numbers can be smaller or larger than 1, e.g. 0.7, 1.2 etc.

    lty=integer (line type)

    Here the line type is the style (dashed, dotted etc.), with integers determining the style. The integers and corresponding line types are shown below:

    The fit and the inserted line are preceded by a comment (get used to doing this so you know what you did) explaining what the commands are doing. The relevant part of the script is shown below. Take note of the line type and colour. The graph is also shown below.
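    A sketch of the fitting step is given here, with invented data values and an arbitrary choice of line width, type and colour. The name fit1 follows the naming used elsewhere in the text.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

# Plot the data first: abline() adds to an existing plot
plot(rub1$loss ~ rub1$hard, pch = 19)

# Fit abrasion loss against hardness by simple linear regression
fit1 <- lm(rub1$loss ~ rub1$hard)

# Add the fitted line: width 2, dashed (lty = 2), blue
abline(fit1, lwd = 2, lty = 2, col = "blue")
```

    In base R, lty = 1 is a solid line, 2 is dashed and 3 is dotted.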

    6.0 Scatterplot Sample Script for Rubber Abrasion Data

    6.1. Plotting and Regression with Fitted Line

    An example of a script for the rubber abrasion data is shown below. This can be used as a model script for your own data.

    The script results in the following plot and regression analysis.

    6.2. Plot Graph, Perform Regression with Fitted Line and Export Regression Fit to a txt File

    A sample script that plots, fits, shows the fitted line and exports the regression table to a txt file is shown below.

    The regression fit is exported to a txt file:

    6.3. Plot Graph, Perform Regression with Fitted Line, ANOVA and Export Regression Fit and Regression ANOVA to a txt File

    The previous script is modified as shown below to export the regression summary and ANOVA summary to the same text file s1.
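    One possible way to write both tables to a single text file is sketched below. It uses capture.output() and writeLines(), which may not be the commands used in the original script. The file name s1.txt, the temporary folder and the data values are assumptions made so that the sketch runs without touching your own files.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

# Plot, fit and add the fitted line
plot(loss ~ hard, data = rub1, pch = 19)
fit1 <- lm(loss ~ hard, data = rub1)
abline(fit1, col = "red")

# Hypothetical output file, placed in a temporary folder
out_file <- file.path(tempdir(), "s1.txt")

# capture.output() collects the printed tables as lines of text
report <- c(
  "Regression summary",
  capture.output(summary(fit1)),
  "",
  "Regression ANOVA",
  capture.output(anova(fit1))
)

# Write both tables to the same text file
writeLines(report, out_file)
```

    Replacing tempdir() with your working directory places the file alongside the data.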

    Chapter 12 Plotting Multiple Sets of Data

    1.0 Introduction

    Chances are that you'll have more than one set of response variable data measured at the same values of the explanatory variable and will first want to explore the data by plotting all the sets on the same graph. Or you may have sets of response variable data measured at different values of the explanatory variable that you want to plot on the same graph. This chapter shows you how to do this, firstly interactively and then via script writing.

    2.0 Cu Wire Data Set

    Let's look at a set of data on the tensile strength of different copper wires in wire bonds aged at 200 °C for various times. We've called the data cu200 and saved it as a csv file in a specific working directory. The data consist of the tensile strengths of four different copper wire bonds after aging at 200 °C for the same times.

    2.1. Interactive Plotting of Multiple Data

    By now you should be familiar with loading in a csv data file interactively. After loading in the data we create a graph called plot1 that's a graph of the first set of data w1, plotted against time t. The interactive input is shown below with the graph of w1 versus t.

    The next step is to plot the second set of data w2 against the same values of time. This is done using a command called points(), which is used in a similar way to plot(): the variables to be plotted are placed within the brackets and the plot characters, the plot colour etc. are specified in the same way as for plot(). We assigned the name plot2 to the plot of the second set of data made with the points() command, as shown below with the graph.

    Graphs of the other two copper wire strengths versus time are added in exactly the same way. The input and graphs are shown next.
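    The sequence of one plot() followed by several points() commands can be sketched as follows. The values in cu200 are invented for illustration (only the names cu200, t and w1 to w4 come from the text), and the plot characters and colours are arbitrary choices.

```r
# Hypothetical stand-in for the cu200 data (values invented for illustration)
cu200 <- data.frame(
  t  = c(0, 100, 250, 500, 750, 1000),
  w1 = c(16.0, 15.1, 14.2, 12.8, 11.9, 10.7),
  w2 = c(15.2, 14.0, 12.9, 11.1, 9.8, 8.9),
  w3 = c(14.1, 12.6, 11.0, 9.2, 7.9, 6.8),
  w4 = c(13.0, 11.2, 9.1, 7.0, 5.4, 4.1)
)

# plot() starts the graph; ylim is set so that all four data sets fit
plot(cu200$w1 ~ cu200$t, pch = 19, col = "black", ylim = c(0, 18))

# points() adds further data sets to the existing graph
points(cu200$w2 ~ cu200$t, pch = 17, col = "red")
points(cu200$w3 ~ cu200$t, pch = 15, col = "blue")
points(cu200$w4 ~ cu200$t, pch = 18, col = "darkgreen")
```

    Setting ylim in the first command matters because points() never rescales the axes: values outside the range set by plot() are simply not shown.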

    2.2. Interactive Analysis of Multiple Data

    The procedures we used before can be applied here. We can fit each set of data and plot the regression line on the graph, although it can look messy. To fit the regression line we simply assign a name to each fit and each line, like fit1 and line1 for plot1.

    We can repeat the process for each plot and get the rather messy graph shown below.

    Once we've done the fitting we can of course get the analysis results using the summary command and assigning a unique name to each summary. We show this below only for the first set of data w1.

    2.3. An Example Script to Plot the Data with a Legend

    When you've finally decided what you want to plot, you probably want to add a legend. We don't recommend adding legends during interactive plotting; they don't normally serve any purpose at that stage. Adding legends in R can seem tricky. For this set of data, which consists of points, the command for producing a legend has the syntax:

    legend(x, y, labels, pch=c(pointtype1, pointtype2, ...), col=c(col1, col2, ...))

    Looks complex. But let's break it down:

    1. The x, y refers to the co-ordinates for the legend (not the plot variables).

    2. labels refers to the names of each variable. We called them w1, w2, w3 and w4.

    3. The pch part defines the plot characters. We should match them to the plot characters we specified in each plot.

    4. The col part defines the colours, which should likewise match the colours we specified in each plot.

    The best way to illustrate how to get legends is to show some examples.

    2.4.1. Easy Legend Placement

    The simplest way to place a legend is to tell R to place it in one of the four corners of the plot, at the top or bottom of the graph or in the centre. The portion of the script to do this is shown next together with the other commands for legend names, symbol type and colour. Two graphical examples are shown, the second with different legend names. The options for placement are:

    "topleft", "top", "topright", "bottomleft", "bottom", "bottomright", "center"
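    A sketch of keyword placement follows, using invented values for cu200 and arbitrary plot characters and colours. The legend is placed with "topright"; any of the other keywords can be substituted.

```r
# Hypothetical stand-in for the cu200 data (values invented for illustration)
cu200 <- data.frame(
  t  = c(0, 100, 250, 500, 750, 1000),
  w1 = c(16.0, 15.1, 14.2, 12.8, 11.9, 10.7),
  w2 = c(15.2, 14.0, 12.9, 11.1, 9.8, 8.9),
  w3 = c(14.1, 12.6, 11.0, 9.2, 7.9, 6.8),
  w4 = c(13.0, 11.2, 9.1, 7.0, 5.4, 4.1)
)

plot(cu200$w1 ~ cu200$t, pch = 19, col = "black", ylim = c(0, 18))
points(cu200$w2 ~ cu200$t, pch = 17, col = "red")
points(cu200$w3 ~ cu200$t, pch = 15, col = "blue")
points(cu200$w4 ~ cu200$t, pch = 18, col = "darkgreen")

# Legend in the top right corner; pch and col match the plotted data sets
leg <- legend(
  "topright",
  legend = c("w1", "w2", "w3", "w4"),
  pch = c(19, 17, 15, 18),
  col = c("black", "red", "blue", "darkgreen")
)
```

    The order of the entries in pch and col must follow the order of the names in legend, otherwise the key will not match the graph.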

    2.4.2. Placing the Legend with a Mouse Click

    A simple modification of the legend command allows the user to run the script and locate the legend on the graph using the cursor. The syntax is:

    legend(locator(1), labels, pch=c(pointtype1, pointtype2, ...), col=c(col1, col2, ...))

    The command locator(1) must not be changed while the rest of the commands can be customised as shown earlier. An example of the script is shown next. We've entered each command as a separate line to make it easier to read.

    An example is shown below.

    2.4.3. Positioning the Legend with Coordinates

    This is done by specifying x and y, the coordinates of the legend. The coordinates are based on the scale of the graph. In our wire plot the scale is approximately 0-1100 for x and 0-18 for y. A sample script is shown below.

    The graph is shown below. The vertical and horizontal red lines mark x=200 and y=5 and show that R places the box such that its left side is aligned at x=200 and its top is at y=5.
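    Coordinate placement at x=200 and y=5 can be sketched as below, again with invented values for cu200 and only two of the data sets to keep the sketch short.

```r
# Hypothetical stand-in for the cu200 data (values invented for illustration)
cu200 <- data.frame(
  t  = c(0, 100, 250, 500, 750, 1000),
  w1 = c(16.0, 15.1, 14.2, 12.8, 11.9, 10.7),
  w2 = c(15.2, 14.0, 12.9, 11.1, 9.8, 8.9)
)

plot(cu200$w1 ~ cu200$t, pch = 19, col = "black",
     xlim = c(0, 1100), ylim = c(0, 18))
points(cu200$w2 ~ cu200$t, pch = 17, col = "red")

# Legend box with its left side at x = 200 and its top at y = 5
leg <- legend(
  200, 5,
  legend = c("w1", "w2"),
  pch = c(19, 17),
  col = c("black", "red")
)

# Red guide lines marking x = 200 and y = 5
abline(v = 200, h = 5, col = "red")
```

    The value returned by legend() records where the box was drawn, which is a quick way to confirm the placement.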

    2.4.4. Legend without Borders

    The legend border can be omitted as follows:
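    A minimal sketch, assuming the border is switched off with the bty setting listed in Table 1 of Chapter 13; the data values are invented for illustration.

```r
# Hypothetical stand-in for the cu200 data (values invented for illustration)
cu200 <- data.frame(
  t  = c(0, 100, 250, 500, 750, 1000),
  w1 = c(16.0, 15.1, 14.2, 12.8, 11.9, 10.7),
  w2 = c(15.2, 14.0, 12.9, 11.1, 9.8, 8.9)
)

plot(cu200$w1 ~ cu200$t, pch = 19, col = "black", ylim = c(0, 18))
points(cu200$w2 ~ cu200$t, pch = 17, col = "red")

# bty = "n" suppresses the box around the legend
leg <- legend(
  "bottomleft",
  legend = c("w1", "w2"),
  pch = c(19, 17),
  col = c("black", "red"),
  bty = "n"
)
```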

    3.0 Multiple Sets of Data with Different Explanatory Variable Values

    With sets of data measured at different values of the explanatory variable that need to be plotted on the same graph, we follow the same procedure as before except that we specify different variables in the points command. We'll plot the data shown below, which consist of two sets of measurements of the same response variable x versus time but from two different experiments.

    The graph is shown below and the script is on the next page.
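    A sketch of this case is given below. The two data frames exp1 and exp2, their column names and all of their values are hypothetical. Since the two experiments cover different times, the axis limits are taken from both sets before plotting.

```r
# Hypothetical measurements of the same response in two experiments (values invented)
exp1 <- data.frame(t = c(0, 10, 20, 30, 40),
                   x = c(1.0, 1.8, 2.9, 3.7, 4.6))
exp2 <- data.frame(t = c(5, 15, 25, 35, 45, 55),
                   x = c(0.9, 1.5, 2.2, 2.8, 3.3, 3.9))

# Axis limits that cover both experiments
t_range <- range(exp1$t, exp2$t)
x_range <- range(exp1$x, exp2$x)

# First experiment starts the graph
plot(exp1$x ~ exp1$t, pch = 19, col = "black",
     xlim = t_range, ylim = x_range,
     xlab = "Time", ylab = "x")

# Second experiment is added with its own time values
points(exp2$x ~ exp2$t, pch = 17, col = "red")

legend("topleft",
       legend = c("Experiment 1", "Experiment 2"),
       pch = c(19, 17),
       col = c("black", "red"))
```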

    Chapter 13 The R Graphics Environment & Graph Customization

    1.0 Introduction

    We've concentrated on getting things plotted up to now because this text is intended to enable R novices to plot graphs quickly. But at this point, having plotted some graphs, it seems appropriate to understand better how R sets up graphics in order to learn how to customise the appearance of graphs. This chapter explains the very basics of how R is set up to plot graphs and how to customize graphs further to get the look that you want. The rubber abrasion example will be used to illustrate graph customization in RStudio scripts.

    2.0 How R Divides Plots into Regions

    With reference to the illustration in Figure 1, an R graph is divided into several regions:

    The plot area
    The plot margin
    The outer margin

    We'll only concern ourselves with the plot area and the plot margin. The outer margin can also be modified within R but we won't deal with that topic in this text.

    Figure 1. Illustration of the division of a single plot into 3 areas consisting of the plot area, the plot margin and the outer margin. Each margin is sub-divided into four areas.

    We'll explain the layout of the plot area and the plot margins.

    2.1. The Plot Area

    The coordinate system for the plot area is simple and has already been used when we talked about legend placement (Chapter 12, Section 2.4.3). The coordinate system for the plot area is the x and y scales, as illustrated in Figure 2. So if your x scale is from 0 to 50, the x-coordinate is between 0 and 50 and if the y scale is -10 to 100, the y-coordinate is within those limits.

    2.2. The Plot Margins

    With reference to Figure 2, each plot margin is divided into 5 lines, starting from Line 0 adjacent to the plot area and going to Line 4 moving away from the plot area.

    Figure 2. Illustration of the division of a single plot into 2 areas consisting of the plot area and a margin area that is sub-divided into 4 margins. Within each margin there are 5 lines.

    The plot margins contain information such as text labels, text axis labels and tick marks. Positioning those items is achieved by specifying the line where each item appears.

    3.0 High & Low Level Plot Functions and Commands

    High-level plot functions create complete graphs. Functions like plot(y~x) produce a basic graph with default values for any unspecified parameters. High-level commands like pch and xlab are added to high-level functions to change the appearance of graphs (commands like pch and xlab are called arguments of the function plot(y~x)). For example, you've already seen pch=19 being added within the brackets of plot(y~x) to produce a filled circle plot character. A list of common high-level commands used for modifying graphs is shown in Table 1; the list is not exhaustive, just some of the commands that are useful for controlling the appearance of graphs. These commands apply not only to the plot() function but also, for example, to the legend() function (Chapter 12, Section 2.4.4). You should recognise commands 1-8, which you've already encountered.

    Table 1. Common high-level graphics settings.

    Item  Setting  Description                                          Example
    1     main     Main graph title                                     main="Histogram"
    2     font     Main title font                                      font.main=2
                   Label font                                           font.lab=1
    3     col      Main title colour                                    col.main=1
                   Label colour                                         col.lab=2
                   Plot character colour                                col=2
    4     cex      Main title text size                                 cex.main=2
                   Label text size                                      cex.lab=1.75
    5     lty      Line type (dotted, dashed etc.)                      lty=3
    6     lwd      Line width                                           lwd=2
    7     bty      Box around graph/legend                              bty="o"
                   No box                                               bty="n"
    8     pch      Plot character type                                  pch=19
    9     xaxp     Number of x axis ticks (first tick, last tick,       xaxp=c(20, 100, 8)
                   number of intervals)
          yaxp     Number of y axis ticks (same form)
    10    xaxs     x axis with padding                                  xaxs="r"
                   x axis without padding                               xaxs="i"
                   (yaxs does the same for the y axis)
    11    mgp      Axis title, axis label and axis line position        mgp=c(4,2,1)
    12    tck      Tick mark length relative to plot size               tck=0.75
                                                                        tck=-0.75
    13    tcl      Tick mark length relative to text size.              tcl=0.75
                   Positive values give tick marks pointing into        tcl=-0.75
                   the plot (upward on the x axis), negative values
                   give tick marks pointing outward (downward)
    14    family   Selects the font                                     family="mono"
                                                                        family="sans"
                                                                        family="serif"
                                                                        family="symbol"
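    Some of the less familiar settings in Table 1 can be tried out in a single plot() command, as in the sketch below. The data values and the particular choices are invented for illustration; only the argument names come from the table.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

plot(
  rub1$loss ~ rub1$hard,
  pch = 19,
  # x axis from 40 to 90 without padding at either end
  xlim = c(40, 90),
  xaxs = "i",
  # x axis ticks from 40 to 90 in 5 intervals
  xaxp = c(40, 90, 5),
  # L-shaped box around the plot area
  bty = "l",
  # Short tick marks pointing outward
  tcl = -0.3,
  # Serif font for the text
  family = "serif"
)
```

    With xaxs="i" the plot area starts exactly at 40 and ends exactly at 90; with the default xaxs="r" R would add a little padding at each end.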

    Low-level plot commands add information or features to an existing plot and are not included as arguments within functions. That means they are used outside of functions like plot() and hist(). Table 2 shows a list of common commands, some of which you've already used.

    Table 2. Common low-level graphics settings.

    Item  Setting  Description                                          Example
    1     abline   Adds a red fitted line                               abline(fit1, col="red")
                   Horizontal line at a specific y                      abline(h=350)
                   Vertical line at a specific x                        abline(v=90)
                   Horizontal & vertical lines at specific x, y         abline(h=350, v=90)
    2     legend   Adds a legend                                        See Chapter 12, Section 2.3
    3     grid     Adds gridlines that match the tick marks             grid(NULL, col="blue", lty=2, lwd=3)
                   Divides the x axis into 14 cells and the y axis      grid(14, 7, col="blue", lty=2, lwd=3)
                   into 7 cells
    4     text     Adds text to the plot area                           text(x, y, "text")
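    The low-level commands in Table 2 are used after the plot exists, as in the sketch below. The data values, the positions of the guide lines and the text are invented for illustration.

```r
# Hypothetical stand-in for the rubber abrasion data (values invented)
rub1 <- data.frame(
  hard = c(45, 55, 61, 66, 71, 75, 81, 86),
  loss = c(372, 206, 175, 154, 136, 112, 55, 45)
)

# High-level function: creates the graph
plot(rub1$loss ~ rub1$hard, pch = 19)

# Low-level commands: each adds something to the existing graph
fit1 <- lm(rub1$loss ~ rub1$hard)
abline(fit1, col = "red")            # fitted line in red
abline(h = 200, v = 70, lty = 3)     # dotted lines at y = 200 and x = 70
grid(NULL, col = "blue", lty = 2)    # gridlines that match the tick marks
text(78, 300, "Fitted line in red")  # text centred at x = 78, y = 300
```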

    4.0 Plot Customization

    4.1. Positioning Axes, Axis Title and Text Labels

    Our plot of rubber abrasion data using the default values assigned by R to the axis position, axis text and axis labels is shown in Figure 3.

    Figure 3. Rubber abrasion plot with default axis parameters.

    The script for this plot is shown below:

    The command we need to change the position of the axis title, axis labels and axis line is:

    mgp=c(axis text line, label line, axis line) m