Hi all. In this video we're going to discuss ggplot2. So what is ggplot2, and why is it so special? ggplot2 is a data visualisation package for R that is freely available online. It is open source, released under the MIT license, and it is one of the most commonly and widely used packages for performing data visualisation at different levels in R.

It is based on the grammar of graphics, and that is where the name ggplot comes from.

Then why do we use it? The first important reason to go with ggplot is aesthetic attributes. To be precise, every visualisation can be composed of aesthetic attributes such as colour, shape and size, applied to geometric objects such as points, lines or bars. In ggplot, these different characteristics are applied as layers.

Powered by upGrad Education Private Limited © Copyright upGrad Education Pvt. Ltd. All rights reserved

But performing these operations across so many layers, such as colour, shape, size, and lines, bars or points, is complicated in other packages. ggplot makes it very straightforward to build even complex visualisation charts.

So that is one reason. The second is multiple layer support. As I pointed out earlier, you build new layers on top of the existing one. Let's say you build a bar chart, then you want to add a colour layer on top of it, and then a size layer on top of that. ggplot makes it easy to perform that operation.

Thirdly, faceting. Faceting is a specification of how to break the data into the required subsets and display each subset in its own panel of the same visualisation.

To summarise again: when we have a lot of data, we split it into subsets and show those subsets side by side in the same visualisation. That is basically faceting. So these three important reasons make ggplot one of the most important data visualisation packages that we have to use.
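To make the three ideas concrete, here is a minimal sketch that combines aesthetic mappings, a layer and faceting in one plot. It uses R's built-in iris data; the choice of columns and of facet_wrap for the faceting step are assumptions for illustration, not something shown in the video.

```r
library(ggplot2)

# Aesthetic mappings go in aes(); layers are added with `+`.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(colour = Species)) +  # a point layer, coloured by species
  facet_wrap(~ Species)                # faceting: one panel per species

print(p)
```

Each `+` adds one more piece to the plot, which is why building up a chart step by step stays readable.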

So, let's get into the basic operations of ggplot. If you're using it for the first time, make sure that you install it with install.packages. Since I have already installed it, I am directly calling library(ggplot2). Now let's see how to request help for ggplot.

If you want to go to the documentation of ggplot, you can use the help function with ggplot2. This will direct you to the documentation of the package itself: who the authors are, the description of the package, and maybe some other reference links.
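The setup steps might look like the sketch below. The transcript does not show the exact help call, so help(package = "ggplot2") is an assumption; it is one standard way to open a package's documentation index.

```r
# Install once per machine (needs an internet connection), then load per session.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Open the package documentation index; guarded so it only runs interactively.
if (interactive()) {
  help(package = "ggplot2")
}
```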

Then, let's go with a simple example. I'm going to use the iris data, which ships with R itself, to perform a simple operation. The function itself is ggplot. Don't get confused here: the package name is ggplot2, but the function name is ggplot, so the two are easy to mix up. So I am calling ggplot with the iris data, and the aesthetic attributes that I am providing go inside aes, using the column names.

If you want to have an idea about the data, I print the data here. We have 150 observations and 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. So now I'm going to give the aesthetic attributes. Let's take Sepal.Length as the first attribute, and then I will go with Petal.Length. I give these two variables as aesthetics, and along with them I'm going to give a colour: colour is Species. So I give all three characteristics, and then I close the bracket.

This will help us build a straightforward chart with all three characteristics: Sepal.Length, Petal.Length and Species. Now I also want to add the type of chart that I want to see.

I am going with a simple scatter plot, geom_point, which produces a simple scatter plot. Since I have given all this information, I would like to store it in a variable, so I just name it irisplot and execute this.

So the chart that I built with ggplot is stored in the variable called irisplot. Whenever I call irisplot, I will be able to see the visualisation. So what I am doing here is printing irisplot.
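The steps described above can be sketched as follows. The x/y assignment (Sepal.Length on x, Petal.Length on y) follows the order in which the columns are mentioned, which is an assumption about the code on screen.

```r
library(ggplot2)

# iris ships with R: 150 observations of 5 variables.
str(iris)

# Data, aesthetic mappings, then a point layer for the scatter plot.
irisplot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# Printing the stored object draws the chart.
print(irisplot)
```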

Now, look here. A simple scatter plot has been created. We have Sepal.Length and Petal.Length plotted against each other on the x-axis and y-axis, and Species shown in three different colours because there are three different species.

So this is how you build a chart or a visualisation using ggplot2 in RStudio. Similarly, changing just this one piece, geom_point, to geom_boxplot will produce a box plot, and changing it to geom_line will build a line chart with the same other characteristics.

Of course, the parameters would change, but a simple change in one line of code changes the chart type itself. This is the power of ggplot.
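As a small illustration of swapping the geom, the sketch below keeps one base plot and changes only the layer added to it. Putting Species on the x-axis is an assumption made here so that the box plot has groups to compare; it is an example of "the parameters would change".

```r
library(ggplot2)

# One shared base: data plus aesthetic mappings.
base <- ggplot(iris, aes(x = Species, y = Petal.Length, colour = Species))

points_version <- base + geom_point()    # individual observations
box_version    <- base + geom_boxplot()  # one box per species

print(box_version)
```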

So here in R, how are we going to build a box plot using ggplot2? First, we have to import the data. This time I'm importing Toyota_data, which has three columns and about 1400 observations. I'm displaying the first five rows so that we can see what the data looks like.

We have three columns: id, price and age. I'm going to build a box plot on the price column, because I want to understand the outliers present in it.

So how I go about it is with ggplot. This time I pass the data as the first argument; data is the actual variable that I use. Then come the aesthetic attributes. Since a box plot is, to be precise, a univariate chart, a single column is enough to produce it.

We don't need multiple attributes or columns to build the chart, so I am going with only one column, which is price. So how do we specify the aesthetics if we don't have multiple columns? The idea is that inside aes, x is simply a label.

I give "price" in double quotes, because it is just text, the name. Then y is the actual column, price, because I want the box plot values on my y-axis, not on the x-axis. On its own this sets up a simple chart with price on the y-axis, but I want a box plot on top of it.

How we get that is the same geom approach: geom_boxplot, which is from ggplot2. I add this so that the chart is a box plot rather than a simple scatter plot or some other distribution plot. I execute this, and now I get to see my box plot in the Plots pane.
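A minimal sketch of this call is shown below. The Toyota data itself is not available here, so the data frame is a made-up stand-in with the same three column names (id, price, age); the values are randomly generated and carry no meaning.

```r
library(ggplot2)

set.seed(42)
# Made-up stand-in for the Toyota data: mostly mid-range prices plus a few high ones.
data <- data.frame(
  id    = 1:200,
  price = c(rnorm(190, mean = 10000, sd = 1500), runif(10, 20000, 30000)),
  age   = sample(1:80, 200, replace = TRUE)
)
head(data, 5)

# x is a fixed text label; y is the column whose distribution we want to see.
price_box <- ggplot(data, aes(x = "price", y = price)) +
  geom_boxplot()

print(price_box)
```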

Now, what does this mean? The line in the centre of the box is the median, and the edges of the box are the lower and upper quartiles. The whiskers extend below and above the box, and any point below the lower whisker or above the upper whisker is an outlier.

Here, there are no data points below the lower whisker, so there are no outliers or abnormal points on the lower side. But on the upper side, there are many values above the upper whisker, so all of those data points can be considered outliers. So this is how you can build a box plot with the help of ggplot and then interpret it.
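For reference, the rule behind the whiskers can be sketched by hand. By default, geom_boxplot draws each whisker to the most extreme point within 1.5 times the interquartile range of the box edge, and anything beyond that is drawn as an outlier. The numbers below are made up for illustration.

```r
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# Box edges: first and third quartiles.
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Points beyond these fences fall outside the whiskers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```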

A histogram is an important chart type for understanding how a numeric variable is distributed. A histogram works in terms of two axes. One axis shows the bins that we build on the data, and the other axis shows the frequency of the values in each bin.

Let's say you have a numeric age column. Ages between 15 and 30 become bin 1, 30 to 45 bin 2, and 45 to 60 bin 3. The other axis then shows the frequency of each of those bins. It's as simple as that; this is what a histogram is. Now, let's get into R to try this.
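The binning idea can be tried directly in base R before drawing anything. This is a small sketch with made-up ages; cut and table are used here only to show the counting that a histogram does internally.

```r
ages <- c(16, 22, 29, 31, 44, 47, 52, 59)

# Assign each age to one of the three bins: 15-30, 30-45, 45-60.
# By default each bin includes its upper edge, so an age of exactly 30 lands in the first bin.
bins <- cut(ages, breaks = c(15, 30, 45, 60))

# Frequency of each bin: this is what the bar heights of a histogram show.
counts <- table(bins)
counts
```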

Here is how we use a histogram: the x-axis carries the bins that are created, or the bin size that we provide, and the y-axis shows the frequency. Let's see how it works in R. I'm importing the data, and to get a quick understanding of it, I'm displaying the top five observations.

So this is how the data is listed: id, price and age. I'm going to choose one of these three columns, price, because I want to understand the price distribution in this data. I start with ggplot: I pass the data, and the aesthetic attribute I pass is price, because this is a price distribution. Then, after creating the basic chart, I define what type of chart it is: geom_histogram.

This builds a basic histogram. The message says that, without a bin value, it defaults to bins = 30. That means the price range is divided into 30 bins; it does not mean that each bin is 30 wide. What I'm going to change is that, rather than using 30 bins, I want to define the bin width.

I don't want to limit it to 30 bins; I want to go with a bin width, and I am saying 500. In other words, each bin should span 500 in price: 0 to 500, 500 to 1000, 1000 to 1500, and so on.

After defining this, the chart changes, because every bin now represents a price range of only 500.
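A sketch of the two histogram calls follows. The data frame is a made-up stand-in for the price column, and boundary = 0 is an addition not mentioned in the video: without it, the bin edges do not necessarily fall on multiples of 500.

```r
library(ggplot2)

set.seed(42)
# Made-up stand-in for the price column.
data <- data.frame(price = round(runif(500, min = 4000, max = 30000)))

# Default: ggplot2 falls back to 30 bins and prints a message saying so.
hist_default <- ggplot(data, aes(x = price)) +
  geom_histogram()

# Fixed bin width of 500; boundary = 0 lines the edges up as 0-500, 500-1000, ...
hist_500 <- ggplot(data, aes(x = price)) +
  geom_histogram(binwidth = 500, boundary = 0)

print(hist_default)
print(hist_500)
```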

Now, trend analysis. In brief, trend analysis is when we want to understand how a variable is spread across time: how it actually evolves over different time frames. Let's start by building a line chart.

I'm setting the working directory, and now I have the working directory path. Now I'm importing the data. Let's get a quick understanding of the data by displaying the first five rows. You can see that it has many columns: to be precise, we have 20 variables and 3312 observations, that is, rows.

Always keep this in mind: in order to build a line chart, go with one date variable, which makes it easy to understand the trend over time. We have order date, so I'm going to choose order date, and then I'm going to use sales.

So there is a sales column and an order date column, and I am going to use these two columns to build the line chart. Since order date is available, I want to make sure that it is of the date data type.

So I run str(data), which gives the structure of the data. You can see that order date is of the character data type, not the date data type, so I'm going to change it to a date. How do I change it?

I take the order date column of data and convert it with as.Date, the function that converts it to an actual date. The input I give is the same column, plus the format argument, where I specify what the format is.

You can see that order date is written as month, day and year, so I apply the same format here so that the machine can understand it: %m for the month, %d for the day and the year code (%Y for a four-digit year), separated by slashes. This changes the data type to a date. I execute it and check the structure again. Now we can see that order date is a Date. Perfect.
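The conversion could look like the sketch below. The column name Order.Date, the four-digit year and the sample values are all assumptions; the transcript only says the column is "order date" written as month, day and year.

```r
# Made-up rows; the date column starts out as text.
data <- data.frame(
  Order.Date = c("01/15/2017", "04/03/2017", "11/20/2017"),
  Sales      = c(120.5, 340.0, 89.9),
  stringsAsFactors = FALSE
)
str(data)  # Order.Date is chr

# Tell as.Date how the text is laid out: month/day/four-digit year.
data$Order.Date <- as.Date(data$Order.Date, format = "%m/%d/%Y")
str(data)  # Order.Date is now Date
```

If the year in the file has only two digits, the format code would be %y instead of %Y.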

Now, let's build the line chart. I'm going to use ggplot2, so I'm calling library(ggplot2). Then I build the chart with ggplot: I pass the data, and then of course the aesthetic variables, since we need one on the x-axis and one on the y-axis.

On the x-axis I want order date, and on the y-axis I want profit, or maybe sales. As soon as I give this, I add a simple line chart the same way that we have seen with the other charts: it's geom_line. For the line, I'm also specifying what colour I want. That is just an additional parameter; if you don't want it, that's absolutely fine. So I mention color = 'red'. Now I execute this and, look at it, I have a line chart created.
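Put together, the line chart call might look like this sketch. The data frame is made up, and the column names Order.Date, Sales and Profit are assumed spellings of the columns the speaker mentions.

```r
library(ggplot2)

# Made-up rows with a proper Date column.
data <- data.frame(
  Order.Date = as.Date(c("01/15/2017", "04/03/2017", "07/22/2017", "11/20/2017"),
                       format = "%m/%d/%Y"),
  Sales  = c(120.5, 340.0, 150.25, 410.0),
  Profit = c(30.0, 95.5, -12.0, 80.0)
)

# Date on x, measure on y, then a line layer; color is optional.
sales_line <- ggplot(data, aes(x = Order.Date, y = Sales)) +
  geom_line(color = "red")
print(sales_line)

# Swapping the y variable is a one-word change.
profit_line <- ggplot(data, aes(x = Order.Date, y = Profit)) +
  geom_line(color = "red")
print(profit_line)
```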

Here the y-axis is sales. You can see that sales increases at certain times, somewhere around April; it then comes back to the normal level for the other months, and during November and December it increases again. So this is how sales behaves. Now, let's replace sales with profit for a different view.

Now, look at it. Profit is high during April, but somewhere around May there is actually a loss. Similarly, we are able to see the ups and downs for every month. So this is how you build a line chart and understand the trend over a period.

Hi all. In this video we're going to talk about variable relationships with respect to correlation. So what is a variable relationship? It's self-explanatory: when we think about a relationship between variables, we refer to it as a variable relationship. But how do we support it mathematically or statistically with our data analysis skills? This is where correlation analysis comes into play.

To be precise, think about it. I can say that the number of units of a product sold is highly correlated with its profit. Straightforward, right? But how can we back that up? We compare the number of units sold against the profit achieved, and we get a sense of how strongly they are correlated. But is there any statistical metric to support it?

Yes, there is something called the Pearson correlation, which gives us a value in the range -1 to 1, where -1 refers to a perfect negative correlation and +1 refers to a perfect positive correlation. What are negative and positive correlation? Whenever two variables trend in exactly opposite directions, we refer to that as negative correlation.

To be precise, let's say you give a 2% discount, so you have higher sales. If you increase the discount to 5%, the sales keep increasing. If you give a 20% discount, they keep increasing, but looked at from the other perspective, the profit keeps going down.

Whenever the discount increases, the profit goes down. So the two are correlated, but negatively correlated. Discount and sales, on the other hand, are positively correlated: increase the discount, and sales also increase. But increasing the discount results in decreasing profit.

So when both variables increase in the same direction, that is positive correlation. When one increases while the other decreases, that is negative correlation. This is how you can determine the actual correlation between the variables, and this is what we refer to as a variable relationship. Now, let's get into R to see how it works.
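The discount example can be checked with a few made-up numbers. The sketch below writes out the Pearson formula by hand in a helper called pearson (a hypothetical name used only here) and compares it with R's built-in cor.

```r
# Made-up figures: as the discount rises, sales rise and profit falls.
discount <- c(2, 5, 10, 15, 20)
sales    <- c(100, 120, 150, 170, 200)
profit   <- c(50, 45, 35, 28, 15)

# Pearson correlation: covariation of the two variables divided by
# the product of their individual spreads.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_sales  <- pearson(discount, sales)   # positive: same direction
r_profit <- pearson(discount, profit)  # negative: opposite direction
c(sales = r_sales, profit = r_profit)

# The hand-written version agrees with the built-in function.
all.equal(r_sales, cor(discount, sales))
```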

We're going to use correlation analysis and a scatter plot to understand the relationship between the variables. Let's start with importing the data. To import the data, I set my working directory; now that it is set, I import the data.

Let's get a quick understanding of the data. You can see we have three columns: id, price and age. Since price and age are both numeric, I'm going to use these two columns to build a simple scatter plot and run the correlation analysis. Let's start with the scatter plot.

In the same way as before, to build a scatter plot I'm using the ggplot2 package and its ggplot function. I pass the data, then the aesthetic variables. I'm going to give two variables, one for the x-axis and one for the y-axis, because a scatter plot works on both axes.

We have two variables, age and price, so for the x-axis I give age and for the y-axis I pass price. Then, the same way that we built the other charts, I add the geom for a scatter plot, which is geom_point.

geom_point is the function that helps us build a simple scatter plot. Now that we've given this, I execute it, and you can see that a scatter plot is built.
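A sketch of this scatter plot is given below. The data frame is randomly generated so that price tends to fall as age rises; it only imitates the shape of the data in the video and uses the same column names (id, age, price).

```r
library(ggplot2)

set.seed(42)
# Made-up stand-in: older cars tend to be cheaper, plus some noise.
age  <- round(runif(200, min = 40, max = 120))
data <- data.frame(
  id    = 1:200,
  age   = age,
  price = 32000 - 220 * age + rnorm(200, sd = 1500)
)

# Two aesthetics, one per axis, then a point layer.
age_price <- ggplot(data, aes(x = age, y = price)) +
  geom_point()

print(age_price)
```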

Now have a look and try to understand it. The x-axis is age and the y-axis is price. What you see now is a bivariate analysis. Think about it: in outlier analysis using a box plot, we use only one column at a time to understand the outliers, but here there are two different variables.

Now look at it: which points would you consider outliers? Will you consider these two points outliers just because they sit outside the bulk of the data? Or this point at the top? Or this other point?

This is how you try to understand the relationship between the variables. Now, which one is actually an outlier? This one is a potential outlier. Why? Think about it: in terms of age alone, it is not an outlier, because age ranges from somewhere around 40 to somewhere around 120, and this point sits somewhere between 80 and 90. So it does not look like an outlier in a univariate view; if you draw a box plot, it will not be flagged. In the same way, price ranges from somewhere around 5000 to somewhere around 30000, and this point is around 20000, in the middle, so a box plot on price would not classify it as an outlier either.

But taking the two axes together, age and price, this point is away from the actual trend. Think about it: I can draw a straight line through the data here, but that line is not close to this point. So this is a potential outlier. How about these two points?

They do not fall under a different distribution at all, because they follow the same trend line. If I draw a straight line through the data, it goes well with the distribution, and these data points also fit that line. That is why I don't consider them outliers.

How about this point, with a very low age and a very high price? This should be one more potential outlier, like the previous one. But those two data points are not outliers, because they obey the existing trend of the other data points. That is how you understand the variable relationship.

Now, beyond the scatter plot, we have something called correlation analysis. Correlation is nothing but two variables moving together. In simple terms, either both variables travel in the same direction, or they travel proportionally in opposite directions.

If they travel in the same direction, that is positive correlation. If they travel in opposite directions to each other, that is negative correlation. Think about it. Suppose the number of cars sold in India per year is highly correlated with the number of divorces in India.

Does that mean that people are getting divorced and that is why they are buying a new car? We cannot say that. The two are highly correlated because the number of divorces is increasing every year and, at the same time, the number of car sales is also increasing. Both travel in the same direction; that gives correlation, but not causation.

So divorces are not causing car sales, and car sales are not causing divorces, but the two are correlated. This is how you have to understand it, and then you have to judge for yourself whether a correlation between variables is meaningful to use or not. Now, for this case, let's use price and age to compute a simple correlation.

Now I'm going to use the function cor and pass both of these variables: data$Price and data$Age. The correlation is -0.866, which is a strong negative correlation: whenever the age increases, the price tends to decrease.

Think about a used car: as its age increases, its price keeps dropping. That is exactly what this correlation is telling us. This is one way to do it. If you have a large data frame, instead of passing these two variables you can pass the entire data frame to the cor function.

This builds a correlation matrix. In it, you have all three columns against the same three columns. That is why the diagonal values will always be 1: id against id, price against price and age against age are each perfectly correlated.

The values apart from the diagonal are the ones you have to consider. Price and id are correlated at -0.73, but since id does not make sense in any analysis, we don't consider it. If we consider the rest, price and age are correlated at -0.8669, and the same value appears in the mirrored cell. So this is how you can build a simple correlation matrix on a data frame. If you want to round the numbers, wrap the call in round: round of cor of data, and I am rounding to maybe one decimal.
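The three cor calls described here can be sketched as follows. The data frame is randomly generated, so the numbers will not match the -0.866 from the video; the capitalised column names Price and Age follow the data$Price and data$Age spelling used in the call, which is an assumption about the real file.

```r
set.seed(42)
# Made-up stand-in: price falls as age rises, plus noise.
Age  <- round(runif(200, min = 40, max = 120))
data <- data.frame(
  id    = 1:200,
  Price = 32000 - 220 * Age + rnorm(200, sd = 1500),
  Age   = Age
)

# Correlation between two columns: a single number between -1 and 1.
r <- cor(data$Price, data$Age)
r

# Correlation matrix: every column against every column.
cm <- cor(data)
cm

# Rounded to one decimal for easier reading.
round(cm, 1)
```

The matrix is symmetric, so each pair of variables appears twice, once on each side of the diagonal.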

So this would be the correlation value. To help you interpret it: the correlation value ranges between -1 and 1. This measure was proposed by Pearson, so we call it the Pearson correlation. -1 refers to a perfect negative correlation and +1 refers to a perfect positive correlation.

The correlation that we don't want is somewhere close to 0. If the variables are negatively correlated, we get an inference; if they are positively correlated, we also get an inference. We don't get an inference when the variables are not correlated at all, and that is when the value is close to 0.

So yeah guys, this is what correlation analysis is, and this is how you can perform a variable relationship analysis altogether, using a scatter plot and correlation analysis.

Disclaimer: All content and material on the upGrad website is copyrighted, either belonging to upGrad or its bonafide contributors and is purely for the dissemination of education. You are permitted to access, print and download extracts from this site purely for your own education only and on the following basis:

● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disk or to any other storage medium, may only be used for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document herein or the uploading thereof on other websites, or use of the content for any other commercial/unauthorised purposes in any way which could infringe the intellectual property rights of upGrad or its contributors, is strictly prohibited.
● No graphics, images or photographs from any accompanying text in this document will be used separately for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or upGrad content may be reproduced or stored in any other website or included in any public or private electronic retrieval system or service without upGrad’s prior written permission.
● Any right not expressly granted in these terms is reserved.
