case study: stock price prediction

© Copyright 2019. UpGrad Education Pvt. Ltd. All rights reserved

In this video, we will discuss a case study based on linear regression. I hope you guys are now familiar with the topic

and excited to learn its application in the real world. Linear regression is a very powerful statistical tool and it is used

everywhere from scientific research teams to stock markets.

I have listed some popular applications of linear regression here.

• For example, if a company's sales have increased steadily every month, then by conducting a linear analysis on historical monthly sales data, the company could forecast sales in future months.

• Another application of linear regression is in fraud detection. Consider an e-commerce company that wants to identify fraudulent transactions on its website. It would be interested in detecting early signs of fraud, and linear regression can be useful in knowing how different variables, like the size of the order, the number of transactions, whether the customer is a first-time shopper or not, the shipping address, etc., can predict the probability of a fraud.

• It can also be used to analyse marketing effectiveness. For example, if a company wants to know how different

channels of advertising have impacted its sales, they can run a regression to establish the relationship between

sales and different marketing strategies like sales versus newspaper ads, sales versus Instagram promotion,

sales versus free distribution of samples, and so on. This should sound familiar to you, as we also looked at a similar example in the lectures.

• Finally, we also have quantitative finance. Regression is often used by financial managers in analysing investment returns and valuing different assets.

Coming to investments, let us move to a case study which discusses this topic in detail. We will be discussing stock return prediction through a practical example. Stock price prediction, or stock return prediction, is an attempt to determine the future value of a company's stock based on an analysis of the factors which impact its price movement.

Predicting how the stock market will perform is one of the most difficult things to do. If you look around, you'll find

hundreds of people who are trying to predict the market or stock prices every day, and the reason why this is so

difficult is because there are so many factors involved in the prediction, there are physical factors, psychological

factors, behavioural factors, etc.


Some of the factors are listed here:

o They can be macroeconomic factors like the state of the country's economy, growth rate, inflation, etc. For example, if a country as a whole is doing well, chances are that firms in that country also have superior returns.

These macroeconomic factors generally impact a large number of stocks together.

o There are also other factors which are more specific to a stock, like profit margin, debt-to-equity ratio, sales of a company, and so on. These factors are indicative of the strength of the underlying business.

o And we also have corporate actions like mergers and acquisitions, dividend announcements, and changes in senior management. These factors also directly impact a company's stock.

o Then you also have soft factors like investor sentiment, that is, whether the investors on average are bullish or bearish on a particular stock. If the price of a particular stock is expected to keep rising, investors are said to be bullish, but if price falls are expected, sentiment is bearish.

o This is also not an exhaustive list. There can be many other factors like socioeconomic conditions, new product

launch, brand values, and so on.


All these factors combine to make share prices volatile and very difficult to predict with a high degree of accuracy.

Some of these factors are also behavioural and not easy to quantify.

In such scenarios, it is not possible to predict the stock price with very high accuracy and portfolio managers generally

settle for a reasonable accuracy. They also don't expect to be right every day and are more interested in predicting

returns over a period of time.

Consider a portfolio manager who has built a model for a particular stock. His model could be based on daily data, and based on this model, he has predicted returns for, say, the next 10 days.

He may not be right on each day, but even if he can somewhat match the predicted cumulative returns for the next

10 days with the actual returns, he has done a decent job. So, the objective of stock price prediction is often not to

predict each data point, but rather the return over a period of time.

So, how does linear regression fit into this picture of stock return predictions? Let's see that through a real-world

example of a financial analyst who wants to predict the returns of ABC Limited, which is an FMCG company listed on the Bombay Stock Exchange. He also wants to prepare an analysis of the factors which impact ABC Limited's returns.


We have the data set here. Let's take a look. The data starts from 2007 and goes till 2019, so we have approximately

13 years of data. We have the daily returns of ABC, or the daily change in the price of ABC, in column B. Next, we have the daily return on Sensex in column C and the daily return on Nifty in column D.

Sensex and Nifty are the two main stock indices used in India. They are benchmark Indian stock market indices that represent the weighted average of the largest Indian companies. Sensex represents a weighted average of the 30 largest and most actively traded Indian companies. Similarly, Nifty represents a weighted average of the 50 largest Indian companies.

Another variable is dividend announcement in column E, which is one if the company has announced a dividend on a particular date and zero otherwise. So, for example, it is one on January 2, 2007, because the company ABC announced a dividend on this date, and it is zero for all other days when the company did not announce any dividend. Notice that this is a dummy variable.

Lastly, we have a sentiment variable in column F. It is a sentiment score which quantifies how investors feel about

ABC. It can be based upon news analysis, upon option market analysis, or on some survey. We will not go into the details of the score here and will take it as given. A very high sentiment score represents bullish investors and vice versa.

Also note that the independent variables, which are Sensex, Nifty, dividend announcement and sentiment, are all lagged by a day. So, for example, when we are trying to predict the return on January 2, we will use the previous day's values of these variables.
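The data set in the case study already comes lagged, but it may help to see how such a one-day lag could be built. Here is a minimal sketch in R on made-up numbers. The column names and the helper lag_one are assumptions for illustration and are not taken from the actual file.

```r
# Hypothetical toy data: column names and values are illustrative only.
data <- data.frame(
  date      = as.Date("2007-01-01") + 0:4,
  ABC       = c(0.010, -0.004, 0.006, 0.002, -0.001),
  Nifty     = c(0.005, 0.003, -0.002, 0.004, 0.001),
  Sentiment = c(0.20, 0.10, -0.30, 0.40, 0.00)
)

# Hypothetical helper: shift a vector down by one row, so that
# row t holds the value observed on row t - 1.
lag_one <- function(x) c(NA, head(x, -1))

lagged <- data
lagged$Nifty     <- lag_one(data$Nifty)
lagged$Sentiment <- lag_one(data$Sentiment)

# The first row has no previous day, so we drop it.
lagged <- lagged[-1, ]
print(lagged)
```

After this step, each row pairs the return of ABC on a given day with the predictor values from the day before, which is the layout the regression in this case study assumes.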


Let us plot the target variable to see how it's shaping up in the data. So, here's the return of ABC, which is the target variable, or Y.

We also have a cumulative plot of Y. As discussed earlier, we will also be comparing actual cumulative returns with estimated cumulative returns to judge our model.
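The lecture does not say how the cumulative return is computed, so the sketch below shows two common conventions on made-up daily returns: a simple running sum and a compounded return. Which one was used for the plot is an assumption we cannot confirm from the transcript.

```r
# Hypothetical daily returns of ABC, expressed as decimals (0.010 = 1%).
abc <- c(0.010, -0.004, 0.006, 0.002, -0.001)

# Convention 1 (assumption): cumulative return as a running sum.
cum_simple <- cumsum(abc)

# Convention 2 (assumption): cumulative return with daily compounding.
cum_compound <- cumprod(1 + abc) - 1

print(data.frame(day = seq_along(abc), abc, cum_simple, cum_compound))
```

For small daily returns the two series stay very close, so either one is usually fine for eyeballing how well predicted returns track actual returns over a period.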


Before we begin with the linear regression, we will split our data set into two parts: a training set and a test set. As the

name suggests, the training data set is used to build or train the linear regression equation, and it is then tested on

the test set, which would give an idea of our model’s accuracy.

In this case, our training set would be 2007 to 2017, and we will only use this data to decide the variables we want to

include and then estimate the coefficients of the linear equation. We then use this equation to predict the returns of 2018 and 2019 and compare the results with the actual observed returns. This is also realistic because you will build your model

using historical data and make predictions for the future.

Let's first start with the simple linear regression. We will use only one variable, which is Nifty here, to predict the returns of ABC. So, in the equation, on the left-hand side we have the return on ABC, and on the right-hand side we have alpha plus beta times the return on Nifty.

This equation for the relationship between the return on a stock and the return on an index also holds an important meaning in the theory of finance. It is borrowed from the capital asset pricing model, or CAPM as it is popularly called.

As complicated as it may sound, CAPM is actually just a single factor linear regression model. The CAPM assumes only

one factor, which is the market return or market risk premium to explain the return on a stock. In practice, we typically

proxy the market with a broad index. In this case, we have used Nifty, and the dependent variable is the expected return on a stock.

The alpha in CAPM is the risk-free rate of return, or the minimum return an investor can expect in the absence of any risk. The risk here is represented by the market return or market risk premium. On average, the higher the risk, the higher the market return or market risk premium.

And the sensitivity to this market risk is estimated by beta. If beta is greater than 1, it means that the stock is more volatile than the market, and if beta is less than 1, we can say that the stock on average is less volatile than the market. If you're interested, you can read more about the CAPM model. It's widely used in the world of finance.


We now move to estimating the parameters of this model using R. So, we start by reading the data. We use the read.csv command, and we can look at the first few rows with head(data). Now, because we have a date column, remember how we change the date format in R using the as.Date function, so we will use the as.Date function here.
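As a rough sketch of this step, the snippet below reads a tiny made-up CSV and converts its date column. The file contents, the column names and the day-month-year format are all assumptions, since the transcript does not show the actual file. With a real file you would pass its path to read.csv instead of the text argument.

```r
# Hypothetical CSV contents; the real file and its columns are not shown
# in the lecture, so these names and values are illustrative only.
csv_text <- "date,ABC,Sensex,Nifty,Dividend,Sentiment
02-01-2007,0.012,0.004,0.005,1,0.30
03-01-2007,-0.003,0.001,0.002,0,0.10
04-01-2007,0.007,-0.002,-0.001,0,-0.20"

# read.csv accepts inline text, which keeps this example self-contained.
data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(head(data))

# Convert the character column to Date; the format string is an assumption.
data$date <- as.Date(data$date, format = "%d-%m-%Y")
str(data)
```

Once the column is of class Date, comparisons such as "before 31st December 2017" work as expected, which is what the train and test split relies on.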

Next, as discussed, we want to split our data into training and test data sets. The train data set would have all the data points up to the end of 2017. We have used the subset function here: data such that data$date (the dollar sign is used to read a column in R) is less than 31st December 2017. The test data would be data such that data$date is greater than 31st December 2017.
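The split described above could look roughly like this. The data frame is a made-up four-row example, and the names data, date, train and test simply mirror what is said in the lecture.

```r
# Hypothetical data straddling the cut-off date.
data <- data.frame(
  date = as.Date(c("2017-12-28", "2017-12-29", "2018-01-01", "2018-01-02")),
  ABC  = c(0.004, -0.002, 0.006, 0.001)
)

cutoff <- as.Date("2017-12-31")

# Rows strictly before the cut-off go to train, rows after it go to test.
train <- subset(data, data$date < cutoff)
test  <- subset(data, data$date > cutoff)

print(train)
print(test)
```

Because the split is by date rather than at random, the test set lies entirely in the future relative to the training set, which matches how the model would be used in practice.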

So, now we are ready to build our equation on the train set. We will use the lm function. So, SLR would be lm, where the dependent variable is ABC, or the return on ABC, and the independent variable here is Nifty. Data equals train, and we can print the summary of SLR. We can see that the beta of Nifty is 0.39, and it is statistically significant, as the t value is 18.75 and the p value is very low.
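Here is a minimal sketch of that lm call. Since the case-study file is not available, the returns below are simulated with a true beta of 0.4, so the numbers printed will not match the 0.39 and 18.75 quoted in the lecture. The variable names are assumptions.

```r
set.seed(1)
n <- 500

# Simulated (not real) daily returns: ABC depends on Nifty with beta 0.4.
Nifty <- rnorm(n, mean = 0, sd = 0.01)
ABC   <- 0.0002 + 0.4 * Nifty + rnorm(n, sd = 0.01)
train <- data.frame(ABC, Nifty)

# Simple linear regression: ABC = alpha + beta * Nifty + error.
slr <- lm(ABC ~ Nifty, data = train)
print(summary(slr))

# Pull out the estimated beta and its t value from the coefficient table.
beta   <- coef(slr)["Nifty"]
t_beta <- summary(slr)$coefficients["Nifty", "t value"]
```

The summary output lists the estimate, standard error, t value and p value for the intercept and for Nifty, followed by R square and adjusted R square, which are the figures discussed next.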


Let's summarize the results of simple linear regression.

1. The beta of the stock, or the coefficient of market return, is 0.39. This beta is less than 1. It indicates that the stock is less volatile than the market, and we can say that its price is steadier than most of the stocks.

2. The t-stat of this beta is 18.75, and the p value is less than 5%. With such a high t-stat, we would reject the null that beta is 0, and we can say that Nifty is statistically significant at the 5% level of significance.

3. Finally, the regression has an R square of 0.109, or approximately 11%. What it means is that only 11% of the variation in Y, or ABC, is being explained by variation in Nifty.

We can check out the plot here. The blue line is the regression line, which shows the estimated Y for each X, that is, the estimated ABC return corresponding to each Nifty return. You can see that many of the data points are far from the regression line, that is, the actual points are far from the predicted points.

This is because the regression equation can only explain about 11% of the variation. We would need to include more factors to estimate a better line of fit. So, we include more variables from the data set and try to fit a multiple linear regression model.


So, our new equation is return on ABC equals alpha plus beta1 Nifty plus beta2 Sensex plus beta3 dividend announced plus beta4 sentiment. Let's try to estimate the parameters of this equation in R.

So, in R, we will again use the lm function. So, now a new model, MLR, is equal to lm, and we again have ABC as our dependent variable, and the independent variables are now Sensex, sentiment, Nifty and dividend announced. The data set is again train.
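A sketch of the four-variable model is shown below on simulated data. The two index series are deliberately built from a common market factor so that they move together, and the coefficients used to generate ABC are arbitrary assumptions, so the output will not reproduce the lecture's numbers.

```r
set.seed(42)
n <- 1000

# Simulated predictors: Sensex and Nifty share a common market factor.
market    <- rnorm(n, sd = 0.010)
Sensex    <- market + rnorm(n, sd = 0.002)
Nifty     <- market + rnorm(n, sd = 0.002)
Sentiment <- rnorm(n, sd = 0.020)
Dividend  <- rbinom(n, size = 1, prob = 0.02)   # dummy: 1 on announcement days

# Simulated target; the generating coefficients are arbitrary.
ABC   <- 0.5 * Sensex + 0.1 * Sentiment + 0.02 * Dividend + rnorm(n, sd = 0.010)
train <- data.frame(ABC, Sensex, Nifty, Sentiment, Dividend)

# Multiple linear regression with all four independent variables.
mlr <- lm(ABC ~ Sensex + Sentiment + Nifty + Dividend, data = train)
coef_table <- summary(mlr)$coefficients
print(round(coef_table, 4))
```

The coefficient table has one row per term, including the intercept, and the t value and p value columns are what we read to judge significance.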

So, we can check the results using summary(MLR). The new variables Sensex, sentiment and dividend announced are all significant. They have very high t values and very low p values, but notice that Nifty is insignificant in this regression. Its t value is very low, and we would not reject the null that the beta of Nifty is zero.

However, in the earlier regression, where we only had Nifty, we saw that it was very significant with a very high t value. So, what happened now? Why has adding more variables made Nifty insignificant?

Do you guys remember the problem of multicollinearity? Could we have that here? Can our independent variables be correlated among themselves? Let's check it by looking at the correlation matrix.


So, we can use the cor function in R to see if there is any correlation between the independent variables, and we can draw the correlation matrix. We have ignored dividend announced because it is a dummy variable.

So, this is the correlation matrix, and you can see that the correlation between Nifty and Sensex is quite high. It is 0.8, or 80% correlation. Apart from these two, there is not a very high correlation between any other pair of variables, but the high correlation between Nifty and Sensex indicates that the two indices move together.
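The correlation check can be sketched as follows. The data are simulated so that the two indices share a common factor, which means the printed correlations are illustrative and will differ from the 0.8 seen in the lecture.

```r
set.seed(7)
n <- 1000

# Simulated returns; Sensex and Nifty are driven by the same market factor.
market    <- rnorm(n, sd = 0.010)
Sensex    <- market + rnorm(n, sd = 0.002)
Nifty     <- market + rnorm(n, sd = 0.002)
Sentiment <- rnorm(n, sd = 0.020)
ABC       <- 0.5 * Sensex + 0.1 * Sentiment + rnorm(n, sd = 0.010)
train     <- data.frame(ABC, Sensex, Nifty, Sentiment)

# Pairwise correlations; the dummy variable is left out, as in the lecture.
cor_matrix <- cor(train[, c("ABC", "Sensex", "Nifty", "Sentiment")])
print(round(cor_matrix, 2))
```

The matrix is symmetric with ones on the diagonal, so it is enough to read the entries below the diagonal when looking for highly correlated pairs.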

We can expect them to be correlated, as both of them represent the market return. Thus, we can conclude that we have the problem of multicollinearity in our regression, and the presence of multicollinearity has impacted our t-tests, which made Nifty insignificant. Nifty was otherwise significant when tested in isolation.
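To see this effect in isolation, here is a small simulated experiment. It fits Nifty alone and then Nifty together with a highly correlated Sensex, and compares the standard error of the Nifty coefficient. The data-generating setup is an assumption made for illustration and is not the case-study data.

```r
set.seed(123)
n <- 1000

# Two index series that move together, and a target driven by one of them.
market <- rnorm(n, sd = 0.010)
Sensex <- market + rnorm(n, sd = 0.002)
Nifty  <- market + rnorm(n, sd = 0.002)
ABC    <- 0.5 * Sensex + rnorm(n, sd = 0.010)
train  <- data.frame(ABC, Sensex, Nifty)

# Coefficient tables for Nifty alone and for Nifty alongside Sensex.
alone    <- summary(lm(ABC ~ Nifty, data = train))$coefficients
together <- summary(lm(ABC ~ Nifty + Sensex, data = train))$coefficients

se_alone    <- alone["Nifty", "Std. Error"]
se_together <- together["Nifty", "Std. Error"]
t_alone     <- alone["Nifty", "t value"]
t_together  <- together["Nifty", "t value"]

cat("Std. error of Nifty alone:      ", se_alone, "\n")
cat("Std. error of Nifty with Sensex:", se_together, "\n")
cat("t value alone vs together:      ", t_alone, t_together, "\n")
```

When two regressors carry almost the same information, the model cannot tell their effects apart, so the standard errors grow and the t values shrink, even though each variable looks significant on its own.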

So, how do we get rid of multicollinearity? One way is to include only one of the correlated variables. Instead of including all the correlated variables in our regression, we would include only one of them.

So, in the above case, since both Nifty and Sensex move together, it makes sense to just include one of them in our regression. And if you see the correlation matrix here, check out the correlation between ABC and Sensex and between ABC and Nifty. The correlation between ABC and Sensex is 0.42, whereas the correlation between ABC and Nifty is 0.33. Since the correlation between ABC and Sensex is higher, we'll include Sensex and drop Nifty from our regression model.


So, our new equation would be return on ABC is equal to alpha plus beta1 Sensex plus beta2 investor sentiment plus beta3 dividend announced. So, let us estimate the parameters of this new equation.

Now, in the final equation, we have dropped Nifty and regressed ABC on Sensex, sentiment and dividend announced. The data is equal to train. Notice that till now we have only used the train data set, even for deciding on the variables. We have not touched the test data. We only use it to test the accuracy of our model.

So, let’s run it and check the summary MLR. So, in the new equation, we can see that all the independent variables are

statistically significant. They all have high T values and low P values.

Let's interpret each of these coefficients. The beta1 coefficient, the coefficient of Sensex, is 0.49. If the return on Sensex increases by 1%, the return on ABC would on average increase by 0.49%, or about half a percent. The coefficient of sentiment is 0.109. So, we can say that if the sentiment score changes by one unit, on average the return on ABC would change by 0.109, or about 0.11.

Dividend announced is a dummy variable, so how would we interpret the coefficient of dividend announced, which is 0.02? It means that on average, the return on ABC is 2% higher on days after a dividend is announced versus on other days.
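These interpretations can be checked with a little arithmetic. The sketch below plugs the three quoted coefficients into the fitted equation and changes one input at a time. The intercept is not reported in the lecture, so it is set to zero here purely as a placeholder, and returns are assumed to be expressed as decimals (0.01 means 1%).

```r
# Coefficients quoted in the lecture; the intercept is NOT reported there,
# so zero is used as a placeholder (it cancels out in the differences).
coefs <- c(intercept = 0, Sensex = 0.49, Sentiment = 0.109, Dividend = 0.02)

# Hypothetical helper: predicted return on ABC for given inputs.
predict_return <- function(sensex, sentiment, dividend) {
  unname(coefs["intercept"] + coefs["Sensex"] * sensex +
           coefs["Sentiment"] * sentiment + coefs["Dividend"] * dividend)
}

base <- predict_return(sensex = 0, sentiment = 0, dividend = 0)

# Change one input at a time and look at the change in the prediction.
effect_sensex    <- predict_return(0.01, 0, 0) - base  # Sensex up by 1%
effect_sentiment <- predict_return(0, 1, 0) - base     # sentiment up by one unit
effect_dividend  <- predict_return(0, 0, 1) - base     # dividend dummy switched on

print(c(sensex = effect_sensex, sentiment = effect_sentiment,
        dividend = effect_dividend))
```

Under these assumptions, a 1% rise in Sensex moves the prediction by 0.0049, that is 0.49%, while switching the dividend dummy on moves it by 0.02, that is 2%.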


Further, the R square is 0.34, that is, 34% of the variation in Y is explained by variation in the independent variables, or 34% of the variation can be explained by the regression equation. The F statistic is also quite high. It is 507.3.

Remember, the F statistic corresponds to the null hypothesis that the coefficients of all the independent variables are equal to zero. Since the F statistic is very high, we would reject the null. Now, we have built a model based on the train data set. Let us test this model on the test set.

So, we use the predict function in R to predict the returns of 2018 and 2019, that is, in the test data set we constructed earlier. So, here is the predict function. The first input would be MLR, and the data would now be test, and we just pass the variables we have used in the final regression.
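A minimal sketch of the prediction step is given below. The data are simulated and split by row position rather than by date, and the names mlr and test_predicted are taken from how they are spoken in the lecture, so treat the details as assumptions.

```r
set.seed(2018)
n <- 600

# Simulated data with the three predictors of the final model.
Sensex    <- rnorm(n, sd = 0.010)
Sentiment <- rnorm(n, sd = 0.020)
Dividend  <- rbinom(n, size = 1, prob = 0.05)
ABC       <- 0.5 * Sensex + 0.1 * Sentiment + 0.02 * Dividend + rnorm(n, sd = 0.010)
dat       <- data.frame(ABC, Sensex, Sentiment, Dividend)

# First 500 rows play the role of train, the last 100 the role of test.
train <- dat[1:500, ]
test  <- dat[501:600, ]

# Fit on train only, then predict on the held-out rows.
mlr <- lm(ABC ~ Sensex + Sentiment + Dividend, data = train)
test_predicted <- predict(mlr, newdata = test[, c("Sensex", "Sentiment", "Dividend")])

print(head(test_predicted))
```

The newdata argument must contain columns with the same names as the predictors used in the formula. The column ABC is not needed for prediction, which is why only the three predictors are passed.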

So, this will return the predicted returns of 2018 and 2019 using the regression equation, and we can check the R square of the predicted values. For that, we will need to install a package called miscTools.

So, remember how we install a package in R. We use install.packages, and then we load it using the library function. So, once we have this package, we can use the rSquared function.

So, there is a function called rSquared in the miscTools package. The first input is the actual Y, and the second input is the error term, or actual minus predicted. So, we can use this function to calculate R square, and we get an R square of 0.41, which is even higher than what we saw in the training set, where it was 0.34.
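To make clear what is being computed, here is a base R sketch of R square from actual values and residuals, using the usual definition of one minus the residual sum of squares over the total sum of squares. The helper r_squared is a hypothetical stand-in written for this example. It is assumed, not verified here, that the package function follows the same definition.

```r
# Hypothetical helper: R square = 1 - SSE / SST.
# 'resid' is actual minus predicted, as described in the lecture.
r_squared <- function(actual, resid) {
  1 - sum(resid^2) / sum((actual - mean(actual))^2)
}

# Made-up actual and predicted returns.
actual    <- c(0.01, 0.02, 0.03, 0.04)
predicted <- c(0.011, 0.019, 0.032, 0.038)

r2 <- r_squared(actual, actual - predicted)
print(r2)
```

Note that on a test set this quantity can even be negative if the predictions are worse than simply using the mean of the actual values, so a positive value like the 0.41 above is a reassuring sign.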


This means that even in the test data set, our regression model can explain about 40% of the variation in the dependent variable. We can also look at its plot. So, here is a plot of actual versus predicted returns of ABC. The blue line is the actual one and the other line is the predicted one.

Apart from R square, another way to look at the relationship between predicted and actual values is to find the correlation between them. So, you can just use the cor function in R: cor of test$ABC, which is the actual value, and test_predicted, which is the predicted value. The correlation is positive. It is 0.64, and we can say that there is a high correlation between actual and predicted values.
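The call itself is a one-liner. The sketch below runs it on made-up numbers, with test and test_predicted named as in the lecture, so the result will not equal the 0.64 quoted above.

```r
# Made-up actual returns (in a data frame called test) and predictions.
test <- data.frame(ABC = c(0.010, -0.004, 0.006, 0.002, -0.001))
test_predicted <- c(0.006, -0.001, 0.004, 0.003, -0.002)

# Pearson correlation between actual and predicted returns.
r <- cor(test$ABC, test_predicted)
print(r)
```

For what it is worth, 0.64 squared is about 0.41, which is in line with the test R square quoted earlier, although on held-out data the squared correlation and R square are not guaranteed to coincide.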


So, now we have tested the accuracy of our model on the test data set. We get an R square of 0.41, and the correlation between predicted and actual is 0.64.

To conclude, in this case study we have built two linear regression models. One is a simple linear regression model and the other is a multiple linear regression model. The first one had only Nifty as an independent variable. The second one had Sensex, dividend announced and sentiment as the independent variables.

We can compare the two models based on adjusted R square. Remember, we don't use R square, but rather adjusted

R square to compare two models with different number of variables.

The adjusted R square value for the simple linear regression is 0.108, or about 11%, which means that about 11% of the variation is being explained by Nifty, whereas for the multiple linear regression model, our equation could explain 34% of the variation, and the adjusted R square is 0.34 in this case.
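As a reminder of where adjusted R square comes from, the sketch below fits both kinds of model on simulated data, reads adj.r.squared from the summary, and checks it against the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where p is the number of independent variables. The simulated coefficients are arbitrary, so the values differ from 0.108 and 0.34.

```r
set.seed(99)
n <- 1000

# Simulated data in the spirit of the case study (not the real file).
market    <- rnorm(n, sd = 0.010)
Sensex    <- market + rnorm(n, sd = 0.002)
Nifty     <- market + rnorm(n, sd = 0.002)
Sentiment <- rnorm(n, sd = 0.020)
Dividend  <- rbinom(n, size = 1, prob = 0.02)
ABC       <- 0.5 * Sensex + 0.3 * Sentiment + 0.02 * Dividend + rnorm(n, sd = 0.010)
train     <- data.frame(ABC, Sensex, Nifty, Sentiment, Dividend)

slr_s <- summary(lm(ABC ~ Nifty, data = train))
mlr_s <- summary(lm(ABC ~ Sensex + Sentiment + Dividend, data = train))

# Hypothetical helper implementing the standard adjusted R square formula.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

manual_slr <- adj_r2(slr_s$r.squared, n, p = 1)
manual_mlr <- adj_r2(mlr_s$r.squared, n, p = 3)

print(c(slr = slr_s$adj.r.squared, mlr = mlr_s$adj.r.squared))
print(c(slr_manual = manual_slr, mlr_manual = manual_mlr))
```

The adjustment penalises each extra variable, so a model with more variables only scores higher if the added variables explain enough additional variation to outweigh the penalty.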

On the basis of adjusted R square, we can conclude that the multiple linear regression model does a better job in

explaining the variation in the return of ABC. Further, the variables Sensex, dividend announced and sentiment are all also

significant. I hope you guys are now comfortable with linear regression and also know how to build SLR and MLR

models in R.


No part of this publication may be reproduced, transmitted, or stored in a retrieval system, in any form

or by any means, electronic, mechanical, photocopying, recording or otherwise,

without the prior permission of the publisher.
