Text Analytics - An Application in Indian Stock Markets

Upload: sinjana-ghosh

Post on 26-Jan-2015

Category:

Data & Analytics


DESCRIPTION

This presentation was created to present the project done as part of the Applied Management Research Project at Vinod Gupta School of Management, IIT Kharagpur.

TRANSCRIPT

  • 1. Vinod Gupta School of Management, IIT Kharagpur. Text Analytics - An Application in Indian Stock Market. Applied Management Research Project, 2014. By Sinjana Ghosh, done under the able guidance of Prof. A. K. Misra.

  • 2. Background: Motivation behind this project

  • 3. Algorithmic Trading in India
    - Involves the use of algorithms in pre-built platforms to place electronic trades on stocks, futures, options, currencies and commodities on exchanges, without any human intervention.
    - In 2008, India allowed the first Direct-Market-Access (DMA) and algorithmic trades to go through.
    - The most commonly used algorithmic trading strategies in India include arbitrage, market making and trend following.

  • 4. Big Data
    - Data is available in various forms: not just structured, but also semi-structured (like XML and EDI documents) and unstructured (like text, multimedia etc.).
    - Big Data analytics is the strategy of using this huge amount of data, now accessible through the internet, mobile messages and various other platforms, to extract useful information that can be further analyzed to help in the decision-making process.

  • 5. Text Data Analytics
    - A subset of Big Data analytics which involves extracting entities like person, location, organization etc. from text messages, along with the relationships between the extracted entities, and analysing them for business needs.
    Predictive Analytics
    - Involves searching for meaningful relationships among variables and representing those relationships in models.
    - Response variables and explanatory variables.
    - Two common types of model: regression and classification.

  • 6. Sentiment Analysis
    - Use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
    - Aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document.
    Machine Learning
    - A branch of artificial intelligence that concerns the construction and study of systems that can learn from data.

  • 7. The Problem: Using text mining of news articles available in the public domain to analyse the market sentiment and correlate it with the actual movement in the Nifty 50.

  • 8. Objective
    - Mine textual news from a range of online sources to check for occurrences of a basic set of keywords in each article.
    - Train a machine learning algorithm to predict the impact of the most viewed news articles on market sentiment and to predict the movement of the market, represented in the study by the Nifty 50.
    - Validate the results obtained from the training set using a set of recent news articles (test set) to check for errors and the level of accuracy.

  • 9. Methodology
    Textual Representation
    - Bag of words
    - Noun Phrasing
    - Named Entities
    - Named Entities with context-capturing feature
    Predictive Modelling Approach
    Source: Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (Mill)

  • 10. Methodology: Sources of textual data

  • 11. Methodology: Partitioning data in machine learning
    Source: Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (Mill)

  • 12. Text Analysis Algorithm
    1. Convert all the characters to lowercase.
    2. Remove stop-words which do not help in sentiment analysis, like is, are, if, when, where, then, their, there, why, which, how.
    After this, the following is done:
    1. Create an array of named entities which are of significance, like inflation, gdp, sensex etc.
    2. Run the script which extracts the named entities that occur in the article, along with the 2 words immediately preceding and the 3 words immediately succeeding each one. This captures not only the keywords but also their context.
    3. Train the algorithm by assigning a weight to each keyword so that the sentiment score most closely reflects the actual returns of the day.

  • 13. Text Analysis Algorithm
    4. A set of qualifiers is defined, and the preceding and succeeding words are captured as the context of the extracted keyword. The algorithm then assigns a weight (-1 for negative, 0 for neutral and +1 for positive) to each extracted qualifier.
    5. The sum product of the qualifier weights and keyword weights gives the sentiment score of the article, from which the returns of the day due to that news can be predicted.
    6. The importance score is simply the sum of the weights of the individual occurrences of keywords in the article. However, whether the effect will be positive or negative, and how much the market will react to it, is determined only by the sentiment score.
    7. Regression is performed on the scores versus the actual returns for the training set, and a formula is obtained for converting the scores into forecasted returns.
    8. This is tested on the validation set and errors are calculated.

  • 14. Training of algorithm: Training set: Daily returns of 2013-14 with returns>1% or returns
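The text analysis steps on slides 12 and 13 can be illustrated with a minimal Python sketch. It is not the project's actual script: the keyword weights, the qualifier list, and the function names `tokenize`, `score_article` and `fit_line` are all hypothetical, and the weights are placeholders rather than trained values. It also assumes that the context window is counted after stop-word removal, and that when several qualifiers fall inside one window their weights are summed; the slides do not specify either point.

```python
import re

# Hypothetical inputs: the slides do not list the real keywords, weights or qualifiers.
STOP_WORDS = {"is", "are", "if", "when", "where", "then", "their", "there",
              "why", "which", "how"}
KEYWORD_WEIGHTS = {"inflation": 2.0, "gdp": 1.5, "sensex": 1.0}  # placeholder values
QUALIFIER_WEIGHTS = {"rises": 1, "surges": 1, "falls": -1, "slumps": -1,
                     "unchanged": 0}


def tokenize(article):
    """Lowercase the text, split it into words and drop stop-words."""
    words = re.findall(r"[a-z0-9%]+", article.lower())
    return [w for w in words if w not in STOP_WORDS]


def score_article(article, before=2, after=3):
    """Return (sentiment_score, importance_score) for one article."""
    words = tokenize(article)
    sentiment = 0.0
    importance = 0.0
    for i, word in enumerate(words):
        weight = KEYWORD_WEIGHTS.get(word)
        if weight is None:
            continue
        # Context: 2 words before and 3 words after the keyword.
        context = words[max(0, i - before):i] + words[i + 1:i + 1 + after]
        # Assumption: qualifier weights inside one window are summed.
        qualifier_sum = sum(QUALIFIER_WEIGHTS.get(w, 0) for w in context)
        sentiment += weight * qualifier_sum   # sum product of the two weights
        importance += weight                  # sum of keyword weights only
    return sentiment, importance


def fit_line(scores, returns):
    """Least-squares fit of returns = slope * score + intercept."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(returns) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, returns))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With these placeholder weights, `score_article("GDP rises sharply while the Sensex slumps on weak data")` returns `(0.5, 2.5)`: the GDP mention contributes +1.5 and the Sensex mention contributes -1.0 to the sentiment score, while the importance score is the plain sum of the two keyword weights. `fit_line` would then be fitted on the training-set scores and daily returns to convert a score into a forecasted return.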