Mass-processing Financial Filings Using Self-Organizing Maps and NLP for M&A Target and Acquirer Identification

Jiayi Chen


Abstract—Combining qualitative and quantitative data to predict a company's financial status is not a new technique in the finance industry. Back used Self-Organizing Maps (SOMs) to organize companies into clusters using financial charts alone [1], and Magnusson then used collocational networks to validate the predictions drawn from quantitative data [9]. Lin later gave methods to optimize clustering [8], allowing more precise characterization of companies in different financial states. Based on these results, we propose an application that summarizes public companies' financial performance, categorizes them according to specified parameters, and produces visual and numerical results, providing an analytical aid that delivers more information faster.

Index Terms— Clustering; Collocational networks; Data mining; Financial reports; Financial Analysis; Benchmarking; Self-organizing maps

Jiayi Chen is with the Department of Computer Science, Columbia University in the City of New York, New York, NY, 10027, USA (email: [email protected]).

I. INTRODUCTION

Financial services institutions have seen applications of Natural Language Processing (NLP) techniques in fields ranging from retail-banking chatbots to sentiment analysis in trading systems to search recommendation in Bloomberg terminals [5]. This paper introduces NLP and text analysis techniques to the Merger and Acquisition (M&A) realm, aiming to help investors, analysts, and companies find desired target companies using a combination of quantitative and qualitative information-parsing techniques developed by recent researchers.

A. Why is it important, and what problem does it solve?

The proposed system aims to expand the number of potential targets and to give researchers more accurate analytical results in an M&A deal. Finding the right target is important because it is the first step of the deal and sets the tone for the rest of the process. The search is a long and complicated decision process that involves taking into consideration the business value, market capitalization, and financial structure of the target company, among many other factors. Moreover, the research often involves reading the financial reports of target companies, which contain textual and numerical information, both of which are crucial indicators to researchers. However, as of 2012, there were 4,102 public companies listed on US exchanges and 39,427 worldwide excluding the US [4]. Sorting through all of their financial filings and calculating their capital expenditure, multiples, and other financial metrics can be a laborious and tedious task. By combining preexisting techniques [1] for analyzing both categories of information and establishing a framework that scrapes, cleans, and summarizes financial filing data into readable forms, this system allows users to quickly identify investment targets and spend more time on in-depth, company-specific research.

B. Key Considerations

Up-to-date information: We can use the existing database made publicly available by the Securities and Exchange Commission (SEC) [2]. By using Python to download full-text pages and then parsing them by section, we can keep our structured database continuously updated.

NLP: The structure of filings often follows a specific format. This form allows the parser to differentiate between text and numeric data. We separate text and numeric data for further analysis.

Categorization: The most important function of this system is to allow users to quickly identify a company of a specific size, industry, or other feature. By using k-means [3] to categorize each company based on its filings, we can create multiple general categorizations that make the companies searchable.

Comparison between similar companies: With the company database in place, we can generate visualizations and help users compare different features of multiple companies. Existing AI databases like CyMetica [7] only list related companies according to their business model and keywords but do not supply concrete analytical aid that advances the decision-making process.
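As a rough illustration of the text/numeric separation described under the NLP consideration, the sketch below labels each line of a filing section as text or numeric by the share of tokens that look like figures. The regular expression, the 50% threshold, and the function name `split_text_and_numbers` are assumptions made for illustration; they are not part of the proposed system.

```python
import re

# Rough pattern for figures such as 1,234  (56)  12.5%  $3.2
# (an assumed heuristic, not a complete grammar for financial tables).
NUMBER = re.compile(r"\(?\$?\d[\d,]*(?:\.\d+)?%?\)?")


def split_text_and_numbers(lines, threshold=0.5):
    """Split lines into (text_lines, numeric_lines).

    A line counts as numeric when at least `threshold` of its
    whitespace-separated tokens fully match the NUMBER pattern.
    Blank lines are skipped.
    """
    text_lines, numeric_lines = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        numeric = sum(1 for tok in tokens if NUMBER.fullmatch(tok))
        if numeric / len(tokens) >= threshold:
            numeric_lines.append(line)
        else:
            text_lines.append(line)
    return text_lines, numeric_lines
```

A real parser would likely rely on the section structure of the filing as well; the token-ratio rule is only meant to show the idea of routing the two kinds of data to different analyses.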

II. CURRENT LIMITATIONS

A. Fit with M&A applications

Although there are existing methods that analyze financial filings and annual reports and generate predictive results, they focus on predicting future earnings [6] or stock prices [8]. M&A researchers look not only at the current financial status or capital structure of a company, but also at its customer base, its business model, and other value-based characteristics. These characteristics are hard to capture through financial reports alone but can be extracted to an extent from annual reports. However, no literature has applied these techniques to the M&A research process. Companies have different goals during M&A, and the purely quantitative summaries produced by Bloomberg terminals are rudimentary and can be improved by textual analysis of annual reports. Databases such as Capital IQ provide a company's past conference calls and annual reports but leave the user to digest them. Therefore, we leverage textual analysis tools [9] as well as qualitative analysis techniques to output a more detailed categorical summary that helps users decide whether the company can 1) add synergy to the acquirer, 2) help the acquirer reduce costs, and 3) provide strategic advancement to the acquirer's business model.

B. User-friendliness

There is no existing system that can gather all the financial data and produce a summary. Bloomberg terminals allow users to read formatted financial reports and ratios but do not provide a direct mechanism to compare large numbers of companies. The database exists, but the analysis is still done by the user outside of the application. For non-technical users, the task of downloading data and then reading through it to determine the right target can be difficult, and the sheer limitation of human capacity can lead to errors and a narrower pool of targets.

III. HIGH-LEVEL OVERVIEW OF PROPOSED APPROACH

Fig. 1. Flowchart of the application

We propose a system that starts with preliminary data collection from sec.gov, where all financial filings of publicly traded companies in the US can be found in full-text form [2]. We then tap into the SEC EDGAR database through its API and scrape the data into our own database. Next, we use Magnusson's method to process textual data and use Self-Organizing Maps (SOMs) to categorize companies according to different financial metrics. We can also run the data through K-Means clustering in Scikit-learn to give more accurate characterizations. Finally, we label all of the companies, store them in a relational database, and serve it to a search engine that allows users to input desired numerical thresholds or keywords. The system outputs a list of companies with visuals to help users decide.
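The clustering step mentioned above can be sketched with a small, dependency-free version of Lloyd's k-means algorithm. This is a stand-in for illustration only: the text names Scikit-learn's K-Means, while the function `kmeans`, its parameters, and the two-ratio sample data below are assumptions.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Minimal Lloyd's algorithm; returns (assignment, centroids).

    `points` is a list of equal-length numeric tuples, for example
    (operating margin, current ratio) per company.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
    return assignment, centroids


# Hypothetical (operating margin, current ratio) pairs for six companies.
companies = [(0.28, 2.9), (0.30, 3.1), (0.27, 2.8), (0.02, 0.9), (0.05, 1.1), (0.01, 1.0)]
labels, centers = kmeans(companies, k=2)
```

In practice the ratios would probably be standardized first so that no single metric dominates the distance computation.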

IV. DETAILED DESCRIPTION OF PROPOSED APPROACH


A. API Call to EDGAR Online

First, we use the EDGAR Online API [10], which serves SEC filing data, to return companies of the desired quality. For example, if the user is looking for a mid-sized company in the insurance industry based in Santa Clara with revenue greater than 2 million, the user will type in "Santa Clara" for location, "2,000,000" for revenue, and "Insurance" for industry. The application will make individual calls to the EDGAR database like this (for location) [11]:

http://edgaronline.api.mashery.com/v2/companies?filter=city%20eq%20%22SANTA%20CLARA%22&appkey=32srrxnrc49gu2kk4pxa5hhm

and this (for industry):

http://edgaronline.api.mashery.com/v2/companies?filter=industry%20in%20(%22Insurance%20(Miscellaneous)%22,%22Insurance%20(Life)%22)&appkey=32srrxnrc49gu2kk4pxa5hhm

The API returns lists of companies in XML format with fields such as entityids, which we can use in further queries to obtain each company's total revenue through API calls like this:

http://edgaronline.api.mashery.com/v2/corefinancials/ann?entityids=8528&appkey=32srrxnrc49gu2kk4pxa5hhm

We can then gather all the companies with the targeted total revenue. After this preliminary filtering, we can further analyze the companies using the textual and quantitative information in their 10-K reports. To quantitatively assess the current financial status of a company, we make an API call for the company's financial data over the past 10 years (or any valid length) and scrape financial multiples such as operating margin, debt-to-equity ratio, and P/E ratio. They look like this:

<value field="operatingmargin">0.2831</value>
<value field="debtequity">0.3536</value>
<value field="currentratio">2.91</value>

B. Quantitative Analysis

By performing a simple linear regression on these ratios, we can quickly gauge whether the company is experiencing growth. To further visualize the state of a company using quantitative data, we follow Magnusson's method [9].
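The request construction, ratio parsing, and regression described above could be sketched as follows. The base URL and the filter syntax are taken from the example calls in the text; the helper names, the `YOUR_APP_KEY` placeholder, and the `<values>` wrapper element around the `<value>` fields are assumptions, since the full response layout is not shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

BASE = "http://edgaronline.api.mashery.com/v2"  # from the example calls in the text


def companies_url(filter_expr, appkey):
    """Build a /companies query URL, percent-encoding the filter expression."""
    query = urlencode({"filter": filter_expr, "appkey": appkey}, quote_via=quote)
    return f"{BASE}/companies?{query}"


def parse_ratios(xml_text):
    """Map each <value field="..."> element to a float, keyed by field name."""
    root = ET.fromstring(xml_text)
    return {v.get("field"): float(v.text) for v in root.iter("value")}


def trend_slope(values):
    """Ordinary least-squares slope of values against period index 0, 1, 2, ...

    Needs at least two periods; a positive slope suggests growth in the ratio.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# "YOUR_APP_KEY" is a placeholder; <values> is a hypothetical wrapper element.
url = companies_url('city eq "SANTA CLARA"', "YOUR_APP_KEY")
ratios = parse_ratios(
    '<values><value field="operatingmargin">0.2831</value>'
    '<value field="currentratio">2.91</value></values>'
)
```

Fetching the URL is left out on purpose so that the sketch runs without network access or a valid key.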
We first randomly pick 100 companies in the industry to create a benchmark SOM. Then, using the same method as Magnusson, we calculate operating margin, return on equity, return on total assets, current ratio, equity to capital, interest coverage, and receivables turnover. Although some of these ratios, such as operating margin and current ratio, are provided by the EDGAR API, we recalculate them to make sure the periods match up. Using these data, we can create feature maps and then group them into one SOM for a conclusive analysis. We generate visuals using SOMs [14].
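To make the benchmark SOM step more concrete, here is a minimal self-organizing map in plain Python. It is a sketch under stated assumptions rather than the configuration used in [1], [9], or [14]: the 3x3 grid, the linearly decaying learning rate and neighbourhood width, and the names `train_som` and `best_matching_unit` are all illustrative choices, and the inputs are assumed to be ratios already scaled to the range 0 to 1.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((a - b) ** 2 for a, b in zip(x, weights[node])),
    )


def train_som(data, rows=3, cols=3, epochs=200, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                   # learning rate decays to zero
        sigma = max(sigma0 * (1 - frac), 0.3)   # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):   # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    # Pull the node toward the sample, scaled by grid proximity.
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

After training, each company is placed on the map at its best matching unit, and a feature map for one ratio is simply that ratio's component of every node's weight vector laid out on the grid.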

Fig. 2. Sample Feature Maps from Back [1]

C. Qualitative Data Analysis

To summarize textual data, we follow Magnusson's use of collocational networks, which measure the frequencies of certain word groups and calculate their Mutual Information (MI) scores. Magnusson adopted this method from Williams [12, 13]. Taking advantage of the formulaic nature of financial reports, we expect words such as "growth," "increase," and "expansion" to occur more often than "decrease" or "shrink" in the reports of a company with a positive outlook. Since this method quantifies the weight of each word, we can draw feature maps according to the changes in wording for visualization.
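The MI score for a word pair can be illustrated with the standard pointwise form, MI(x, y) = log2(f(x, y) * N / (f(x) * f(y))), where f counts occurrences and N is the number of tokens. Whether this is exactly the variant used in [9] and [13] is an assumption, as are the window size and the function name below.

```python
import math
from collections import Counter


def mutual_information(tokens, window=4):
    """Score word pairs that co-occur within `window` following tokens.

    Returns {(word_a, word_b): MI}, with each pair stored in sorted order.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pair_freq[tuple(sorted((word, other)))] += 1
    return {
        pair: math.log2(count * n / (word_freq[pair[0]] * word_freq[pair[1]]))
        for pair, count in pair_freq.items()
    }


tokens = "net sales growth continued net sales growth slowed".split()
scores = mutual_information(tokens, window=1)
# ("net", "sales") co-occurs twice among 8 tokens: log2(2 * 8 / (2 * 2)) = 2.0
```

Pairs with high scores become the links of the collocational network, and comparing the networks of consecutive reports shows which word associations appear or disappear.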


D. Cross-validation

Finally, we cross-validate the conclusions from the SOM by drawing collocational networks for each time period of each company and looking for significant changes in the frequencies of words such as "decrease" or "growth." Through a customizable filter, the user can target companies in a growing, declining, or stable state. The application outputs the generated SOMs, with lighter colors marking companies whose high ratios indicate growth and grey marking decline. Along with the SOMs, it also outputs the list of companies whose trend according to textual analysis agrees with the SOM's result. There is also an option for users to check the regression lines calculated from financial data to see the actual growth rates and individual ratios for finer analysis.
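One simple way to picture the agreement check is to compare the direction of keyword usage across periods with the sign of a ratio's regression slope. The sketch below is a deliberate simplification: it uses raw keyword shares rather than full collocational networks, the keyword sets are the example words quoted in Section IV-C, and the sign-agreement rule and function names are assumptions.

```python
POSITIVE = {"growth", "increase", "expansion"}
NEGATIVE = {"decrease", "shrink"}


def text_trend(tokens_by_period):
    """Return +1, -1, or 0 when the net positive-word share rises, falls, or is flat.

    `tokens_by_period` is a list of token lists, oldest period first.
    """
    def score(tokens):
        pos = sum(t in POSITIVE for t in tokens)
        neg = sum(t in NEGATIVE for t in tokens)
        return (pos - neg) / max(len(tokens), 1)

    first, last = score(tokens_by_period[0]), score(tokens_by_period[-1])
    return (last > first) - (last < first)


def agrees(tokens_by_period, ratio_slope):
    """True when the textual trend and the quantitative slope point the same way."""
    sign = (ratio_slope > 0) - (ratio_slope < 0)
    return text_trend(tokens_by_period) == sign


periods = [["sales", "decrease", "shrink"], ["sales", "growth", "increase"]]
keep_company = agrees(periods, ratio_slope=0.05)
```

Companies for which `agrees` is true would be the ones listed alongside the SOM output.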

V. ADVANTAGES

A. Merger model validation

One reason a company seeks M&A is to find similar companies with existing resources, so that it can expand market share without having to build new factories or compete with the existing company. By gathering and summarizing financial filings using collocational networks and SOMs, we can categorize the companies according to the needs of the analysts. For example, the data can be plugged into a merger model to affirm or reject the hypothesis that the two companies are fit for a merger or acquisition.

Many companies look into M&A in order to fill a gap in their product line. For example, Walmart bought Jet.com to fill the gap in its online sales. By using collocational networks and looking for target-industry keywords, the application can assess whether there are potential product gaps in one company that can be filled by another company. Combined with market size and other financial metrics, this can help analysts spot potential deals.

C. Faster and more comparable

Existing solutions such as Bloomberg Terminal, GuruFocus, or CyMetica directly provide only company-specific information. By using a SOM with industry benchmark data and textual analysis, we can quickly situate the profitability or growth potential of a company within its industry and reduce the amount of ad hoc, repetitive Excel work an analyst has to go through in order to short-list target companies.

D. High Customizability

Since EDGAR Online provides all the publicly available data of a company, we can in theory look at all the statistics for a comprehensive view of the industry, or focus on just one ratio across all companies. Whereas many investment companies track potential targets by grouping analysts into different industries, this application allows cross-industry analysis for users without much prior knowledge of the industry. Users can further integrate the application into their routine by adding DCF models, merger models, or comparable-company analyses. Because we scrape data directly from EDGAR, the official SEC site, and format the quantitative data as we scrape, a user can, with some tweaks to the code, easily generate models for one or more companies in the correct timeframe with accurate values. This way, the same analyst can generate more models and spend more time analyzing the hidden synergies in a potential merger or acquisition, increasing the amount of information processed and helping users make more informed recommendations to client acquirers.

VI. CHALLENGES

A. Data Source excludes private companies

This system is based on the publicly available data in the EDGAR Online API and therefore cannot access private company data. This poses a great challenge because most companies in the economy are private, and the most sought-after targets are often private as well [20], especially small companies and startups with innovative technologies. However, this shortcoming is a matter of information source, and our system will be able to quickly sort through such data if they become available to us.

B. API limitations

In order to continuously update data, we will need access to the EDGAR Online API, and in case of commercialization we might need to pay for our usage to increase the number of calls and our access to the database [15]. Currently, the API allows only 5000 calls per day and 2 calls per second. Since the number of documents far exceeds this daily limit, our requests would be cut off if we tried to access all available companies in a single day.

C. Adoption

Interface design: We might need to limit the features to better tailor the application to particular M&A teams or industries.

Websites that use similar APIs, such as OTC Markets [16], NASDAQ stock report [17], Yahoo Finance [18], and Chart IQ [19], all have comprehensive financial reports online, and some, like NASDAQ, even list calculated financial multiples in charts. The market for financial reporting might seem saturated already with these established sources. However, we believe that by providing visualizations of SOMs and allowing direct comparisons, we will attract users with a less technical finance background, such as a new CEO or the owner of a traditional business.

D. Regulatory

Since all data are publicly available, we do not see any regulatory limitations in our proposed method as long as it remains a public application and is not in commercial use. If there is commercial potential for this application, it should not be difficult to pay EDGAR Online to use their API, since they already have existing customers including OTC Markets, NASDAQ, Yahoo Finance, and Chart IQ. There is also a slim chance that a bug in the application might cause the wrong data field to be fed into a model. We would work hard to detect such bugs, but in commercial use such errors could skew a recommendation and lead to lawsuits against us, the developers, and the user.

REFERENCES

[1] Back, B., K. Sere, and H. Vanharanta. 2017. "Analyzing Financial Performance With Self-Organizing Maps". 1998 IEEE International Joint Conference On Neural Networks Proceedings. IEEE World Congress On Computational Intelligence (Cat. No.98CH36227). Accessed May 2. doi:10.1109/ijcnn.1998.682275.

[2] "Full Text Search". 2017. Searchsec.Gov. https://searchwww.sec.gov/EDGARFSClient/jsp/EDGAR_MainAccess.jsp.

[3] Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition letters 31, no. 8 (2010): 651-666.

[4] National Bureau of Economic Research. 2015. The U.S. Listing Gap. Cambridge, MA: NBER.

[5] "Trending On Twitter: Social Sentiment Analytics | Bloomberg L.P.". 2017. Bloomberg L.P.. https://www.bloomberg.com/company/announcements/trending-on-twitter-social-sentiment-analytics/?_ga=1.129143941.84714901.1490848784.

[6] Chen, Kuo-Tay, Tsai-Jyh Chen, and Ju-Chun Yen. "Predicting future earnings change using numeric and textual information in financial reports." Pacific-Asia Workshop on Intelligence and Security Informatics. Springer Berlin Heidelberg, 2009.

[7] "Cymetica | Financial Bots For Public Company Discovery". 2017. Cymetica.Com. http://cymetica.com.

[8] Lin, Ming-Chih, Anthony JT Lee, Rung-Tai Kao, and Kuo-Tay Chen. "Stock price movement prediction using representative prototypes of financial reports." ACM Transactions on Management Information Systems (TMIS) 2, no. 3 (2011): 19.

[9] Magnusson, Camilla, Antti Arppe, Tomas Eklund, Barbro Back, Hannu Vanharanta, and Ari Visa. "The language of quarterly reports as an indicator of change in the company’s financial status." Information & Management 42, no. 4 (2005): 561-574.

[10] "EDGAR®Online - EDGAR®Online Datafied API". 2017. Developer.Edgar-Online.Com. http://developer.edgar-online.com/.

[11] "EDGAR®Online - Companies Metadata". 2017. Developer.Edgar-Online.Com. http://developer.edgar-online.com/docs/companies.

[12] J. Sinclair, Corpus, Concordance, Collocation, Oxford University Press, Oxford, 1991.

[13] G.C. Williams, Collocational networks: interlocking patterns of lexis in a corpus of plant biology research articles, International Journal of Corpus Linguistics 3, 1998, pp. 151–171.

[14] J. Karlsson, B. Back, H. Vanharanta, A. Visa, Analysing financial performance with quarterly data using self-organising maps, TUCS Technical Report No. 430, Turku, 2001.

[15] "EDGAR®Online". 2017. Developer.Edgar-Online.Com. http://developer.edgar-online.com/apps/register.

[16] "OTC Markets | Official Site Of The OTCQX, OTCQB And OTC Pink Marketplaces Featuring Free Stock & Bond Quotes, Trade Prices, Chart, Financials And Company News & Information For Investors, Companies And Traders - Otcmarkets.Com". 2017. Otcmarkets.Com. http://www.otcmarkets.com/stock/RHHBY/financials.

[17] "Stock Report of EMC". 2017. NASDAQ.com. http://www.nasdaq.com/symbol/emc/stock-report.

[18] "TSLA Income Statement | Tesla, Inc. Stock - Yahoo Finance". 2017. Finance.Yahoo.Com. http://finance.yahoo.com/quote/TSLA/financials?ltr=1.

[19] "HTML5 Financial Charting And Data Visualization Solutions". 2017. Chartiq. https://www.chartiq.com/.

[20] Capron, Laurence, and Jung-Chin Shen. "Acquisitions of private vs. public firms: Private information, target selection, and acquirer returns." Strategic Management Journal 28, no. 9 (2007): 891-911.

Jiayi Chen is a second-year student pursuing a B.S. in Computer Science at Columbia University in the City of New York. Her research interests include cryptography, data mining, and financial transaction systems.