Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment

Julio Cesar Salinas Alvarado, Karin Verspoor, Timothy Baldwin
Department of Computing and Information Systems

The University of Melbourne, Australia

[email protected] [email protected] [email protected]

Abstract

Risk assessment is a crucial activity for financial institutions because it helps them to determine the amount of capital they should hold to assure their stability. Flawed risk assessment models could return erroneous results that trigger a misuse of capital by banks and, in the worst case, their collapse. Robust models need large amounts of data to return accurate predictions, the source of which is text-based financial documents. Currently, bank staff extract the relevant data by hand, but the task is expensive and time-consuming. This paper explores a machine learning approach for information extraction of credit risk attributes from financial documents, modelling the task as a named-entity recognition problem. Generally, statistical approaches require labelled data to learn the models; however, the annotation task is expensive and tedious. We propose a solution for domain adaptation for NER based on out-of-domain data, coupled with a small amount of in-domain data. We also developed a financial NER dataset from publicly-available financial documents.

1 Introduction

In the years 2007–2008, the GFC (Global Financial Crisis) affected a vast number of countries around the world, causing losses of around USD$33 trillion and the collapse of big-name banks (Clarke, 2010). Experts identified that one of the main causes of the GFC was the use of poor financial models in risk assessment (Clarke, 2010; news.com.au, 2010; Debelle, 2009).

Risk assessment helps banks to estimate the amount of capital they should keep at hand to promote their stability and at the same time to protect their clients. Poor risk assessment models tend to overestimate the capital required, leading banks to make inefficient use of their capital, or underestimate the capital required, which could lead to banks collapsing in a financial crisis.

Financial documents such as contracts and loan agreements provide the information required to perform the risk assessment. These texts hold relevant details that feed into the assessment process, including: the purpose of the agreement, amount of loan, and value of collateral. Figure 1 provides a publicly available example of a loan agreement, as would be used in risk assessment.

Currently, bank staff manually extract the information from such financial documents, but the task is expensive and time-consuming for three main reasons: (1) all documents are in unstructured, textual form; (2) the volume of “live” documents is large, numbering in the millions of documents for a large bank; and (3) banks are continuously adding new information to the risk models, meaning that they potentially need to extract new fields from old documents they have previously analyzed.

Natural language processing (NLP) potentially offers the means to semi-automatically extract information required for risk assessment, in the form of named entity recognition (NER) over fields of interest in the financial documents. However, while we want to use supervised NER models, we also want to obviate the need for large-scale annotation of financial documents. The primary focus of this paper is how to build supervised NER models to extract information from financial agreements, based on pre-existing out-of-domain data with partially-matching labels, coupled with small amounts of in-domain data.

There are few public datasets in the financial domain, due to the privacy and commercial value of the data. In the interest of furthering research on information extraction in the financial domain, we construct an annotated dataset of public-domain financial agreements, and use this as the basis of our experiments.

Figure 1: Example of a loan agreement. Relevant information that is used by risk assessment models is highlighted. The example is taken from a loan agreement that has been disclosed as part of an SEC hearing, available at http://www.sec.gov/Archives/edgar/data/1593034/000119312514414745/d817818dex101.htm

This paper describes an approach to domain adaptation that adds a small amount of target-domain data to the source-domain data. The results obtained encourage the use of this approach in cases where the amount of target data is minimal.

2 Related Work

Most prior approaches to information extraction in the financial domain make use of rule-based methods. Farmakiotou et al. (2000) extract entities from financial news using grammar rules and gazetteers. This rule-based approach obtained 95% accuracy overall, at a precision and recall of 78.75%. Neither the number of documents in the corpus nor the number of annotated samples used in the work is mentioned, but the corpus comprises 30,000 words for training and 140,000 for testing. The approach involved the creation of rules by hand; this is a time-consuming task, and the overall recall is low compared to other extraction methods.

Another rule-based approach was proposed by Sheikh and Conlon (2012) for extracting information from financial data (combined quarterly reports from companies and financial news) with the aim of assisting in investment decision-making.


The rules were based on features including exact word match, part-of-speech tags, orthographic features, and domain-specific features. After creating a set of rules from annotated examples, they tried to generalize the rules using a greedy search algorithm and also the Tabu Search algorithm. They obtained the best performance of 91.1% precision and 83.6% recall using the Tabu Search algorithm.

The approach of Farmakiotou et al. (2000) is similar to our approach in that they tried to address an NER problem with financial data. However, their data came from financial news rather than financial agreements, as targeted in our work. The focus of Sheikh and Conlon (2012) is closer to that of this paper, in that they make use of both financial news and corporate quarterly reports. However, their extraction task does not consider financial contracts, which is the key characteristic of our problem setting.

Somewhat further afield — but related in the sense that financial agreements stipulate the legal terms of a financial arrangement — is work on information extraction in the legal domain. Moens et al. (1999) used information extraction to obtain relevant details from Belgian criminal records with the aim of generating abstracts from them. The approach takes advantage of discourse analysis to find the structure of the text and linguistic forms, and then creates text grammars. Finally, the approach uses a parser to process the document content. Although the authors do not present results, they argue that when applied to a test set of 1,000 criminal cases, they were able to identify the required information.

In order to reduce the need for annotation, we explore domain adaptation of an information extraction system using out-of-domain data and a small amount of in-domain data. Domain adaptation for named entity recognition techniques has been explored widely in recent years. For instance, Jiang and Zhai (2006) approached the problem by generalizing features across the source and target domains in a way that avoids overfitting. Mohit and Hwa (2005) proposed a semi-supervised method combining a naive Bayes classifier with the EM algorithm, applied to features extracted from a parser, and showed that the method is robust over novel data. Blitzer et al. (2006) induced a correspondence between features from a source and target domain based on structural correspondence learning over unlabelled target domain data. Qu et al. (2015) showed that a graph transformer NER model trained over word embeddings is more robust cross-domain than a model based on simple lexical features.

Our approach is based on large amounts of labelled data from a source domain and small amounts of labelled data from the target domain (i.e. financial agreements), drawing inspiration from previous research that has shown that using a modest amount of labelled in-domain data to perform transfer learning can substantially improve classifier accuracy (Duong et al., 2014).

3 Background

Named entity recognition (NER) is the task of identifying and classifying token-level instances of named entities (NEs), in the form of proper names and acronyms of persons, places or organizations, as well as dates and numeric expressions in text (Cunningham, 2005; Abramowicz and Piskorski, 2003; Sarawagi, 2008). In the financial domain, example NE types are LENDER, BORROWER, AMOUNT, and DATE.

We build our supervised NER models using conditional random fields (CRFs), a popular approach to sequence classification (Lafferty et al., 2001; Blunsom, 2007). CRFs model the conditional probability p(s|o) of labels (states) s given the observations o as in Equation 1, where t is the index of words in the observation sequence o, each k is a feature, w_k is the weight associated with feature k, and Z_w(o) is a normalization constant:

$$p(s \mid o) = \frac{\exp\left(\sum_{t} \sum_{k} w_k f_k(s_{t-1}, s_t, o, t)\right)}{Z_w(o)} \qquad (1)$$
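To make Equation 1 concrete, the following sketch computes p(s|o) for a toy label set by brute-force enumeration of all label sequences. The feature functions, weights and labels here are invented for illustration only; a real CRF toolkit such as CRF++ learns the weights from data and computes Z_w(o) efficiently with dynamic programming.

```python
from itertools import product
from math import exp

# Toy linear-chain CRF illustrating Equation 1. Labels, feature functions
# and weights below are invented for illustration only.
LABELS = ["PER", "O"]

def features(prev_label, label, obs, t):
    """Binary feature functions f_k(s_{t-1}, s_t, o, t), keyed by name."""
    return {
        "word=%s,label=%s" % (obs[t], label): 1.0,
        "trans=%s->%s" % (prev_label, label): 1.0,
    }

# Hypothetical learned weights w_k; any feature not listed has weight 0.
WEIGHTS = {
    "word=lender,label=PER": 2.0,
    "word=the,label=O": 1.5,
    "trans=O->PER": 0.5,
}

def unnormalised(labels, obs):
    """exp of the sum over positions t and features k of w_k * f_k(...)."""
    total = 0.0
    for t in range(len(obs)):
        prev = labels[t - 1] if t > 0 else "<s>"
        for name, value in features(prev, labels[t], obs, t).items():
            total += WEIGHTS.get(name, 0.0) * value
    return exp(total)

def probability(labels, obs):
    """p(s|o) as in Equation 1, with Z_w(o) computed by brute force over
    all |LABELS|^len(obs) label sequences (feasible for toy inputs only)."""
    z = sum(unnormalised(seq, obs) for seq in product(LABELS, repeat=len(obs)))
    return unnormalised(labels, obs) / z

print(probability(["O", "PER"], ["the", "lender"]))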

4 Methods

4.1 Data

In order to evaluate NER over financial agreements, we annotated a dataset of financial agreements made public through U.S. Securities and Exchange Commission (SEC) filings. Eight documents (totalling 54,256 words) were randomly selected for manual annotation, based on the four NE types provided in the CoNLL-2003 dataset: LOCATION (LOC), ORGANISATION (ORG), PERSON (PER), and MISCELLANEOUS (MISC). The annotation was carried out using the Brat annotation tool (Stenetorp et al., 2012). All documents were pre-tokenised and part-of-speech (POS) tagged using NLTK (Bird et al., 2009). As part of the annotation, we automatically tagged all instances of the tokens lender and borrower as being of entity type PER. We have made this dataset available in CoNLL format for research purposes at: http://people.eng.unimelb.edu.au/tbaldwin/resources/finance-sec/
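As an illustration of the preprocessing described above, a sketch along the following lines would produce CoNLL-style rows. This is not the authors' actual pipeline; it assumes NLTK's default sentence tokeniser, word tokeniser and POS tagger models are installed, and it omits the chunk column of the full CoNLL-2003 format.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger models are installed

def to_conll(text):
    """Tokenise and POS-tag a document with NLTK, emitting CoNLL-style rows.
    Tokens "lender"/"borrower" are auto-tagged PER, as described above; all
    other tokens default to O, pending manual annotation in Brat."""
    rows = []
    for sentence in nltk.sent_tokenize(text):
        for token, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
            ne = "B-PER" if token.lower() in {"lender", "borrower"} else "O"
            rows.append("%s %s %s" % (token, pos, ne))
        rows.append("")  # blank line between sentences, as in CoNLL format
    return "\n".join(rows)

print(to_conll("The Borrower shall repay the Lender in full."))
```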

For the training set, we use the CoNLL-2003 English data, which is based on Reuters newswire data and includes part-of-speech and chunk tags (Tjong Kim Sang and De Meulder, 2003).

The eight financial agreements were partitioned into two subsets of five and three documents, which we name “FIN5” and “FIN3”, respectively. The former is used as training data, while the latter is used exclusively for testing.

Table 1 summarizes the corpora.

4.2 Features

For all experiments, we used the CRF++ toolkit (Kudo, 2013), with the following feature set (optimized over the CoNLL-2003 development set; a sketch of these features as code follows the list):

• Word features: the word itself; whether the word starts with an upper case letter; whether the word has any upper case letters other than the first letter; whether the word contains digits or punctuation symbols; whether the word has hyphens; whether the word is all lower or upper case.

• Word shape features: a transformation of the word, changing upper case letters to X, lower case letters to x, digits to 0 and symbols to #.

• Penn part-of-speech (POS) tag.

• Stem and lemma.

• Suffixes and prefixes of length 1 and 2.
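The paper implements these features as CRF++ templates; purely as an illustration, the same feature set could be computed in Python roughly as follows. The Porter stemmer and WordNet lemmatiser are assumptions, as the paper does not specify which stemmer and lemmatiser were used.

```python
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer  # assumed choices

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires the NLTK wordnet data

def word_shape(word):
    """Map upper case to X, lower case to x, digits to 0 and symbols to #."""
    return "".join("X" if c.isupper() else
                   "x" if c.islower() else
                   "0" if c.isdigit() else "#" for c in word)

def token_features(word, pos):
    """Features for one token, following the list in Section 4.2."""
    return {
        "word": word,
        "initial_upper": word[0].isupper(),
        "inner_upper": any(c.isupper() for c in word[1:]),
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": any(c in string.punctuation for c in word),
        "has_hyphen": "-" in word,
        "all_lower": word.islower(),
        "all_upper": word.isupper(),
        "shape": word_shape(word),
        "pos": pos,
        "stem": stemmer.stem(word),
        "lemma": lemmatizer.lemmatize(word.lower()),
        "prefix1": word[:1], "prefix2": word[:2],
        "suffix1": word[-1:], "suffix2": word[-2:],
    }

print(token_features("Lender", "NNP"))
```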

4.3 Experimental Setup and Results

We first trained and tested directly on the CoNLL-2003 data, resulting in a model with a precision of 0.833, recall of 0.824 and F1-score of 0.829 (Experiment1), competitive with the state-of-the-art for the task.

The next step was to experiment with the financial data. For that, we first applied the CoNLL-2003 model directly to FIN3. Then, in order to improve the results for the domain adaptation, we trained a new model using the CONLL+FIN5 data set, and tested this model against the FIN3 dataset.
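For readers without CRF++, the experimental setup can be approximated with the sklearn-crfsuite library; this is an assumption about tooling, not the authors' code. The sketch reuses the token_features function above, and the variables conll_train, fin5 and fin3 stand for the parsed CoNLL-format corpora.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def featurize(sentences):
    """sentences: list of sentences, each a list of (word, pos, ne_tag)
    triples as read from the CoNLL-format files."""
    X = [[token_features(w, p) for w, p, _ in sent] for sent in sentences]
    y = [[tag for _, _, tag in sent] for sent in sentences]
    return X, y

# conll_train, fin5 and fin3 are assumed to be parsed corpora (see above).
X_conll, y_conll = featurize(conll_train)
X_fin5, y_fin5 = featurize(fin5)
X_fin3, y_fin3 = featurize(fin3)

# Experiment3: train on CONLL+FIN5, test on FIN3.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_conll + X_fin5, y_conll + y_fin5)
pred = crf.predict(X_fin3)

entity_labels = [l for l in crf.classes_ if l != "O"]
print(metrics.flat_f1_score(y_fin3, pred, average="weighted",
                            labels=entity_labels))
```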

A summary of the experimental results over the financial data sets is presented in Table 2.

Figure 2: Learning curves showing the F-score as more CONLL data is added, for Experiment1 (“increasing CoNLL”) and Experiment3 (“Financial + increasing CoNLL”); the x-axis is training dataset size in sentences, and the y-axis is F1-score. Experiment3 starts with FIN5 and incrementally adds CONLL data.

5 Discussion

Table 2 summarizes the results of directly applying the model obtained by training only over out-of-domain data to the two financial data sets. The difference in the domain composition of the CONLL data (news) and the financial documents can be observed in these results. With out-of-domain test data, a precision of 0.247 and a recall of 0.132 (Experiment2) was observed, while testing with in-domain data achieved a precision of 0.833 and recall of 0.824 (Experiment1).

As a solution to the difference in the nature of the sources in the context of limited annotated in-domain data, we experimented with simple domain adaptation, by including in the source domain (CONLL) data a small amount of the target domain data (i.e. data from FIN5), generating a new training data set (CONLL+FIN5). When trained over this combined data set, the results improved substantially, obtaining a precision of 0.828, recall of 0.770 and F-score of 0.798 (Experiment3).

As additional analysis, in Figure 2, we plot learning curves based on F-score obtained for Experiment2 and Experiment3 as we increase the training set (in terms of the number of sentences). We can see that the F-score increases slightly with increasing amounts of pure CONLL data (Experiment2), but that in the case of the mixed training data (Experiment3), the results actually drop as we add more CONLL data.
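Learning curves of this kind can be produced by subsampling the training data and retraining at each size; continuing the sklearn-crfsuite sketch above (an illustration, not the authors' setup):

```python
# Sketch of an Experiment3-style curve: hold FIN5 fixed, grow the CONLL portion.
for n in range(1000, 14000, 2000):
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_conll[:n] + X_fin5, y_conll[:n] + y_fin5)
    pred = crf.predict(X_fin3)
    f1 = metrics.flat_f1_score(y_fin3, pred, average="weighted",
                               labels=[l for l in crf.classes_ if l != "O"])
    print(n, round(f1, 3))
```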

Figure 3 shows the learning curves for Experiment3 and Experiment4, as we add more financial data.


Name          Description
CONLL         CoNLL-2003 training data
CONLLtest     CoNLL-2003 test data
CONLL+FIN5    CoNLL-2003 training data + five financial agreements
FIN5          Five financial agreements
FIN3          Three financial agreements

Table 1: Description of the data sets used.

Name          Training Data   Test Data   P       R       F1
Experiment1   CONLL           CONLLtest   0.833   0.824   0.829
Experiment2   CONLL           FIN3        0.247   0.132   0.172
Experiment3   CONLL+FIN5      FIN3        0.828   0.770   0.798
Experiment4   FIN5            FIN3        0.944   0.736   0.827

Table 2: Results of testing over the financial data sets.

Figure 3: Learning curves showing the F-score as more financial training data is added, for Experiment4 (“Increasing Financial”) and Experiment3 (“CoNLL + increasing Financial”); the x-axis is training dataset size in sentences, and the y-axis is F1-score.

Here, in the case of Experiment3, we start out with all of the CONLL data, and incrementally add FIN5. We can see that the more financial data we add, the more the F-score improves, with a remarkably constant absolute difference in F-score between the two experiments for the same amount of in-domain data. That is, even for as little as 100 training sentences, the CONLL data degrades the overall F-score.

The confusion matrix for the predictions of Experiment3 is shown in Table 3.

Analysis of the errors in the confusion matrix reveals that the entity type MISC has perfect recall over the financial dataset. Following MISC, PER is the entity type with the next best recall, at over 0.9. However, the model generally tends to suffer from a high rate of false positives for the entities LOC and ORG, affecting the precision of those classes and the overall performance of the model.

One interesting class of error in the output of the model arises when tokens refer to an address. One example is the case of 40 Williams Street, where the correct label is LOC but the model predicts the first token (40) to be NANE and the other two tokens to be an instance of PER (i.e. Williams Street is predicted to be a person).

In the model generated with just the CONLL data, one notable pattern is consistent false positives on tokens with initial capital letters; for example, the model predicts both Credit Extensions and Repayment Period to be instances of ORG, though in the gold standard they don’t belong to any entity type. This error was reduced drastically through the addition of the in-domain financial data in training, improving the overall performance of the model.

Ultimately, the purely in-domain training strategy in Experiment4 outperforms the mixed data setup (Experiment3), indicating that domain context is critical for the task. Having said that, the results of our study inform the broader question of the out-of-domain applicability of NER models. Furthermore, they point to the value of even a small amount of in-domain training data (Duong et al., 2014).
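As a sanity check on Table 3, each class’s precision and recall follow directly from the confusion-matrix counts; for example, for LOC:

```python
# Recomputing Table 3's LOC scores from its confusion-matrix counts.
actual_loc = {"LOC": 20, "MISC": 0, "ORG": 3, "PER": 2, "O": 14}        # predictions for true LOC
predicted_loc = {"LOC": 20, "MISC": 0, "ORG": 0, "PER": 0, "NANE": 12}  # true classes of predicted LOC

recall = actual_loc["LOC"] / sum(actual_loc.values())           # 20/39 = 0.513
precision = predicted_loc["LOC"] / sum(predicted_loc.values())  # 20/32 = 0.625
print(round(recall, 3), round(precision, 3))
```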

6 Conclusions

Risk assessment is a crucial task for financial institutions such as banks because it helps to estimate the amount of capital they should hold to promote their stability and protect their clients. Manual extraction of relevant information from text-based financial documents is expensive and time-consuming.


                        Predicted
Actual       LOC     MISC    ORG     PER     O      Recall
LOC          20      0       3       2       14     0.513
MISC         0       7       0       0       0      1.000
ORG          0       0       16      0       40     0.286
PER          0       0       0       202     14     0.935
NANE         12      2       24      8       –
Precision    0.625   0.778   0.372   0.953

Table 3: Confusion matrix for the predictions over FIN3 using the model from Experiment3, including the precision and recall for each class (“NANE” = Not a Named Entity).


We explored a machine learning approach that modelled the extraction task as a named entity recognition task. We used a publicly available non-financial dataset as well as a small number of annotated publicly available financial documents. We used a conditional random field (CRF) to label entities. The training process was based on data from CoNLL-2003, which has annotations for the entity types PER (person), MISC (miscellaneous), ORG (organization) and LOC (location). We then assembled a collection of publicly-available loan agreements, and manually annotated them, to serve as training and test data. Our experimental results showed that, for this task and our proposed approach, small amounts of in-domain training data are superior to large amounts of out-of-domain training data, and furthermore that supplementing the in-domain training data with out-of-domain data is actually detrimental to overall performance.

In future work, we intend to test this approach using different datasets with an expanded set of entity types specific to credit risk assessment, such as values and dates. An additional step would be to carry out extrinsic evaluation of the output of the model in an actual credit risk assessment scenario. As part of this, we could attempt to identify additional features for risk assessment, beyond what is required by the financial authorities.

References

Witold Abramowicz and Jakub Piskorski. 2003. Information extraction from free-text business documents. In Stephanie Becker, editor, Effective Databases for Text & Document Management, pages 12–23. IRM Press.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Sebastopol, USA.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128, Sydney, Australia.

Philip Blunsom. 2007. Structured classification for multilingual natural language processing. Ph.D. thesis, University of Melbourne, Melbourne, Australia.

Thomas Clarke. 2010. Recurring crises in Anglo-American corporate governance. Contributions to Political Economy, 29(1):9–32.

Hamish Cunningham. 2005. Information extraction, automatic. In Encyclopedia of Language and Linguistics, pages 665–677. Elsevier, 2nd edition.

Guy Debelle. 2009. Some effects of the global financial crisis on Australian financial markets. http://www.rba.gov.au/speeches/2009/sp-ag-310309.html.

Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, and Paul Cook. 2014. What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 886–897, Doha, Qatar.

Dimitra Farmakiotou, Vangelis Karkaletsis, John Koutsias, George Sigletos, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos. 2000. Rule-based named entity recognition for Greek financial texts. In Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX 2000), pages 75–78, Patras, Greece.

Jing Jiang and ChengXiang Zhai. 2006. Exploiting domain structure for named entity recognition. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 74–81, New York, USA.


Taku Kudo. 2013. CRF++: Yet another CRF toolkit. https://taku910.github.io/crfpp/. Accessed 26 May, 2015.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, Williamstown, USA.

Marie-Francine Moens, Caroline Uyttendaele, and Jos Dumortier. 1999. Information extraction from legal texts: the potential of discourse analysis. International Journal of Human-Computer Studies, 51(6):1155–1171.

Behrang Mohit and Rebecca Hwa. 2005. Syntax-based semi-supervised named entity tagging. In Proceedings of the ACL 2005 Interactive Poster and Demonstration Sessions, pages 57–60, Ann Arbor, USA.

news.com.au. 2010. Poor risk assessment ‘led to global financial crisis’. http://goo.gl/f92sv8. Accessed 10 Nov, 2015.

Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider, and Timothy Baldwin. 2015. Big data small data, in domain out-of domain, known word unknown word: The impact of word representations on sequence labelling tasks. In Proceedings of the 19th Conference on Natural Language Learning (CoNLL-2015), pages 83–93, Beijing, China.

Sunita Sarawagi. 2008. Information Extraction. Foundations and Trends in Databases, 1(3):261–377.

Mahmudul Sheikh and Sumali Conlon. 2012. A rule-based system to extract financial information. Journal of Computer Information Systems, 52(4):10–19.

Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun’ichi Tsujii. 2012. BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107, Avignon, France.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003, pages 142–147, Edmonton, Canada.
