Catch Me If You Can:
Improving the Scope and Accuracy of Fraud Prediction
Bidisha Chakrabarty, Pamela C. Moulton, Leo Pugachev, and Frank Wang*
July 24, 2018
*Chakrabarty ([email protected]) and Wang ([email protected]) are at Saint Louis University; Moulton ([email protected]) is at Cornell University; Pugachev ([email protected]) is at the University of Oklahoma. We thank Dan Amiram, Attila Balogh, Scott Duellman, Quentin Dupont, Jonathan Karpoff, Dave Michayluk, Mark Nigrini, Ethan Rouen, Rik Sen, Wing Wah Tham, Wayne Thomas, and seminar participants at University of Queensland and University of New South Wales for helpful comments.
Abstract
We propose a parsimonious metric – the Adjusted Benford score (AB-score) – to improve the
detection of financial misstatements. Based on Benford’s Law, which predicts the leading-digit
distribution of naturally occurring numbers, the AB-score estimates a given firm-year’s likelihood
of financial statement manipulation, compared to its peers and controlling for time-series trends.
The AB-score requires less data than the leading accounting-based misstatement metric (the F-
score) and can be computed for many more firm-years, including financial firms. For firm-years
with all data available, combining the AB-score and F-score variables in one model yields higher
accuracy in predicting misstatements in- and out-of-sample.
Keywords: Fraud, Accounting quality, Benford’s Law, F-score, AAERs, Earnings manipulation,
Earnings misstatement
JEL Classification: G20, G23, M41
1. Introduction
Financial fraud is difficult to predict because the perpetrators enjoy an informational advantage
over victims and investigators. Globally, organizations lose about 5% ($3.5 trillion) of their annual
revenues to fraud.1 Fraud victimizes shareholders (Karpoff, Lee, and Martin, 2008), affects lenders
(Fulghieri, Strobl, and Xia, 2014), damages the reputation of directors (Fich and Shivdasani, 2007)
and auditors (Skinner and Srinivasan, 2012), and ties up the resources of investigative agencies.
Given the wide reach of fraud, it is not surprising that considerable effort is devoted to fraud
detection and prediction.
We offer a new, parsimonious metric to detect financial reporting irregularities, such as
earnings management, manipulation, and/or misstatement.2 This metric is easy to compute and
requires fewer inputs than existing measures, so it can be computed for a wider range of firms,
including financial firms. It performs well in out-of-sample tests and, importantly, increases the
number of firm-years that can be examined by more than 50% compared to metrics that require
specific accounting variables. Our measure is based on the mathematical observation known as
Benford’s Law (Benford, 1938), which predicts the frequency of each leading digit in a naturally
occurring distribution of numbers (that is, what fraction of numbers should begin with each digit,
1 through 9). For example, in distributions that obey Benford’s Law, the number 1 appears as the
first digit (as in 19 or 168) about 30% of the time, while the number 9 appears as the first digit less
than 5% of the time. Amiram, Bozanic, and Rouen (2015) observe that restated financial
statements more closely adhere to Benford’s Law than the misstated versions in the same year and
that divergence from Benford’s Law can be used to predict material misstatements. Using the
Amiram et al. (2015) findings as a springboard, we conduct a comprehensive investigation into
1 This estimate comes from a survey of Certified Fraud Examiners, who investigated cases between January 2010 and December 2011 and arrived at the estimate by using the 2011 Gross World Product. The Association of Certified Fraud Examiners (ACFE) published the results of the survey in its 2012 Report to the Nations on Occupational Fraud & Abuse, available at http://www.acfe.com/press-release.aspx?id=4294973129. This estimate is comparable to the 3% organizational revenue loss from corporate fraud estimated for the U.S. by Dyck, Morse, and Zingales (2013).
2 We follow the literature in referring to these irregularities as earnings management, manipulation, misreporting, or misstatement (terms that are used interchangeably in the literature), rather than fraud per se. Although Securities and Exchange Commission (SEC) allegations often imply evidence of fraud, firms typically neither admit nor deny guilt when responding to them.
the cross-sectional and time-series properties of firms’ financial statement deviations from
Benford’s Law and propose two new metrics that can be used to identify firm-years with higher
likelihood of misreporting.
We call the first metric the Adjusted Benford score (AB-score), and we build it as follows.
First, we verify that, in aggregate, financial statement numbers in Compustat closely follow the
leading digit distribution predicted by Benford’s Law. We then move to a firm-year level of
analysis by constructing, for each firm-year, a raw score that measures how much the leading digit
distribution of financial statement numbers deviates from the distribution predicted by Benford’s
Law. Our raw score is akin to the Financial Statement Divergence Score of Amiram et al. (2015).
We study how the raw score varies across firms, financial statement length, industry grouping, and
time. Guided by the results of this investigation, we construct several standardized variants of the
raw score and include them in a selection model to predict known cases of financial misstatements,
as identified in the SEC’s Accounting and Auditing Enforcement Releases (AAERs). This
selection process flags one particular combination that has the best predictive ability; we call this
the AB-score model. This model produces an odds ratio, the AB-score, which expresses how likely
a firm's financial numbers are to be misstated in a given year relative to the unconditional
likelihood in the sample.
The main advantage of the AB-score is that it can be computed over every firm-year that has
any financial statement information available. As a result, it can be computed over a wider range
of firm-years than prediction models that require specific financial statement inputs such as
accruals. We choose as our benchmark one of the most comprehensive and popular measures of
earnings manipulation, the F-score of Dechow, Ge, Larson, and Sloan (2011; DGLS henceforth).
The F-score has been shown to be very useful in detecting financial misreporting and is widely
used in the accounting and finance literature.3 DGLS compute the F-score as the predicted
probability of a misstatement (an odds ratio) using fitted values from a model that includes balance
sheet items, nonfinancial measures, off-balance-sheet activities, and market-based measures.
DGLS use the SEC’s AAERs as their misstatement indicator, as do we. We compare the
3 Over 150 studies in the finance and accounting literature use the F-score as a metric of financial misstatement according to a Google Scholar search at the time of writing (February 2018).
applicability of the F-score and the AB-score over the original DGLS sample period (1979-2002)
and our full sample period (1979-2011). Specifically, we compare how many AAER and non-
AAER firm-years can be predicted by each measure based on their data requirements. This is
important because models that require a larger number of inputs to identify misstatements often
have a limited scope due to missing data.4 The AB-score can be calculated for about 61% more
firm-years than the F-score during the DGLS sample period. Similarly, the AB-score can be
calculated for about 58% more firm-years than the F-score during our full sample period. About
47% of the additional firm-years that can be estimated by the AB-score model are for financial
firms, and the remainder are non-financial firms that are missing necessary data in Compustat.
In addition to the AB-score, we create a model that includes the Benford-based variables along
with the variables from the F-score model. We call the output of this combined model the ABF-
score. While the ABF-score is limited in scope to the firm-years for which the F-score variables
are available, it offers the benefit of using leading-digit-based information as well as accounting
information to detect misstatements within the smaller sample.
To assess the benefits of using Benford-based variables in predicting misstatements, we first
examine how many AAER firm-years are correctly predicted in-sample by the AB-score, the F-
score, and the ABF-score, using an odds-ratio threshold of 1.0.5 Thanks to its broader sample
coverage, the AB-score correctly predicts the largest number of AAER firm-years, 973, while the
F-score and ABF-score correctly predict 678 and 683, respectively. As a percentage of the sample
to which each model can be applied, all three models perform well: the AB-score correctly predicts
about 73% of the AAER firm-years it can estimate, versus 69% for the F-score and 70% for the
ABF-score. The AB-score has a lower Type II (false negative) error rate than the other two models,
but it comes at the expense of a higher Type I (false positive) error rate. The ABF-score has lower
Type I and Type II error rates as well as a higher correct AAER firm-year prediction rate than the
4 Missing data is an issue because the convention in finance and accounting studies is to drop observations with missing data. For example, Brazel, Jones, and Zimbelman (2009) show that of the 268 AAERs in their 1998 to 2007 sample, 162 had missing or incomplete data to build accounting-based misstatement measures (see their Table 1).
5 Above the threshold of 1.0, a firm-year is more likely to be misstated than a randomly chosen observation from the sample; see Section 4.3.
F-score. The ABF-score’s advantage holds both with an odds-ratio threshold of 1.0 and when
considering all possible thresholds through receiver operating characteristic (ROC) curves.6 Taken
as a whole, the in-sample tests suggest that the AB-score and ABF-score models can improve the
prediction of earnings misstatements, both in terms of increasing the scope of firms that can be
analyzed (with the AB-score) and boosting the accuracy of the F-score accounting-based metric
(with the ABF-score).
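The ROC comparison sweeps a cutoff over each model's odds ratios and records true and false positive rates at every threshold. A minimal sketch of that sweep (illustrative only; variable names are our own, not the paper's code):

```python
def roc_points(scores, labels):
    """True/false positive rates at every cutoff, as in an ROC curve.
    scores: model odds ratios; labels: 1 for AAER firm-years, 0 otherwise."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 0)
        pts.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    return [(0.0, 0.0)] + pts  # the curve starts at the origin
```

A model whose curve rises toward the top-left corner separates AAER from non-AAER firm-years at more thresholds; the area under the curve summarizes accuracy without committing to any single cutoff such as 1.0.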
A common concern about explanatory models is that relationships found in-sample may not
hold out-of-sample. To address this concern, we assess each model’s ability to predict AAER firm-
years out-of-sample using 100 simulations in a random-holdout specification (randomly selecting
half of the observations to estimate the model and testing its predictive ability on the other half of
the observations). The out-of-sample tests confirm the in-sample findings: The AB-score correctly
identifies more AAER firm-years thanks to its broader sample coverage, while as a percentage of
the sample each model can be estimated over, the ABF-score performs best in terms of both correct
prediction rates and error rates at an odds-ratio threshold of 1.0.
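The random-holdout design described above can be sketched generically: in each simulation, half of the observations estimate the model and the other half are scored. In this sketch, `fit` and `score` are placeholders (our own names) for the actual logistic estimation and evaluation:

```python
import random

def random_holdout_eval(X, y, fit, score, n_sims=100, seed=0):
    """Run n_sims random-holdout simulations. fit(X_train, y_train) returns a
    fitted model; score(model, X_test, y_test) returns its out-of-sample
    performance. Returns the list of per-simulation scores."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    results = []
    for _ in range(n_sims):
        rng.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        results.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return results
```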
The AB-score and F-score models approach misstatement detection from very different
perspectives (with the ABF-score combining the two), one rooted in the prevalence of leading
digits and the other capturing specific accounting information. A natural question is whether,
despite these differences, they identify the same misstatement firm-years. To
answer this question, we examine the congruence of the AAER firm-year predictions of the AB-
score, F-score, and ABF-score models. Within the overlap sample, the AB-score (ABF-score)
correctly predicts about 84% (91%) of the AAER firm-years that are correctly predicted by the F-
score, while the F-score correctly predicts about 72% (90%) of the AAER firm-years that are
correctly predicted by the AB-score (ABF-score). Furthermore, we find that the ABF-score is more
successful at correctly identifying AAER firm-years that the F-score is unable to distinguish from
non-AAER firm-years than the F-score is at breaking ABF-score ties. Taken together, these
findings suggest that the ABF-score provides incremental benefit above the F-score.
6 The ROC curve is a diagnostic tool to evaluate the efficacy of a binary model. It plots the true positive rate against the false positive rate for all possible threshold cutoffs.
Finally, we adopt a case-study approach and investigate how the AB-score, F-score, and ABF-
score perform in detecting notorious cases of financial misstatement. We identify ten high-profile
financial misstatement cases during our sample period and calculate the AB-score, F-score, and
ABF-score for each case. Of the 57 AAER firm-years in this sample with Compustat data
available, the AB-score classifies 46 (81%) as likely to be misstated. The F-score and ABF-score
can be computed for only 41 firm-years, of which the F-score classifies 27 (66%) and the ABF-
score classifies 31 (76%) as likely misstated. Furthermore, the AB-score and ABF-score provide
stronger signals of financial misconduct: The average AB-score in this sample is 1.50 and the
average ABF-score is 1.56, implying that these firm-years are about 1.50 to 1.56 times as likely to
be misstated as the average observation. In contrast, the average F-score for these firm-years is
1.27. Overall, the AB-score and ABF-score provide sharper identification of likely misreporting
behavior in this sample of notorious cases.
Our primary contributions to the literature are the new AB-score and ABF-score models. The
AB-score provides a reliable metric for detecting potential misstatements in a much broader set of
firms than the leading F-score metric. Of particular note is the fact that unlike the F-score (and the
ABF-score), the AB-score can be applied to financial firms. In examining the relationship between
managerial compensation and risk in financial firms, Cheng, Hong, and Scheinkman (2015) note
that the conduct of financial firms such as Bear Stearns, Merrill Lynch, AIG, and Lehman Brothers
during the financial crisis underscores the importance of bringing greater scrutiny to their reporting
activities. The AB-score allows such scrutiny. Furthermore, much of the research in misstatement
prediction relies on databases of firms that have been caught. Karpoff, Koester, Lee, and Martin
(2017) find that databases of firms identified as having engaged in misstatement (i.e., ex-post
samples) have several systematic biases that are economically meaningful; predictive models such
as the AB-score and ABF-score alleviate such biases. Given the small sample characteristics and
biases in ex-post misconduct samples, Amiram, Bozanic, Cox, DuPont, Karpoff, and Sloan (2017)
highlight the need for “more robust, and possibly yet-to-be discovered, techniques and
methodologies” for fraud-related research. Our study directly addresses this call.
For non-financial firms with the necessary financial statement variables available, the ABF-
score encompasses both the financial intuition of the F-score variables and the leading-digit
detection of the AB-score, producing a metric with a higher correct classification rate and lower
error rates than the F-score and the AB-score, both in-sample and out-of-sample. Our
recommendation is that researchers use the ABF-score for firm-years with all the necessary
financial statement data available and use the AB-score for financial firms and any other firms
lacking the necessary data. To this end, we will share our programs for constructing the AB-score and
the ABF-score with interested researchers and make them publicly available at a later date.
The remainder of the paper is organized as follows. Section 2 discusses the related literature.
Section 3 describes our data and develops the AB-score model. Section 4 develops the ABF-score
model and tests the models’ abilities to detect misstatements in-sample and out-of-sample, and
Section 5 examines high-profile misstatement cases. Section 6 concludes.
2. Related literature
2.1 Benford’s Law and financial statement numbers
The original research establishing that there is a predictable frequency with which leading
digits occur in a natural distribution began with astronomer Simon Newcomb (1881), who noticed
that books of logarithm tables were generally more worn on the early pages than toward the back.
People seemed to look up numbers beginning with the digits 1 and 2 far more often than they
looked up numbers beginning with the digits 8 and 9. Newcomb later sketched a proof that
numbers beginning with 1 and 2 actually occur more often in nature than numbers beginning with
8 and 9. His proof shows that a randomly selected number should begin with the digit 1 about
log10(2) or 30.1% of the time, the frequency of numbers with leading digit 2 should be log10(3/2)
or about 18%, those with leading digit 3 should be log10(4/3) or about 12%, and so on until the
frequency of 8’s should be 5.1% and that of 9’s should be 4.6%. In general, the probability with
which the leading digit (d) should appear in a distribution of numbers is:
P(d) = log10(1 + 1/d),        (1)
where d = 1, 2, … , 9, and P is the probability associated with that number’s appearance in the
data. Fifty-seven years after Newcomb’s work, physicist Frank Benford rediscovered the property
and did extensive work to provide a more rigorous mathematical underpinning. Benford found
support in over 20,000 entries from 20 different sources, including data on river surface areas,
populations, specific heats of chemical compounds, American League baseball statistics, and
numbers obtained from newspaper and Reader’s Digest articles.7 Since Benford’s (1938) article
gained widespread attention while Newcomb’s (1881) work had been somewhat overlooked, the
law became known as Benford’s Law.
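The probabilities in Equation (1) are easy to verify numerically; a minimal sketch (function name our own):

```python
import math

def benford_prob(d: int) -> float:
    """Benford's Law: probability that d (1-9) is the leading digit, log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

probs = {d: benford_prob(d) for d in range(1, 10)}
# Digit 1 occurs about 30.1% of the time, digit 9 about 4.6%, and the nine
# probabilities sum to exactly 1 (the log10 of a telescoping product, 10/1).
```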
Benford’s Law has been used to investigate data-related irregularities in settings as disparate
as political elections (Klimek, Yegorov, Hanel, and Thurner, 2012), religious activity (Mir, 2014),
and volcanology (Geyer and Marti, 2012). Varian (1972) was an early champion for the use of
Benford’s Law in the social sciences. In an accounting context, Nigrini (1999) shows that
deviations of financial statement numbers and tax-related data from the prediction of Benford’s
Law can be useful to flag cases for further scrutiny. Durtschi, Hillison, and Pacini (2004) show the
value of Benford’s Law as a signaling device to identify accounts more likely to involve
misstatement, thus improving on the random selection process auditors employ to assess the
validity of a firm’s reported numbers. Deviations from Benford’s Law in financial statements
appear to vary over time: Wang (2011) finds an increase in deviations from 1960 to 2011.
In recent work related to financial misconduct, Amiram et al. (2015) use Benford’s Law to
create their Financial Statement Divergence (FSD) Score.8 Our study extends Amiram et al.’s
(2015) work in several ways. First, we show that the relationship they find in a sample of 73 AAER
firm-year observations holds for the universe of 1,336 observations. Second, we refine the FSD
score to account for time-series variation and variation caused by the number of inputs. While
Amiram et al. (2015) use only financial statements that include more than 100 inputs, our sample
contains the universe of Compustat observations with any non-missing financial statement data.
Most importantly, while Amiram et al. (2015) document the relationship between AAERs and
Benford’s Law, we launch an extensive investigation into how to best convert that relationship
7 Benford (1938), Table I.
8 Bowler (2017) and Boyle and Lewis-Western (2018) test the use of the FSD score in an audit setting.
into information that researchers and auditors can use to detect misstatement. Finally, we test our
measures both in- and out-of-sample alongside the leading accounting-based misstatement
measure and investigate the advantages of combining the two approaches.
2.2 The F-score and other metrics used in financial misconduct research
There is no single definition of what constitutes financial misconduct, so there are multiple
approaches used to gauge it. Some studies adopt the strictest definition of financial misconduct:
fraud. These studies are based on small, often hand-collected samples of firms that are sanctioned
for fraud. One such example is Brazel, Jones, and Zimbelman (2009), who begin with a sample of
AAERs and then go through each release to determine whether fraud is established, cross-checking
against other sources. Other studies use broader indications of financial misconduct, including
direct measures such as restatements that arise from U.S. GAAP violations (Burns and Kedia,
2006) and indirect measures such as total accruals (Bayley and Taylor, 2007), earnings
management (Beneish, 1999), and options back-dating (Bernile and Jarrell, 2009).
One of the most advanced and widely used measures to detect financial statement manipulation
is the F-score developed by DGLS (F is for “fudging,” according to one of the authors). DGLS
compile a database of financial misstatements by hand-collecting information in the SEC’s
AAERs, noting whether the firm or employees were named in the AAER and whether the
wrongdoing was related to overstated earnings (understatement of earnings is more likely to be an
unintentional mistake).9 Using this database, DGLS develop a prediction model which provides
the F-score, a scaled probability that can be used to estimate the likelihood of earnings
misstatement.
The F-score is generated from a model that analyzes financial statement data, combining
several accounting variables that have been used in previous studies to signal earnings
management or financial misreporting. DGLS present three such models in decreasing order of
parsimony. The first includes several measures of accruals quality and discretionary accruals. To
gauge whether diminishing firm performance prompts misreporting, it includes annual changes in
9 DGLS make their AAER data available to other researchers to promote research on earnings misstatements.
return on assets and cash sales. To capture financing activities, it includes debt or equity
issuance, and because soft assets may be easier to manipulate, the model includes the ratio of soft
assets to total assets. The second model adds to these variables the abnormal change in the number
of employees because firms may try to boost short-term earnings by cutting employee headcount.
It also adds operating lease activities because leases can be used to frontload earnings. The final
model adds current and lagged market-adjusted returns, because firms may misstate to compensate
for poor performance. DGLS demonstrate that by including financial statement and market
information beyond accruals, the F-score offers a robust approach to detecting misstatements.
A large number of studies use the DGLS F-score in misstatement-related research. For
example, Fang, Huang, and Karpoff (2016) use the F-score to document how short selling, or its
prospect, curbs earnings management. Jia, Van Lent, and Zeng (2014) use the F-score to examine
male CEOs’ facial masculinity and financial misreporting. Bradley, Gokkaya, Liu, and Xie (2017)
use the F-score to gauge the ability of analysts to detect firms engaging in financial misreporting
activities. To test how firm-initiated clawbacks reduce accounting manipulation, Chan, Chen, and
Chen (2013) use the F-score as a metric of financial statement manipulation. DeFond, Lim, and
Zang (2015) use the F-score to assess which client firms present greater engagement risk for
auditors.
Given the widespread use of the F-score in financial misconduct research, we believe it is the
most useful benchmark against which to test the AB-score. In a recent study, Perols, Bowen,
Zimmermann, and Samba (2017) propose other potential benchmarks. They investigate three data
analytic techniques and show that two of these outperform the F-score in detecting AAER firm-
years in a limited sample. We choose the F-score as our benchmark because the Perols et al. (2017)
models have been tested on only a small sample of AAER firm-years (51 out of nearly 1,400),
while the F-score model has been tested more broadly including out-of-sample.
3. Development of the AB-score model
3.1 Data and sample
We use two main data sources for this study. For our financial statement data, we use all
Compustat variables that appear in the balance sheet, income statement, and statement of cash
flows, as in Amiram et al. (2015).10 We obtain data on the SEC-issued AAERs from the Center for
Financial Reporting and Management (CFRM) at University of California, Berkeley.11 Our full
sample period is 1979 – 2011; because AAERs are issued with a lag relative to alleged
misstatement years, we use AAERs issued through 2014 to identify misstatements through 2011.
To facilitate comparisons with the results of DGLS, we also examine their sub-period of 1979 –
2002.
The AAER dataset documents firms that are issued accounting and auditing enforcements by
the SEC at the conclusion of an investigation against the firm, an auditor, or an officer for alleged
accounting and/or auditing misconduct. These releases provide details on the nature of the
misconduct, the individuals and entities involved, and its effect on the financial statements. We
begin with the 1,383 AAERs issued in our sample period, covering 1,909 firm-years. Because our
study focuses on financial misstatements, we filter out 403 actions that do not allege misstated
annual financials. We further eliminate 224 AAERs in which the recipient or misstatement year
cannot be precisely determined. We delete seven AAERs that allege earnings understatement
(instead of inflated earnings) to facilitate comparison between our prediction model and that of
DGLS.12 We lose 189 observations when merging with Compustat data, yielding a final sample of
1,336 distinct firm-years covering 578 AAERs issued to 577 firms.13
Each AAER alleges at least one year of misstated financials, and many identify multiple
consecutive misstated years per firm. Our sample contains AAERs that allege financial
misstatement ranging from one to 16 years.14 Table 1 summarizes the distribution, showing that
10 Because our goal is to predict as many firm-years as possible, we do not require that a firm-year have a minimum number of line items to be included. Amiram et al. (2015) point out that their results are robust to including firm-years with fewer than 100 line items, and our results are robust to excluding firm-years with fewer than 100 line items. 11 We thank Dechow, Ge, Larson, and Sloan for making these data available. The data collection procedure for the AAER dataset is described in detail in DGLS. 12 Including the seven understatement AAERs strengthens our results in further analyses. 13 In our sample period, Time Warner AOL receives AAERs related to two separate material accounting misstatements. The first alleges 1995-1996 financials to be misstated, and the second relates to 2000-2002 financials. 14 In 16 cases, a single AAER from the CFRM database alleges misstatement over a non-contiguous time horizon. We compute these AAERs’ durations as the difference between the first and last AAER-year rather than treating each as multiple AAERs with shorter durations.
the mean (median) AAER in our sample alleges 2.31 (2.00) years of financial misstatement.
[Table 1 here]
We establish the applicability of Benford’s Law to the Compustat universe by examining the
leading digit distribution of all non-missing financial statement variables in Compustat over our
sample period. Each firm-year must have at least one non-missing financial statement variable in
Compustat to be included.15 Table 2 presents the results.
[Table 2 here]
In the full Compustat sample, there are 10.73 million numbers that begin with the leading digit
1 and 1.56 million beginning with the leading digit 9. As a percentage of the total (34.88 million
numbers), the leading digit 1 appears with a frequency of 30.76% (10.73/34.88) while numbers
with a leading digit of 9 appear with a frequency of 4.47% (1.56/34.88). The comparable
predictions from Benford’s Law in Equation (1) are 30.10% for leading digit 1 and 4.58% for
leading digit 9. The mean absolute deviation of the observed distribution from the predicted
distribution of leading digits is 0.1580%. We scale this by 100 and arrive at the raw Benford score
(B_Raw score) of 0.1580. Figure 1 shows the close fit of the aggregate leading digit distribution
in our sample to the distribution predicted by Benford’s Law.
[Figure 1 here]
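Our reading of the B_Raw construction (mean absolute deviation of observed leading-digit frequencies from the Benford prediction, scaled by 100) can be sketched as follows; the helper names are our own, and zeros and missing values are dropped since they have no leading digit:

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number, e.g. 168 -> 1, 0.05 -> 5."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation puts the digit first

def b_raw(values):
    """Mean absolute deviation of observed leading-digit frequencies from
    Benford's Law, scaled by 100 (our reading of the paper's B_Raw score)."""
    digits = [leading_digit(v) for v in values if v is not None and v != 0]
    if not digits:
        raise ValueError("no nonzero financial statement numbers")
    counts = Counter(digits)
    n = len(digits)
    return 100 * sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in BENFORD) / 9
```

Under this reading of the scaling, applying the same formula to the aggregate Compustat frequencies above yields the 0.1580 magnitude reported in the text.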
3.2 Adjusting the Raw Benford Score
Although the overall distribution of leading digits in financial statement numbers in Compustat
closely follows the distribution predicted by Benford’s Law, at the firm-year level there is
significant variation (Amiram et al., 2015). To examine this variation, we calculate the B_Raw
score for each firm-year.
Our goal is to assess how the B_Raw score behaves in the cross-section and over time so that
we can fine-tune its usefulness as a predictor of earnings misstatement. We first consider financial
statement length. Bowler (2017) shows that the Benford score is vulnerable to continuity frictions
when a smaller pool of numbers is used to compute it. For example, Benford's Law states that the
15 Restricting the sample to firm-years with at least 100 variables, as in Amiram et al. (2015), yields identical inference.
leading digit nine should appear approximately 4.6% of the time. For a firm-year with 50
Compustat numbers, the expected count is 2.3 occurrences, so a mechanistic deviation arises
whether the observed count is two or three. Furthermore, individual line items' leading-digit
deviations constitute a larger percentage of the total when there are fewer line items. Thus we expect a
mechanistic, negative relationship between the number of line items and the B_Raw score for a
firm. In Panel A of Figure 2 we plot the average B_Raw score for each firm-year against the number
of inputs (line items) used to compute the score; the graph shows a clear negative slope. This result
suggests that a simple comparison of the B_Raw scores of two firms to proxy for relative
likelihoods of financial statement manipulation may be misleading if the firms’ financial
statements are of different lengths.
[Figure 2 here]
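The mechanistic link between statement length and the raw score can be illustrated by simulation: even digits drawn exactly from Benford's Law show larger deviations in smaller samples. A sketch (our own illustration, not the paper's analysis):

```python
import math
import random

BENFORD_P = [math.log10(1 + 1 / d) for d in range(1, 10)]

def simulated_b_raw(n_items: int, rng: random.Random) -> float:
    """B_Raw-style score for n_items leading digits drawn from Benford's Law."""
    counts = [0] * 9
    for _ in range(n_items):
        d = int(10 ** rng.random())  # 10**u for u ~ U[0,1) has Benford leading digits
        counts[d - 1] += 1
    return 100 * sum(abs(c / n_items - p) for c, p in zip(counts, BENFORD_P)) / 9

rng = random.Random(0)
avg_short = sum(simulated_b_raw(50, rng) for _ in range(200)) / 200   # ~50 line items
avg_long = sum(simulated_b_raw(500, rng) for _ in range(200)) / 200   # ~500 line items
# avg_short exceeds avg_long even though nothing is manipulated, which is
# why B_Raw must be adjusted for the number of inputs.
```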
We next examine whether the B_Raw score varies across industries, motivated by the fact that
research on earnings management, discretionary accruals, and financial reporting quality generally
controls for industry classification (e.g., Bergstresser and Philippon, 2006). Panel B of Figure 2
shows that the B_Raw scores in our sample exhibit moderate heterogeneity across two-digit SIC
industries. The third dimension we examine is how the B_Raw score behaves over time, since
academic research and the popular press report that financial misconduct is more concentrated in
certain periods. For example, at the turn of this century, the dot-com bust was followed by the
revelation of several financial scandals including Enron, Tyco, and WorldCom, prompting the
expansive Sarbanes-Oxley Act to strengthen existing financial disclosure rules and mandate new
ones. Panel C of Figure 2 shows that the B_Raw score varies over time, with a sharp peak around
the dot-com bubble. Finally, we compute a firm-level B_Raw score to examine how much firms’
average B_Raw scores vary from one another. Panel D of Figure 2 shows there is significant firm-
level heterogeneity in B_Raw scores.
The results of this examination suggest that the B_Raw score for a firm-year should be adjusted
to account for predictable cross-sectional and time-series variations if it is to be compared across
firms and over time. We calculate four such adjusted measures, where each adjusts for baseline
differences in one of the four dimensions examined above (number of inputs, year, industry, and
firm). The four adjusted Benford score measures are:
B_Input adjusts the B_Raw score for the number of inputs used in its computation.
Within each year, observations are sorted into 20 bins by how many non-missing
financial statement numbers they contain.16 We compute the average B_Raw score for
each bin and the standard deviation of B_Raw score within that bin. For each
observation, we subtract the average B_Raw score of its bin and divide by that bin’s
standard deviation.
B_Industry subtracts from each firm-year’s B_Raw score that industry-year’s mean
B_Raw score and divides by the industry-year’s standard deviation of B_Raw score.
B_Year subtracts from each firm-year’s B_Raw score that year’s mean B_Raw score
and divides by the standard deviation of B_Raw scores, calculated across all firms in
that year.
B_Firm subtracts from each firm-year’s B_Raw score that firm’s cumulative (prior to
that year) mean B_Raw score and divides by the firm’s cumulative standard deviation
of B_Raw scores.17
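All four adjustments share the same recipe: subtract a group mean and divide by that group's standard deviation. A minimal pandas sketch, assuming hypothetical column names (`B_Raw`, `year`, `sic2`, `n_items`):

```python
import pandas as pd

def group_standardize(df, score_col, group_cols):
    """Subtract the group mean of the score and divide by the group
    standard deviation -- the common recipe behind B_Input, B_Year,
    and B_Industry (column names here are illustrative)."""
    g = df.groupby(group_cols)[score_col]
    return (df[score_col] - g.transform("mean")) / g.transform("std")

# Illustrative usage on hypothetical columns:
# df["B_Year"]     = group_standardize(df, "B_Raw", ["year"])
# df["B_Industry"] = group_standardize(df, "B_Raw", ["sic2", "year"])
# For B_Input, first sort observations into 20 bins of the line-item
# count within each year, then standardize within year-bin:
# df["input_bin"] = df.groupby("year")["n_items"].transform(
#     lambda s: pd.qcut(s, 20, labels=False, duplicates="drop"))
# df["B_Input"]   = group_standardize(df, "B_Raw", ["year", "input_bin"])
```

B_Firm differs only in that the mean and standard deviation are cumulative, computed over each firm's prior years rather than over a contemporaneous group.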
3.3 Building the AB-score model
We test whether the B_Raw score and the four adjusted measures (B_Input, B_Year,
B_Industry, and B_Firm) can be used to predict material financial misstatements as proxied by
AAER firm-years. We estimate a logistic regression as in Shumway (2001), where the dependent
variable, AAERi,t, is an indicator that assumes the value one if the SEC released an AAER alleging
that firm i’s financials in year t are misstated, zero otherwise. We estimate a model with all five
measures together and a model with measures chosen via a backward elimination technique,
beginning with all of the variables and then using the computational algorithm of Lawless and
16 By normalizing by bins within each year, we implicitly adjust for any trends in the number of line items reported in financial statements over time. Bloomfield (2012) suggests that firm disclosures have been increasing over time because of regulatory requirements.
17 We require data over the prior two years to compute a firm’s cumulative standard deviation of B_Raw scores.
Singhal (1978) as a basis for removing variables, as in DGLS.18 The logistic regressions take the
following form:
\[ AAER_{i,t} = \alpha + \sum_{j=1}^{k} \beta_j \, Benford\_Measure_{j,i,t} + \epsilon_{i,t} \qquad (2) \]
where k = the number of variables included and Benford_Measurei,t is B_Raw score, B_Input,
B_Year, B_Industry, or B_Firm.
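Equation (2) is a pooled logit. A sketch of estimating it with simulated data and scikit-learn (the data-generating process and variable roles are illustrative, not the paper's actual data or software):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-ins: columns of X play the roles of Benford measures
# (e.g., B_Raw, B_Input, B_Year); y flags AAER firm-years.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 1 / (1 + np.exp(-(X[:, 0] - 2)))).astype(int)

# Pooled logit as in equation (2); a large C approximates unpenalized MLE.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
print(model.intercept_, model.coef_)
```

Backward elimination would then drop the least informative measures one at a time, refitting at each step, until only significant predictors remain.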
[Table 3 here]
Specification (1) in Table 3 includes all five variables, while specification (2) includes only
the three explanatory variables B_Raw, B_Input, and B_Year, which are chosen via backward
elimination. Specification (2) in Table 3 has the higher predictive power. In this model the
individual coefficients are not of primary interest, since they suffer from multicollinearity; our
interest going forward is the model’s ability to predict AAER firm-years. The estimates in
specification (2) indicate that B_Input and B_Year are significant and have incremental power over
B_Raw in explaining AAERs. We call this specification the Adjusted Benford score (AB-score)
model. In the following sections we examine the scope, accuracy, and predictive power of the AB-
score.
4. Testing the performance of the AB-score
In this section we first replicate the DGLS F-score models and develop an additional model,
the ABF-score model, which combines the AB-score and F-score variables into a single model.
Second, we examine the scope of each model, comparing how many AAER and non-AAER firm-
years each model can be applied to given its data requirements. Third, we compare the in-sample
performance of the three models, followed by formal out-of-sample tests for model evaluation.
Finally, we examine the overlap between the three models’ predictions.
4.1 Replication of F-score model and development of ABF-score model
Testing the efficacy of any metric requires a benchmark: What are we comparing it to? To
18 Using forward or stepwise methods, instead of backward, yields the same set of variables for inclusion.
examine the performance of the AB-score model, we choose the F-score model of DGLS as the
benchmark because of its prominence in the literature. The F-score integrates disparate warning
signals of financial misreporting into a comprehensive measure, the odds that a firm is “cooking
the books.”
DGLS predict the issuance of AAERs using three models in decreasing order of parsimony.
First, they use backward selection to build a baseline model with seven predictors: (1) change in
noncash net operating assets (RSST_Accruals), (2) change in receivables (Chg_Rcv), (3) change
in inventory (Chg_Invt), (4) percent soft assets (Pct_SoftA), (5) change in cash sales
(Chg_CashSales), (6) change in return on assets (Chg_ROA), and (7) an indicator equal to 1 if the
firm issued debt or equity during that year, 0 otherwise (Issue). Their second specification adds
(8) abnormal change in employees (Abn_Chg_Emp) and (9) an indicator equal to 1 if the company
has operating leases, 0 otherwise (OL). Their third model adds (10) market-adjusted stock returns
(MASR) and (11) one-year-lagged market-adjusted stock returns (Lag_MASR). We refer to these
three models as F-score M1, F-score M2, and F-score M3, respectively. Our goal is to test how the
AB-score model performs at AAER and non-AAER firm-year prediction in comparison to the
DGLS F-score models. We first carefully replicate the DGLS estimation and then examine the
incremental explanatory power of the AB-score. To that end, we compute the variables that enter
the DGLS estimation, both over their sample period and over our full sample period, and present
the results alongside the ones reported in the DGLS study.19
[Table 4 here]
Table 4 provides descriptive statistics for the AB-score and F-score variables for the entire
sample (Panel A) and for AAER firm-years (Panel B). All continuous variables are winsorized at
19 We calculate the variables following the description in DGLS (pp. 35-38). Because bank and insurance company financial statements substantially differ from industrials in accrual variables, DGLS exclude the two industries. Following suit, we drop observations with 2-digit SIC codes from 60 to 69 when running their models. However, prediction using our Benford variables makes no distinction by industry. Therefore, we retain all observations when running the AB-score model. Note that we force our AAER sample to match the one used in DGLS. This is important because some AAERs to which DGLS did not have access allege misstatement within their 1979-2002 sample period. DGLS also follow Richardson, Sloan, Soliman, and Tuna (2005) in setting missing Compustat data items 9, 32, 34, 130, and 193 to zero. We follow this approach when computing the F-score but note that not doing so materially reduces the number of observations over which F-score can be computed.
1% and 99%, as in DGLS. For the full sample, mean and median B_Raw scores are slightly over
3, similar to the 2.96 (percent) FSD scores reported in Amiram et al. (2015) for their smaller
sample. Comparing across rows, the means and medians of the F-score variables are fairly closely
replicated in our sample and the DGLS sample. Comparing across the two panels of Table 4, we
find that the B_Raw score is lower in the AAER firm-years (Panel B), consistent with our results
in Table 3 and those reported in Amiram et al. (2015).
To lay the groundwork for comparing the AB-score and F-score models, we next present each
model’s estimated coefficients and compare our replication to the original F-score coefficients in
DGLS. For each model, we run logistic regressions to predict AAER firm-years, where the
dependent variable is a dummy that equals 1 if the observation is an AAER firm-year, 0 otherwise.
We estimate five models. Panel A presents the three F-score models in decreasing order of
parsimony. Panel B presents coefficients from the AB-score model and the ABF-score model,
which includes the three variables from the AB-score model and the seven variables from F-score
M1.20 We estimate each of these models over two time periods, the DGLS sample period (1979-
2002) and our full sample period (1979-2011).
[Table 5 here]
Table 5 presents the results. Comparing the “Reported” column of each of the three F-score
models with the estimates we obtain (under the column labeled “1979-2002”) shows that our
regressions closely reproduce the coefficient estimates reported in each of the three F-score
models.21 The coefficient estimates for 1979-2011, which includes nine additional years, diverge
somewhat from the 1979-2002 estimates but are still similar. Finally, the coefficient estimates on
the F-score variables in the ABF-score model are fairly close to those estimated in our F-score
20 We use F-score M1 in the ABF-score model because it is the most parsimonious of the F-score models, requiring the fewest inputs. Using M2 or M3 would further reduce the ABF-score model’s coverage relative to the AB-score model’s.
21 One likely reason that our replication exercise in Table 5 does not produce perfect matches for the DGLS coefficients is that Compustat backfills historical data (Cohen, Polk, and Vuolteenaho, 2003). The DGLS authors downloaded their data from Compustat sometime before 2011 (their paper’s publication date), and we downloaded our data from Compustat in 2017. In ongoing work, we are repeating the replication exercise using data that were downloaded from Compustat in 2012 to estimate the effects of such backfilling.
model replication as well as the ones reported in the DGLS paper. This exercise supports the
validity of our replication method, paving the way for us to use the F-score as a benchmark for
analyzing the accuracy and effectiveness of the AB-score and ABF-score.
4.2 Comparison of model scope
DGLS show that as input requirements increase, the number of AAER firm-years over which
their models can be predicted decreases monotonically, making a case for model parsimony. In the
same spirit, we begin by examining how many AAER and non-AAER firm-years each of the three
F-score models, the AB-score model, and the ABF-score model can be applied to. Table 6 presents
the results.
[Table 6 here]
The first row shows that for the most parsimonious version of their model, F-score M1, DGLS
report that 494 AAER and 132,967 non-AAER firm-years’ F-scores can be estimated (columns
labeled “DGLS 1979-2002”). We find similar numbers of observations in our replication of their
sample period (columns labeled “Replication 1979-2002”): 492 compared to the 494 AAER firm-
years that DGLS estimate (99.6%) and 132,139 compared to the 132,967 non-AAER firm-years
that DGLS estimate (99.4%). The number of observations that can be estimated drops for F-score
M2 and M3, shown in the second and third rows, as each successive F-score model requires more
inputs. The AB-score model, which has less demanding data requirements, can be estimated for
697 AAER firm-years (41.7% more than F-score M1) and 212,902 non-AAER firm-years (61.1%
more than F-score M1) over the 1979-2002 period; a similar sample expansion occurs over the
longer 1979-2011 period. The ABF-score model has the same observational counts as F-score M1
because both have the same binding input requirements from Compustat.
4.3 In-sample comparisons
The main goal of this paper is to improve the detection of financial misstatements. To that end,
in this subsection we examine each candidate model’s ability to correctly predict AAER and non-
AAER firm-years within-sample; in the following subsection we perform out-of-sample tests. For
comparison we use F-score M1 in this and all subsequent analyses because it requires the fewest
inputs, which gives the F-score model the broadest sample coverage. Each model is
applied to predict AAER and non-AAER firm-year observations, and each observation’s odds of
being a misstated firm-year are determined from that model’s coefficients.
We estimate an observation’s odds of being an AAER firm-year under each model following
the methodology of DGLS. First, using the full sample period 1979-2011, we compute the
unconditional probability that an observation is an AAER firm-year by dividing the number of
AAER firm-years by the number of total firm-years. Next, we obtain the predicted value for the
dependent variable by multiplying the independent variable matrix by the coefficient matrix. We
then determine the conditional probability of an observation being an AAER firm-year by
exponentiating the predicted value (using base e) and dividing by one plus that amount. Finally,
we determine an observation’s odds of being misstated relative to a random observation by
dividing the conditional probability by the unconditional probability.22 The average firm has an
odds ratio of 1; the higher a firm-year’s odds ratio, the higher its probability of misstatements.
We compare each observation’s odds of being an AAER firm-year against a threshold.
Observations with odds greater than or equal to (less than) the threshold are classified as likely
AAER firm-years (non-AAER firm-years). A threshold of 1.0 has an intuitive interpretation: At
odds of 1.0, an observation is as likely to be an AAER firm-year as a random observation pulled
from the sample. Those with odds above (below) are more (less) likely.
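The odds computation and threshold classification described above can be sketched as follows (a minimal illustration of the DGLS recipe; array names are hypothetical):

```python
import numpy as np

def aaer_odds(X, coefs, intercept, y_full):
    """DGLS recipe: logistic (conditional) probability from the fitted
    coefficients, divided by the unconditional AAER rate in the full
    sample, gives each observation's odds of being a misstated
    firm-year relative to a random observation."""
    unconditional = y_full.mean()                # AAER firm-years / total
    xb = intercept + X @ coefs                   # predicted value
    conditional = np.exp(xb) / (1 + np.exp(xb))  # logistic transform
    return conditional / unconditional

def classify(odds, threshold=1.0):
    """Flag observations at or above the odds threshold as likely AAERs."""
    return (odds >= threshold).astype(int)
```

At the default threshold of 1.0, an observation is flagged exactly when the model deems it at least as likely to be misstated as a random draw from the sample.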
Table 7 reports each model’s sample coverage and accuracy. The first column reports the
number of firm-year observations that can be estimated by each model (of the 296,645 firm-year
observations from Compustat in the 1979-2011 period). Observations correctly identified as
misstated firm-years are counted under the Correct AAER firm-years column; those correctly
identified as not misstated, under Correct Non-AAER firm-years; those erroneously identified as
misstated, under Type I Error; and those erroneously identified as not misstated, under Type II
Error. The last column reports how many of the Compustat firm-year observations cannot be
classified by each model (Unclassified).
[Table 7 here]
22 DGLS illustrate this procedure on p. 61.
Panel A presents the results for all three models over the samples for which each can be
estimated, using an odds-ratio threshold of 1.0. The AB-score model correctly identifies 72.8% of
the AAER firm-years, which compares favorably with the 69.4% of AAER firm-years correctly
identified by the F-score model. In terms of the number of AAER firm-years correctly identified,
the AB-score does considerably better (973 versus 678 AAER firm-years for F-score) because of
its broader sample coverage. In addition to correctly identifying true AAER firm-years, we also
care about minimizing the number of false positives (Type I errors), i.e., non-AAER observations
that are erroneously flagged by the model as likely to be misstated, and false negatives (Type II
errors). The F-score model has a lower Type I error rate than the AB-score model, while the AB-
score model has a lower Type II error rate. The Type II error rate, which captures observations
that are mistakenly identified as not misstated, is generally of greater concern to auditors than the
Type I error rate (Carcello, Vanstraelen, and Willenborg, 2009) because auditors are more likely
to be sued for failure to detect misstatements (Bonner, Palmrose, and Young, 1998). Auditors
would suffer more if they gave a green light to misstated financial statements than if they treated
correct financial statements as suspect (in the latter case, in the process of trying to detect the non-
existent errors the auditors would likely discover that the statements were correct). Finally, Panel
A shows that the ABF-score model performs slightly better than the F-score model in-sample, with
a few more AAER firm-years correctly identified (though not as many as the AB-score) and lower
Type I and Type II error rates.
Panel B of Table 7 presents a closer look at the 109,233 firm-year observations that cannot be
predicted by the F-score and ABF-score models. The AB-score model performs well in this subset
overall, and about equally well for the 47% of observations that are financial firms and the 53%
that are non-financial firms missing some data in Compustat. As in the full sample (Panel A), we find
correct AAER prediction rates of over 70% (Type II error rates below 30%) for both subsamples,
suggesting that the AB-score is a good metric for identifying possible misstatements in firms that
cannot be estimated by the F-score model.23
Table 7 applies the intuitive threshold of 1.0, but the relative performance of the models can
vary with the odds ratio threshold chosen. We next construct ROC curves as a more formal test of
the models' predictive power. The ROC curve plots a model’s true positive rate against its false
positive rate across every possible threshold; a higher area under the curve (AUC) indicates that a
model is more effective at distinguishing between positive and negative outcomes when all
thresholds are considered. The AUC can range from 50% (purely random prediction) to 100%
(perfect prediction). An AUC of 60% is generally considered desirable in low-information
environments, while an AUC of 70% is desirable in information-rich environments (Berg, Burg,
Gombovic, and Puri, 2018; Iyer, Khwaja, Luttmer, and Shue, 2016).
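The ROC/AUC machinery can be sketched with scikit-learn on simulated scores (the data below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Simulated scores: the "model" assigns higher scores to true AAERs,
# plus noise, so the AUC should land between 0.5 and 1.0.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(scale=1.5, size=500)

auc = roc_auc_score(y, scores)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y, scores)  # one point per threshold
print(f"AUC = {auc:.3f}")                    # 0.5 = random, 1.0 = perfect

# Iyer et al. (2016)-style comparison of two models' AUCs:
# improvement = (auc_new - 0.5) / (auc_old - 0.5)
```

Because the ROC curve sweeps every threshold, the AUC summarizes a model's discrimination without committing to any single odds-ratio cutoff.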
[Figure 3 here]
Panel A of Figure 3 compares the ROC curves of the F-score and the ABF-score models for
the firm-year observations over which both models can be estimated. The AUC for the F-score is
70%, while the AUC for the ABF-score is 72.42%. Recall that a completely uninformative model
would have an AUC of 50%; a 1% increase in AUC is considered a noteworthy gain (Iyer et al.,
2016). By the AUC metric, the ABF-score predicts AAER firm-years with 12.1% greater accuracy
than the F-score model.24 Panel B compares the ROC curves for the AB-score and the ABF-score
for the firm-year observations over which they can both be calculated. The AB-score’s AUC is a
respectable 63.77%, but the ABF-score dominates the AB-score at every threshold, with 62.82%
greater accuracy than the AB-score.25 Finally, Panel C presents the AB-score ROC curve for the
firm-year observations that only the AB-score model can estimate (because they are financial firms
or are missing accounting variables required to calculate the F-score and ABF-score). The AB-
score’s AUC in this non-overlapping subsample is 66.32%, better than the AB-score’s AUC in the
23 There is no evidence that misstatements are more common among the firm-years for which F-score cannot be calculated. The actual AAER rate in firm-years in the non-overlapping sample is 0.33%, compared to 0.52% in the overlapping sample.
24 We follow Iyer et al. (2016) in computing the percentage improvement as (0.7242 – 0.5)/(0.7000 – 0.5) = 1.121, where 0.5 (the AUC under a non-informative random model) is subtracted from both AUCs.
25 As above, the percentage improvement is calculated as (0.7242 – 0.5)/(0.6377 – 0.5) = 1.6282.
overlapping sample (63.77% in Panel B) and considerably better than chance (50%).
Taken as a whole, the in-sample tests suggest that models based on the Benford score can
improve the prediction of earnings misstatements, both in terms of increasing the scope of firms
that can be analyzed (with the AB-score) and boosting the accuracy of the accounting-based metric
(with the ABF-score).
4.4 Out-of-sample comparisons
Prediction models are useful when they not only show a good in-sample fit but also perform
well out-of-sample. Thus we next test how well the AB-score, F-score, and ABF-score predict
AAER firm-years out-of-sample. We do so by estimating each model over half of our data and
using the other half for prediction, using a random holdout specification. We randomly select half
of the firm-year observations from the full sample to calibrate the model and use the estimated
coefficients to obtain predicted values in the other half. The random holdout approach has two
advantages over a simple partitioning into early and late subsamples, namely (i) preserving the full
time period span in both the calibration and prediction subsamples, and (ii) allowing multiple
simulations, which together yield more stable, representative relationships between misstatement
predictors and observed instances of misstatement.26 We repeat the random holdout procedure 100
times. We report the mean number and percentage of correctly predicted AAER and non-AAER
firm-years and their associated Type I and Type II errors in Table 8.
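The random holdout procedure can be sketched as follows (simulated data; the 50/50 split, 100 repetitions, and odds threshold of 1.0 mirror the design described above, while the model and data-generating process are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated panel: one informative predictor, rare positive outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1 / (1 + np.exp(-(X[:, 0] - 2)))).astype(int)

hit_rates = []
for sim in range(100):                                   # 100 random holdouts
    idx = rng.permutation(len(y))
    train, test = idx[:len(y) // 2], idx[len(y) // 2:]   # 50/50 split
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X[train], y[train])
    p = model.predict_proba(X[test])[:, 1]
    odds = p / y[train].mean()           # odds vs. the calibration base rate
    flagged = odds >= 1.0                # threshold of 1.0, as in the text
    if y[test].sum() > 0:
        hit_rates.append(flagged[y[test] == 1].mean())   # correct AAER rate

print(f"mean correct AAER rate: {np.mean(hit_rates):.3f}")
```

Averaging across the 100 draws smooths out the sampling noise that a single early/late split would leave in the reported error rates.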
[Table 8 here]
Panel A presents the results for all three models using an odds-ratio threshold of 1.0. The AB-
score model correctly identifies 72.5% of the AAER firm-years, while the F-score and ABF-score
models correctly identify 73.5% and 73.6%, respectively, similar to the in-sample rates (72.8%,
69.4%, and 69.9%, respectively, in Table 7). The out-of-sample prediction rates suggest that the
models do not suffer from overfitting. The AB-score continues to outperform in terms of the
number of AAER firm-years correctly predicted, reflecting the AB-score’s broader coverage,
while the ABF-score again delivers the lowest Type I and Type II error rates.
26 Using a simple partition into early and late subsamples yields similar results (see Internet Appendix).
The ideal model would minimize both Type I and Type II errors, but in practice the two are
traded off against each other. If regulators or auditors are not constrained in how many
investigations they can undertake, they may prefer a model that over-identifies firm-years as likely
misstated but captures more true misstatements. Such a model minimizes Type II error at the cost
of Type I error. However, if resources are limited, the investigator may prefer a less conservative
model which fails to identify more misstatement years but reduces the total number of observations
that require follow-up. In this case, Type I error is minimized. In Panels B and C we repeat our
out-of-sample simulations using threshold odds ratios of 0.7 and 1.3, respectively. As expected, a
lower threshold identifies more correct AAER firm-years but also leads to more false positives
(lower Type II and higher Type I errors), while the higher threshold does the opposite. Across all
three thresholds, the AB-score continues to identify the largest number of correct AAER firm-
years, thanks to its broader sample coverage, but the ABF-score has the highest correct AAER
firm-year prediction rate at the 1.3 threshold. While ultimately it is the investigator who must
decide how to set the odds threshold, the results in this table provide useful insight into the trade-
off between Type I and Type II errors in each model.
4.5 Model Overlap
The AB-score and F-score models approach misstatement detection from very different
perspectives (with the ABF-score combining the two), one rooted in the prevalence of leading
digits and the other capturing specific accounting information. A natural question is whether the
models identify the same misstated firm-years or whether they provide incremental predictive
power relative to each other. We address this question first by examining the overlap between the
models’ predictions and then by assessing each model’s ability to discriminate between AAERs
and non-AAERs that another model cannot.
Table 9 reports the percentage of correctly identified observations from each model (in the
overlapping sample) that is correctly identified by the other models, using the threshold odds ratio
of 1.0.
[Table 9 here]
In general, the AB-score is more successful at identifying AAERs correctly identified by the
F-score and the ABF-score than the other way around. The AB-score correctly identifies 84.4% of
the AAER firm-years correctly identified by the F-score and 91.8% of those correctly identified
by the ABF-score, while the F-score (ABF-score) correctly identifies only 72.4% (79.4%) of the
AAERs correctly identified by the AB-score. The ABF-score and F-score are more closely
correlated, with each able to predict about 90% of the other’s correctly predicted AAER firm-
years; this high correlation is not surprising given that the ABF-score includes all the F-score
variables. The high correlation between F-score and ABF-score carries over into identifying non-
AAER firm-years. In contrast, the F-score and ABF-score predict far more of the non-AAER firm-
years correctly predicted by the AB-score than the other way around (69.5% and 86.5% versus
43.2% and 52.5%). Correlations between the scores can also be used to assess model overlap. The
Pearson correlation coefficient between AB-score and F-score is 0.13; between AB-score and
ABF-score it is 0.51; and between F-score and ABF-score it is 0.85. These correlations, together
with the classification overlap results, suggest that the AB-score and F-score provide distinct
information.
To more precisely assess the incremental value of each model, we examine how each model
performs in cases where another model is inconclusive. In particular, we ask how well each model
does at distinguishing misstated versus non-misstated firm-years that have similar scores from
another model. We match each AAER firm-year with non-AAER firm-years that have scores
within 0.0005 of the AAER firm-year’s score. This technique resembles propensity score
matching. We consider both one-to-one matching and one-to-many matching, in which we
compare the AAER firm-year’s score against the mean and median score of its matched
observations. Table 10 reports the results.
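The tie-matching exercise can be sketched as follows (a hypothetical implementation of the one-to-many, mean-comparison variant; the 0.0005 tolerance follows the text):

```python
import numpy as np

def tie_break_success(score_a, score_b, is_aaer, tol=0.0005):
    """For each AAER firm-year, collect non-AAER firm-years whose score
    under model A lies within tol (a 'tie'), then count how often model B
    scores the AAER observation above the mean score of its matches.
    A hypothetical sketch; the paper also uses one-to-one and median
    variants of the comparison."""
    aaer_idx = np.flatnonzero(is_aaer)
    clean_idx = np.flatnonzero(~is_aaer)
    wins, total = 0, 0
    for i in aaer_idx:
        matches = clean_idx[np.abs(score_a[clean_idx] - score_a[i]) <= tol]
        if len(matches) == 0:
            continue                     # no tie to break for this AAER
        total += 1
        if score_b[i] > score_b[matches].mean():
            wins += 1
    return wins / total if total else float("nan")
```

A success rate near 50% would mean model B is no better than a coin flip at ranking an AAER above its tied non-AAER matches; rates well above 50% indicate incremental information.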
[Table 10 here]
Panel A shows that for F-score ties, both the ABF-score and the AB-score assign a higher
misstatement likelihood to the AAER firm-year in over 60% of the matches under all three
matching methods, suggesting that the AB-score and ABF-score add incremental information
above the F-score. Similarly, Panel B shows that both the ABF-score and the F-score are successful
at breaking AB-score ties more than 66% of the time, suggesting that the ABF-score and F-score
add incremental value above the AB-score. In contrast, Panel C shows that neither the AB-score
nor the F-score is much more successful at breaking ABF-score ties than a coin flip (with success
rates at or below 50.3%). Taken together, these results imply that the Benford-related variables (in
the AB-score and ABF-score) and the F-score variables capture distinct sets of information, rather
than capturing the same information through different channels. The ability of the ABF-score to
accurately detect misstatements when the F-score is tied suggests that it can be a valuable tool for
resource-constrained regulators and fraud examiners.
5. Testing the AB-score, F-score, and ABF-score on well-known cases of financial misconduct
As a final examination of the three measures, we assess their performance at detecting the most
notorious misstatement cases during our sample period. We identify ten high-profile cases
perpetrated by publicly traded U.S. firms during our 1979-2011 sample period by conducting
internet searches using keywords such as “financial fraud” and “largest fraud cases.” These cases
resulted in AAERs alleging misstatements in 57 firm-years. The firms involved (ordered by
primary AAER number) are Cendant Corporation (formerly CUC International), WorldCom Inc.,
Enron Corp., Tyco International, HealthSouth Corp., Adelphia Communications Corp., Waste
Management, Inc., Federal National Mortgage Association (Fannie Mae), Qwest Communications
International, and Federal Home Loan Mortgage Corporation (Freddie Mac). Figure 4 presents the
frequency distribution for each of the three scores across all 57 firm-years. Overall, the AB-score
gives the strongest signal, with no firm-year readings below 0.7 and more observations in each of
the higher ranges than the F-score or ABF-score.
[Figure 4 here]
For a closer look at these notorious cases, Table 11 presents the detailed results of applying
the AB-score, F-score, and ABF-score to the specific AAER firm-years.27
[Table 11 here]
Panel A of Table 11 details the ten misstatement cases and shows the AB-score, F-score, and
27 Both Enron and Waste Management are associated with two distinct GVKEYs in Compustat during the years their financials were materially misstated. In both cases, we include firm-year observations for both reporting entities.
ABF-score for each year that an AAER alleges misstatement. For example, for Enron Corp. in
1998 (the first year covered by its AAER), the AB-score is 1.94, the F-score is 1.32, and the ABF-
score is 2.17. All three metrics exceed the odds ratio threshold of 1.0, suggesting likely financial
misstatement for Enron Corp. in 1998. Panel B summarizes the results across the ten prominent
misstatement cases. The AB-score has both greater sample coverage and a higher success rate at
predicting misstatement within the firm-years covered. Of the 57 misstated firm-years in this
sample, the F-score and ABF-score can be computed for only 41 firm-years (69% of the
observations) because they cannot be computed for financial firms (e.g., Fannie Mae and Freddie
Mac) and require specific data not available for other firm-years (e.g., Adelphia Communications
in both years). The AB-score can be computed for all 57 firm-years. The AB-score predicts that
46 firm-years in this misstatement subset (80.7%) have above-average likelihoods of being
misstated (i.e., the AB-score exceeds the 1.0 threshold). In contrast, the F-score predicts that only
26 firm-years in this misstatement subset (45.6% of the firm-years overall) have above-average
likelihood of being misstated. Although it faces the same data limitations as the F-score, the ABF-
score performs better than the F-score, correctly identifying 31 of the firm-years (54.4% of the
firm-years overall). Notably, the ABF-score has the highest mean, hinting at the benefits of
including both the financial statement items from F-score and the numerical patterns from AB-
score in the same model. In this sample of notorious misreporting cases, supplementing the ABF-
score with the AB-score for firm-years when the ABF-score cannot be calculated improves upon
the ABF-score’s predictive ability (see last column in Panel B). Overall, it is reassuring that the
AB-score and ABF-score prediction success rates are higher here than the success rates reported
in Tables 7 and 8, as these are the most egregious cases of misstatement and one would expect
good models to detect more of them, or to detect them with greater ease.
6. Conclusion
As a condition for raising money in public capital markets, firms agree to periodically
communicate their financial health by filing financial statements. While the majority of firms
discharge this duty honestly, some willfully manipulate their financial statements to suggest better
financial health. Since it is not easy for firm outsiders to directly identify which firms manipulate
their financial statements, research in earnings management and financial misconduct uses indirect
metrics that correlate with observed instances of such behavior (i.e., ex-post misstatement).
In this study we offer two new metrics to measure the likelihood of manipulation in a firm’s
financial statements. These metrics are based on Benford’s Law, which predicts the frequency with
which leading digits should appear in naturally occurring distributions of numbers. In aggregate,
financial statement numbers closely follow Benford’s Law, but at the firm-year level there are
several systematic deviations. Controlling for these deviations in backward selection regressions,
we construct a prediction metric we call the Adjusted Benford score (AB-score). A key advantage
of the AB-score is that it can be computed for a larger sample of firm-years than the leading
accounting-based misstatement prediction metric, the F-score (which requires specific accounting
numbers and cannot be computed for financial firms). For firms with the necessary data available
to compute the F-score, we find that including the AB-score and F-score variables together in a
combined model (the ABF-score model) improves predictive ability. We find that the AB-score
performs well at detecting misstatements overall, and the ABF-score provides incremental
prediction value above the F-score.
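The building block behind these metrics, a comparison of observed leading-digit frequencies with those predicted by Benford's Law, is straightforward to compute. The sketch below is purely illustrative (the helper names and sample values are ours, not drawn from the paper's data):

```python
import math

def leading_digit(x):
    """Most significant digit of a nonzero number (e.g. 1250.0 -> 1, 0.0047 -> 4)."""
    digits = str(abs(x)).lstrip("0.").replace(".", "")
    return int(digits[0])

def benford_expected(d):
    """Benford's Law probability that the leading digit equals d (d = 1..9)."""
    return math.log10(1 + 1 / d)

# Illustrative sample of financial-statement numbers
values = [1250.0, 1834.2, 905.1, 47.8, 23.0, 3120.5, 118.9, 2.7, 160.4, 19.9]
counts = {d: 0 for d in range(1, 10)}
for v in values:
    counts[leading_digit(v)] += 1

# Observed share of each leading digit versus the Benford prediction
for d in range(1, 10):
    print(d, counts[d] / len(values), round(benford_expected(d), 4))
```

In practice the comparison is run over all reported financial statement line items for a firm-year, and the deviation from the predicted shares feeds the scores described above.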
In a survey article on the current state of financial reporting misconduct research, Amiram et
al. (2017) point to the gap in our understanding of the estimation errors involved in financial fraud-
related research. While researchers have used several measures to gauge the likelihood, extent, and
damages from financial reporting misconduct, there has been less focus on assessing the
performance of the metrics used in such research. We do extensive testing of the AB-score, F-
score, and ABF-score to validate them as indicators of the likelihood of financial misreporting.
Our bottom-line advice is that researchers interested in misstatement detection should use the
ABF-score for firm-years when the required data are available and the AB-score otherwise.
References
Amiram, Dan, Zahn Bozanic, and Ethan Rouen. 2015. Financial statement errors: evidence from
the distributional properties of financial statement numbers. Review of Accounting
Studies 20(4): 1540-1593.
Amiram, Dan, Zahn Bozanic, James Cox, Quentin Dupont, Jonathan Karpoff, and Richard Sloan.
2017. Financial reporting fraud and other forms of misconduct: A multidisciplinary review of
the literature. Review of Accounting Studies, forthcoming.
Bayley, Luke, and Stephen L. Taylor. 2007. Identifying earnings overstatements: a practical test.
Working paper.
Beneish, Messod. 1999. The detection of earnings manipulation. Financial Analysts Journal 55(5):
24-36.
Benford, Frank. 1938. The law of anomalous numbers. Proceedings of the American Philosophical
Society 78(4): 551-572.
Berg, Tobias, Valentin Burg, Ana Gombovic, and Manju Puri. 2018. On the rise and fall of
FinTechs – credit scoring using digital footprints. Working paper.
Bergstresser, Daniel, and Thomas Philippon. 2006. CEO incentives and earnings
management. Journal of Financial Economics 80(3): 511-529.
Bernile, Gennaro, and Gregg A. Jarrell. 2009. The impact of the options backdating scandal on
shareholders. Journal of Accounting and Economics 47(1-2): 2-26.
Bloomfield, Robert J. 2012. A pragmatic approach to more efficient corporate disclosure.
Accounting Horizons 26(2): 357-370.
Bonner, Sarah E., Zoe-Vonna Palmrose, and Susan M. Young. 1998. Fraud type and auditor
litigation: an analysis of SEC accounting and auditing enforcement releases. The Accounting
Review 73(4): 503-532.
Bowler, Blake D. 2017. Are going concern opinions associated with lower audit impact? Working
paper.
Boyle, Erik S., and Melissa F. Lewis-Western. 2018. The impact of audits on financial statement
error in the presence of incentive and opportunity. Working paper.
Bradley, Daniel, Sinan Gokkaya, Xi Liu, and Fei Xie. 2017. Are all analysts created equal?
Industry expertise and monitoring effectiveness of financial analysts. Journal of Accounting
and Economics 63(2): 179-206.
Brazel, Joseph, Keith Jones, and Mark Zimbelman. 2009. Using nonfinancial measures to assess
fraud risk. Journal of Accounting Research 47(5): 1135-1166.
Burns, Natasha, and Simi Kedia. 2006. The impact of performance-based compensation on
misreporting. Journal of Financial Economics 79(1): 35-67.
Carcello, Joseph V., Ann Vanstraelen, and Michael Willenborg. 2009. Rules rather than discretion
in audit standards: going-concern opinions in Belgium. The Accounting Review 84(5): 1395-
1428.
Chan, Lilian, Kevin Chen, and Tai-Yuan Chen. 2013. The effects of firm-initiated clawback
provisions on bank loan contracting. Journal of Financial Economics 110(3): 659-679.
Cheng, Ing-Haw, Harrison Hong, and Jose Scheinkman. 2015. Yesterday's heroes: compensation
and risk at financial firms. Journal of Finance 70(2): 839-879.
Cohen, Randolph B., Christopher Polk, and Tuomo Vuolteenaho. 2003. The value spread. Journal
of Finance 58(2): 609-641.
Dechow, Patricia, Weili Ge, Chad Larson, and Richard Sloan. 2011. Predicting material
accounting misstatements. Contemporary Accounting Research 28(1): 17-82.
DeFond, Mark, Chee Yeow Lim, and Yoonseok Zang. 2015. Client conservatism and auditor-
client contracting. The Accounting Review 91(1): 69-98.
Durtschi, Cindy, William Hillison, and Carl Pacini. 2004. The effective use of Benford’s law to
assist in detecting fraud in accounting data. Journal of Forensic Accounting 5(1): 17-34.
Dyck, Alexander, Adair Morse, and Luigi Zingales. 2013. How pervasive is corporate fraud?
Working paper.
Fang, Vivian, Allen Huang, and Jonathan Karpoff. 2016. Short selling and earnings management:
A controlled experiment. Journal of Finance 71(3): 1251-1294.
Fich, Eliezer, and Anil Shivdasani. 2007. Financial fraud, director reputation, and shareholder
wealth. Journal of Financial Economics 86(2): 306-336.
Fulghieri, Paolo, Günter Strobl, and Han Xia. 2013. The economics of solicited and unsolicited
credit ratings. Review of Financial Studies 27(2): 484-518.
Geyer, Adelina, and Joan Marti. 2012. Applying Benford's law to volcanology. Geology 40(4):
327-330.
Iyer, Rajkamal, Asim Ijaz Khwaja, Erzo F. P. Luttmer, and Kelly Shue. 2016. Screening peers
softly: inferring the quality of small borrowers. Management Science 62(6): 1554-1577.
Jia, Yuping, Lawrence Van Lent, and Yachang Zeng. 2014. Masculinity, testosterone, and
financial misreporting. Journal of Accounting Research 52(5): 1195-1246.
Karpoff, Jonathan, D. Scott Lee, and Gerald Martin. 2008. The cost to firms of cooking the
books. Journal of Financial and Quantitative Analysis 43(3): 581-611.
Karpoff, Jonathan, Allison Koester, D. Scott Lee, and Gerald Martin. 2017. Proxies and databases
in financial misconduct research. The Accounting Review, forthcoming.
Klimek, Peter, Yuri Yegorov, Rudolf Hanel, and Stefan Thurner. 2012. Statistical detection of
systematic election irregularities. Proceedings of the National Academy of
Sciences 109(41): 16469-16473.
Lawless, Jerald, and Kishore Singhal. 1978. Efficient screening of non-normal regression
models. Biometrics 34(2): 318-327.
Mir, Tariq. 2014. The Benford law behavior of the religious activity data. Physica A: Statistical
Mechanics and its Applications 408(1): 1-9.
Newcomb, Simon. 1881. Note on the frequency of use of the different digits in natural
numbers. American Journal of Mathematics 4(1): 39-40.
Nigrini, Mark. 1999. I've got your number: How a mathematical phenomenon can help CPAs
uncover fraud and other irregularities. Journal of Accountancy 187(5): 79-83.
Perols, Johan, Robert Bowen, Carsten Zimmermann, and Basamba Samba. 2017. Finding needles
in a haystack: Using data analytics to improve fraud prediction. The Accounting Review 92(2):
221-245.
Richardson, Scott, Richard Sloan, Mark Soliman, and A. Irem Tuna. 2005. Accrual reliability,
earnings persistence and stock prices. Journal of Accounting and Economics 39(3): 437-485.
Shumway, Tyler. 2001. Forecasting bankruptcy more accurately: A simple hazard model. Journal
of Business 74(1): 101-124.
Skinner, Douglas, and Suraj Srinivasan. 2012. Audit quality and auditor reputation: Evidence from
Japan. The Accounting Review 87(5): 1737-1765.
Varian, Hal. 1972. Benford's Law (Letters to the Editor). The American Statistician 26(3): 62-66.
Wang, Jialin. 2011. Benford's law and the decreasing reliability of accounting data. Economist's
View blog post, October 12, 2011, http://economistsview.typepad.com/economistsview
/2011/10/benfords-law-and-the-decreasing-reliability-of-accounting-data.html.
Table 1: Distribution of AAERs by number of years covered

Total AAERs: 578    Mean duration: 2.31 years    Median duration: 2.00 years

[Table: number of AAERs by duration, from 1 to 16 years; single-year AAERs are the most common (260 of 578).]

This table summarizes the distribution of the duration of AAERs in our sample. Duration is defined as the number of consecutive years for which a firm's financials are alleged to be misstated by the AAER.
Table 2: Adherence of Compustat numbers to Benford’s Law
Leading Digit 1 2 3 4 5 6 7 8 9 Total
Count (millions) 10.73 6.16 4.33 3.33 2.75 2.29 1.99 1.74 1.56 34.88
Percent of total 30.76% 17.66% 12.40% 9.55% 7.89% 6.57% 5.69% 5.00% 4.47%
Benford Prediction 30.10% 17.61% 12.49% 9.69% 7.92% 6.69% 5.80% 5.12% 4.58%
Deviation 0.66% 0.05% -0.09% -0.14% -0.03% -0.12% -0.10% -0.12% -0.10%
Abs(Deviation) 0.66% 0.05% 0.09% 0.14% 0.03% 0.12% 0.10% 0.12% 0.10%
Mean Abs Deviation 0.1580%
B_Raw score (=100*Mean Abs Dev.) 0.1580
This table illustrates how closely the financial statement numbers reported in the Compustat database follow the leading digit distribution predicted by Benford's Law. Numbers are drawn from balance sheets, income statements, and cash flow statements for U.S. firms in the Compustat database from 1979 to 2011. The first row lists the leading digits 1 through 9. The second (third) row reports the count (percentage) of numbers with the corresponding leading digit. The fourth row reports the proportion with which each digit is expected to appear under Benford's Law. The fifth row reports the deviation from Benford's Law (row three minus row four), and the sixth row presents the absolute value of that difference. Absolute values are averaged across all nine digits in the seventh row to compute the mean absolute deviation (comparable to Amiram et al.'s (2015) FSD Score). Finally, the mean absolute deviation is scaled up by a factor of 100 to create the B_Raw score.
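The bottom rows of Table 2 follow mechanically from the digit shares above. As an illustration, the snippet below recomputes the score from the table's rounded percentages (so it reproduces the reported B_Raw of 0.1580 only approximately):

```python
import math

# Observed leading-digit shares from Table 2 (percent of total)
observed = [30.76, 17.66, 12.40, 9.55, 7.89, 6.57, 5.69, 5.00, 4.47]
# Benford's predicted shares, also in percent: 100 * log10(1 + 1/d)
benford = [100 * math.log10(1 + 1 / d) for d in range(1, 10)]

# Mean absolute deviation across the nine digits, in percentage points;
# expressed this way it equals the B_Raw score (100 times the deviation as a fraction)
b_raw = sum(abs(o - b) for o, b in zip(observed, benford)) / 9
print(round(b_raw, 4))
```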
Table 3: Predicting AAERs using Benford score variables

Dependent variable = 1 for AAER firm-year, 0 for non-AAER firm-year

Variable               (1) All Variables    (2) Selected Variables
B_Raw                  1.7695***            1.4727***
                       (0.000)              (0.000)
B_Input                0.7004***            0.6286***
                       (0.000)              (0.000)
B_Year                 -3.6381***           -3.2933***
                       (0.000)              (0.000)
B_Industry             -0.2024*
                       (0.055)
B_Firm                 -0.00971
                       (0.676)
Intercept              -11.7223***          -10.7153***
                       (0.000)              (0.000)
#Obs                   239,714              296,645
#AAER firm-years       1,196                1,336
#non-AAER firm-years   238,518              295,309

This table estimates the relationship between the SEC's issuance of AAERs and the Raw Benford Score (B_Raw) and four adjustments applied to B_Raw, using logistic regressions. The dependent variable is an indicator that equals 1 for an AAER, 0 otherwise. The independent variables are B_Raw and four standardized adjustments to B_Raw: B_Input accounts for heterogeneity in the number of inputs, B_Year accounts for year differences, B_Industry accounts for industry differences, and B_Firm accounts for firm baseline differences in B_Raw. In specification (1) all variables are included. In specification (2), variables are selected via backward elimination using the computational algorithm of Lawless and Singhal (1978). P-values are in parentheses below coefficient estimates. ***, **, * denote statistical significance at the 1, 5, and 10 percent levels, respectively.
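The backward elimination used for specification (2) follows a general recipe: fit the full model, then repeatedly drop the variable whose removal most improves an information criterion, stopping when no removal helps. The sketch below illustrates that recipe with an AIC criterion and an ordinary least squares fit standing in for the paper's logistic regressions; the data are synthetic and the function names are ours:

```python
import math

def fit_rss(X, y, cols):
    """OLS on the selected columns via the normal equations; returns the
    residual sum of squares (Gaussian elimination is fine for a few columns)."""
    n, k = len(y), len(cols)
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in cols] for a in cols]
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in cols]
    for c in range(k):                      # forward elimination with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        v[c], v[p] = v[p], v[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            v[r] -= f * v[c]
    beta = [0.0] * k
    for c in range(k - 1, -1, -1):          # back substitution
        beta[c] = (v[c] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return sum((y[i] - sum(X[i][col] * beta[j] for j, col in enumerate(cols))) ** 2
               for i in range(n))

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def backward_eliminate(X, y):
    """Drop, one at a time, the variable whose removal most improves the AIC;
    stop when no single removal helps."""
    current = list(range(len(X[0])))
    best = aic(fit_rss(X, y, current), len(y), len(current))
    while len(current) > 1:
        trials = [(aic(fit_rss(X, y, [c for c in current if c != d]),
                       len(y), len(current) - 1), d) for d in current]
        score, drop = min(trials)
        if score >= best:
            break
        best = score
        current = [c for c in current if c != drop]
    return current

# Synthetic data: y depends on columns 0-2; column 3 is irrelevant noise.
X = [[1.0, i / 10.0, ((i * 7) % 13) / 6.0, math.cos(3.0 * i)] for i in range(30)]
y = [1.0 + 2.0 * row[1] - row[2] + 0.1 * math.sin(i) for i, row in enumerate(X)]
print(backward_eliminate(X, y))
```

The Lawless and Singhal (1978) algorithm the paper uses makes this search computationally efficient for logistic models; the brute-force refitting above conveys only the selection logic.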
Table 4: Descriptive statistics
Panel A: All Compustat firm-years
Variable # Obs Mean Median Std Dev # Obs Mean Median Std Dev # Obs Mean Median Std Dev
Inputs 296,645 117.588 117.000 31.900 177,452 116.305 117.000 24.807 -- -- -- --
B_Raw 296,645 3.444 3.200 1.337 177,452 3.293 3.102 1.195 -- -- -- --
B_Input 296,645 -0.003 -0.081 0.979 177,452 0.014 -0.056 0.985 -- -- -- --
B_Year 296,645 -0.014 -0.181 0.918 177,452 -0.119 -0.253 0.837 -- -- -- --
B_Industry 296,604 -0.011 -0.148 0.934 177,418 -0.010 -0.139 0.937 -- -- -- --
B_Firm 239,742 -0.048 -0.150 1.800 138,553 -0.018 -0.125 1.852 -- -- -- --
RSST_Accruals 228,026 0.025 0.026 0.366 156,167 0.033 0.028 0.334 151,862 0.032 0.026 --
Chg_Rcv 263,885 0.019 0.007 0.091 159,302 0.018 0.008 0.091 151,928 0.017 0.008 --
Chg_Invt 265,837 0.008 0.000 0.061 160,206 0.011 0.000 0.068 152,741 0.011 0.000 --
Chg_CashSales 248,722 0.126 0.061 1.440 150,593 0.172 0.070 1.164 135,333 0.208 0.079 --
Pct_SoftA 266,719 0.542 0.570 0.283 161,578 0.505 0.530 0.258 167,982 0.509 0.535 --
Chg_ROA 242,945 -0.007 -0.001 0.267 142,082 -0.011 -0.002 0.244 140,380 -0.010 -0.002 --
Issue 278,660 0.824 1.000 0.381 174,828 0.826 1.000 0.380 166,712 0.826 1.000 --
Abn_Chg_Emp 221,149 -0.092 -0.046 0.545 136,946 -0.095 -0.050 0.545 134,837 -0.093 -0.049 --
OL 296,645 0.651 1.000 0.477 177,452 0.699 1.000 0.459 168,481 0.710 1.000 --
MASR 191,680 0.065 -0.070 0.842 118,385 0.051 -0.106 0.904 110,303 0.008 -0.114 --
Lag_MASR 190,664 0.107 -0.061 1.044 117,684 0.093 -0.098 1.111 99,197 0.030 -0.099 --
This table displays the number of observations, mean, median, and standard deviation for the number of inputs into the Benford variables and the variables used in the DGLS models for the full sample period (1979-2011) and the sample period used in DGLS (1979-2002). B_Raw is the raw Benford Score; B_Input is the raw Benford Score adjusted for the number of inputs used to compute B_Raw; B_Year is the time-series adjusted Benford Score; B_Industry and B_Firm are the industry- and firm-adjusted Benford scores, respectively; RSST_Accruals is change in noncash net operating assets; Chg_Rcv is change in receivables; Chg_Invt is change in inventory; Pct_SoftA is percent soft assets; Chg_CashSales is change in cash sales; Chg_ROA is change in return on assets; Issue is an indicator equal to 1 if the firm issued debt or equity during that year, 0 otherwise; Abn_Chg_Emp is abnormal change in employees; OL is an indicator equal to 1 if the company has operating leases, 0 otherwise; MASR is market-adjusted stock returns; and Lag_MASR is one-year-lagged market-adjusted stock returns. The last three columns report the mean and median values and the number of observations reported in DGLS Table 6 for comparison. Panel A summarizes the full sample of U.S. firm-years in Compustat between 1979 and 2011; Panel B presents the subset containing only those firm-years for which an AAER alleges overstatement of income.
Column groups, left to right: Full Sample (1979-2011); DGLS Sample (1979-2002); DGLS Table 6.
Panel B: AAER firm-years

Column groups, left to right: Full Sample (1979-2011); DGLS Sample (1979-2002); DGLS Table 6.

Variable        # Obs   Mean     Median   Std Dev   # Obs  Mean     Median   Std Dev   # Obs  Mean    Median   Std Dev
Inputs          1,336   133.160  131.000  30.406    624    124.607  122.000  25.784    --     --      --       --
B_Raw           1,336   3.015    2.871    1.016     624    3.054    2.914    1.026     --     --      --       --
B_Input         1,336   -0.050   -0.133   0.948     624    -0.042   -0.125   0.955     --     --      --       --
B_Year          1,336   -0.319   -0.421   0.699     624    -0.298   -0.409   0.716     --     --      --       --
B_Industry      1,336   -0.255   -0.357   0.785     624    -0.172   -0.302   0.854     --     --      --       --
B_Firm          1,196   -0.176   -0.269   1.795     523    -0.110   -0.159   1.958     --     --      --       --
RSST_Accr       1,102   0.111    0.062    0.317     556    0.115    0.061    0.359     557    0.126   0.074    --
Chg_Rcv         1,261   0.048    0.025    0.102     581    0.059    0.031    0.116     561    0.061   0.036    --
Chg_Invt        1,239   0.025    0.001    0.072     572    0.040    0.007    0.089     557    0.039   0.008    --
Chg_CashSales   1,214   0.364    0.155    1.403     546    0.467    0.182    1.443     501    0.492   0.217    --
Pct_SoftA       1,236   0.646    0.694    0.228     577    0.644    0.678    0.213     604    0.642   0.682    --
Chg_ROA         1,200   -0.011   -0.004   0.198     528    -0.030   -0.013   0.240     506    -0.024  -0.012   --
Issue           1,297   0.948    1.000    0.223     618    0.930    1.000    0.255     599    0.932   1.000    --
Abn_Chg_Emp     1,142   -0.173   -0.067   0.749     506    -0.221   -0.090   0.862     489    -0.223  -0.103   --
OL              1,336   0.828    1.000    0.378     624    0.838    1.000    0.369     604    0.821   1.000    --
MASR            1,181   0.172    -0.025   0.951     535    0.188    -0.088   1.107     463    0.193   -0.113   --
Lag_MASR        1,168   0.233    0.017    1.059     526    0.261    0.006    1.178     393    0.332   0.031    --
Table 5: Estimated model coefficients
Panel A: F-Score Models Replication
Reported 1979-2002 1979-2011 Reported 1979-2002 1979-2011 Reported 1979-2002 1979-2011
RSST_Accr 0.79 0.66 0.62 0.67 0.62 0.62 0.91 0.49 0.60
Chg_Rcv 2.52 2.21 1.98 2.46 2.27 1.91 1.73 2.59 2.01
Chg_Invt 1.19 1.84 0.89 1.39 1.84 0.81 1.45 1.64 0.50
Pct_SoftA 1.98 2.23 1.89 2.01 2.09 1.76 2.27 2.17 1.87
Chg_CashSales 0.17 0.13 0.09 0.16 0.11 0.09 0.16 0.12 0.09
Chg_ROA -0.93 -0.94 -0.44 -1.03 -1.05 -0.54 -1.46 -1.20 -0.73
Issue 1.03 0.98 1.39 0.98 0.98 1.35 0.65 0.76 1.12
Abn_Chg_Emp -0.15 -0.14 -0.09 -0.12 -0.15 -0.11
OL 0.42 0.61 0.73 0.35 0.46 0.58
MASR 0.08 0.00 0.00
Lag_MASR 0.10 0.00 0.00
Intercept -7.89 -7.96 -7.72 -8.25 -8.39 -8.23 -7.97 -8.01 -7.80
Panel A of this table reports coefficients from the three F-score models (M1, M2, and M3). The column labeled Reported presents the coefficients from the corresponding F-score models, as reported in Table 7 of DGLS. For each model, the table also presents coefficients estimated over the DGLS sample period (1979-2002) and over the full sample period (1979-2011). Panel B presents estimated coefficients from the AB-score model and the ABF-score model, which includes all variables from AB-score and F-Score M1 models. No measure of statistical significance is reported since individual variables’ predictive abilities are not of interest in this study.
Column groups, left to right: F-score M1; F-score M2; F-score M3 (each with Reported, 1979-2002, and 1979-2011 columns).
Panel B: AB-score and ABF-score Models
1979-2002 1979-2011 1979-2002 1979-2011
RSST_Accr 0.70 0.74
Chg_Rcv 2.35 2.27
Chg_Invt 1.83 0.72
Pct_SoftA 2.24 1.81
Chg_CashSales 0.13 0.11
Chg_ROA -1.07 -0.58
Issue 0.76 1.08
B_Raw 1.03 1.47 1.23 1.71
B_Input 0.40 0.63 0.49 0.66
B_Year -2.19 -3.29 -2.60 -3.70
Intercept -9.38 -10.72 -12.31 -13.75
Column groups, left to right: AB-score (1979-2002, 1979-2011); ABF-score (1979-2002, 1979-2011).
Table 6: Comparison of model scope
Model         AAER firm-years   non-AAER firm-years   AAER firm-years   non-AAER firm-years   AAER firm-years   non-AAER firm-years   Total firm-years
F-score M1 494 132,967 492 132,139 977 186,435 187,412
F-score M2 449 122,366 450 117,714 899 164,886 165,785
F-score M3 353 88,032 419 93,055 843 126,232 127,075
AB-score -- -- 697 212,902 1,336 295,309 296,645
ABF-score -- -- 492 132,139 977 186,435 187,412
This table reports how many AAER and non-AAER firm-years each of the models can be estimated over given data availability from Compustat and CRSP databases. The columns labeled DGLS 1979-2002 report the number of observations included in the F-score analyses in Table 7 of DGLS. The columns labeled Replication 1979-2002 report the number of observations included when we re-estimate the F-score models over the DGLS sample period, following the DGLS sample selection procedure and variable definitions as closely as possible. The columns labeled 1979-2011 report the number of observations included when we estimate each model over the full sample period.
Column groups, left to right: DGLS 1979-2002; Replication 1979-2002; 1979-2011.
Table 7: In-sample comparisons of model accuracy
Panel A: All models, full sample
Model        Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error   Unclassified
AB-score # Observations 296,645 973 149,269 146,040 363 0
% Sample 72.8% 50.6% 49.5% 27.2% 0%
F-score # Observations 187,412 678 111,128 75,307 299 109,233
% Sample 69.4% 59.6% 40.4% 30.6% 36.8%
ABF-score # Observations 187,412 683 113,751 72,684 294 109,233
% Sample 69.9% 61.0% 39.0% 30.1% 36.8%
Panel B: AB-score model, observations not estimated by F-score and ABF-score
Sample                       Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error
All non-overlapping # Observations 109,233 264 55,550 53,324 95
% Sample 73.5% 51.0% 49.0% 26.5%
Financial non-overlapping # Observations 51,373 123 28,354 22,844 52 47.03%
% Sample 70.3% 55.4% 44.6% 29.7%
Non-financial non-overlapping # Observations 57,860 135 28,220 29,456 49 52.97%
% Sample 73.4% 48.9% 51.1% 26.6%
This table summarizes model accuracy for the AB-score, F-score (M1), and ABF-score models. Each observation’s odds of being a misstated firm-year are determined from model coefficients, and an observation is predicted to be misstated if its odds ratio exceeds the 1.0 threshold. Panel A reports results for the entire sample over which each model can be estimated. Panel B reports results for the firm-year observations which the AB-score model can estimate but the F-score and ABF-score models cannot because the firms are financial firms or required firm/year data are missing in Compustat. For each model and sample, observations are tallied in four bins. The total number of firm-year observations that can be estimated by each model is reported in the column labeled Sample firm-years . Observations correctly identified as misstated firm-years are tallied under the Correct AAER firm-years column; those correctly identified as not misstated, under the Correct non-AAER firm-years column; those mistakenly identified as misstated (false positives), under Type I Error ; those mistakenly identified as not misstated (false negatives), under Type II Error ; and those Compustat observations which cannot be estimated by the model, under Unclassified . % Sample divides the number of observations within a bin by the number of observations in that category over which each model is estimated (or by number of firm-year observations in Compustat, in the case of Unclassified ).
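The classification rule in this table reduces to a threshold comparison on each firm-year's odds ratio. A minimal sketch (the odds ratios and labels below are made up; in the paper they come from the fitted AB-score, F-score, or ABF-score models):

```python
def tally(odds_ratios, is_aaer, threshold=1.0):
    """Predict 'misstated' when the odds ratio exceeds the threshold, and tally
    the four outcome bins used in Tables 7 and 8."""
    bins = {"correct_aaer": 0, "correct_non_aaer": 0, "type_i": 0, "type_ii": 0}
    for odds, actual in zip(odds_ratios, is_aaer):
        predicted = odds > threshold
        if predicted and actual:
            bins["correct_aaer"] += 1        # true positive
        elif not predicted and not actual:
            bins["correct_non_aaer"] += 1    # true negative
        elif predicted:
            bins["type_i"] += 1              # false positive
        else:
            bins["type_ii"] += 1             # false negative
    return bins

print(tally([1.42, 0.96, 2.20, 0.80], [True, True, False, False]))
```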
Table 8: Out-of-sample comparisons of model accuracy
Panel A: Random out-of-sample simulations, Threshold = 1.0

Model       Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error   Unclassified
AB-score   # Observations  148,322  485  74,703  72,949  184  0
           % Sample  72.5%  50.6%  49.4%  27.5%  0.0%
F-score    # Observations  93,701  359  50,903  42,310  129  54,621
           % Sample  73.5%  54.6%  45.4%  26.5%  36.8%
ABF-score  # Observations  93,701  359  52,771  40,442  129  54,621
           % Sample  73.6%  56.6%  43.4%  26.4%  36.8%

Panel B: Random out-of-sample simulations, Threshold = 0.7

AB-score   # Observations  148,322  590  42,594  105,058  79  0
           % Sample  88.2%  28.9%  71.2%  11.8%  0.0%
F-score    # Observations  93,701  425  35,442  57,770  63  54,621
           % Sample  87.2%  38.0%  62.0%  12.8%  36.8%
ABF-score  # Observations  93,701  415  38,644  54,568  73  54,621
           % Sample  85.0%  41.5%  58.5%  15.1%  36.8%

Panel C: Random out-of-sample simulations, Threshold = 1.3

AB-score   # Observations  148,322  316  107,787  39,866  353  0
           % Sample  47.3%  73.0%  27.0%  52.7%  0.0%
F-score    # Observations  93,701  285  64,192  29,021  203  54,621
           % Sample  58.5%  68.9%  31.1%  41.5%  36.8%
ABF-score  # Observations  93,701  306  64,034  29,179  182  54,621
           % Sample  62.7%  68.7%  31.3%  37.3%  36.8%
This table summarizes the results of out-of-sample tests of the AB-score, F-score, and ABF-score models. In each panel, half of the observations are randomly selected into the estimation period and the other half constitute the prediction period; the procedure is repeated 100 times, and mean counts and percentages are reported in the panel. Panel A uses an odds ratio threshold of 1.0; panels B and C use thresholds of 0.7 and 1.3. For each model and sample, observations are tallied in four bins. The total number of firm-year observations that can be estimated by each model is reported in the column labeled Sample firm-years . Observations correctly identified as misstated firm-years are tallied under the Correct AAER firm-years column; those correctly identified as not misstated, under the Correct non-AAER firm-years column; those mistakenly identified as misstated (false positives), under Type I Error ; those mistakenly identified as not misstated (false negatives), under Type II Error ; and those Compustat observations which cannot be estimated by the model, under Unclassified . % Sample divides the number of observations within a bin by the number of observations in that category over which each model is estimated (or by number of firm-year observations in Compustat, in the case of Unclassified).
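The split-half design can be sketched as follows. This is a simplified stand-in: instead of re-estimating the model on each estimation half, the sketch reuses a precomputed odds ratio per firm-year, and the rows are hypothetical:

```python
import random

def split_half_rate(rows, n_trials=100, threshold=1.0, seed=7):
    """Average out-of-sample share of AAER firm-years flagged as misstated
    across random half-sample splits. Each row is a dict with 'is_aaer' and a
    precomputed 'odds' ratio (a placeholder for refitting on the other half)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        prediction = shuffled[len(shuffled) // 2:]   # held-out half
        aaer = [r for r in prediction if r["is_aaer"]]
        if aaer:
            rates.append(sum(r["odds"] > threshold for r in aaer) / len(aaer))
    return sum(rates) / len(rates)

rows = ([{"is_aaer": True, "odds": 1.5}] * 4 +
        [{"is_aaer": False, "odds": 0.5}] * 4)
print(split_half_rate(rows))
```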
Table 9: Model prediction overlap
Model AB-score F-score ABF-score
AAER firm-years AB-score - 84.4% 91.8%
F-score 72.4% - 90.0%
ABF-score 79.4% 90.7% -
non-AAER firm-years AB-score - 43.2% 52.5%
F-score 69.5% - 87.3%
ABF-score 86.5% 89.4% -
This table reports the percentage of observations correctly classified as AAER firm-years or non-AAER firm-years by the model (AB-score, F-score, or ABF-score) in each column that are also correctly identified by the model in each row. Only firm-year observations that can be estimated by all three models are included in this analysis. The odds ratio threshold is set at 1.0.
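Each cell of Table 9 is a set computation over correctly classified firm-years. A sketch with hypothetical firm-year identifiers:

```python
def overlap_pct(correct_by_column_model, correct_by_row_model):
    """Percent of firm-years correctly classified by the column model that the
    row model also classifies correctly (arguments are sets of firm-year ids)."""
    if not correct_by_column_model:
        return 0.0
    shared = correct_by_column_model & correct_by_row_model
    return 100.0 * len(shared) / len(correct_by_column_model)

print(overlap_pct({"A-1999", "B-2000", "C-2001", "D-2002"},
                  {"B-2000", "C-2001", "D-2002", "E-2003"}))
```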
Table 10: Ability of models to break each other's ties
Panel A: Breaking F-score ties

                     Tie-breaker: ABF-score                         Tie-breaker: AB-score
Matching method      % AAER ABF-score > non-AAER ABF-score          % AAER AB-score > non-AAER AB-score
One-to-one 61.7% 61.4%
One-to-many, mean 65.6% 64.4%
One-to-many, median 66.7% 65.1%
Panel B: Breaking AB-score ties

                     Tie-breaker: ABF-score                         Tie-breaker: F-score
Matching method      % AAER ABF-score > non-AAER ABF-score          % AAER F-score > non-AAER F-score
One-to-one 69.8% 68.0%
One-to-many, mean 66.6% 66.9%
One-to-many, median 73.9% 73.8%
Panel C: Breaking ABF-score ties

                     Tie-breaker: AB-score                          Tie-breaker: F-score
Matching method      % AAER AB-score > non-AAER AB-score            % AAER F-score > non-AAER F-score
One-to-one           49.9%                                          48.5%
One-to-many, mean    47.6%                                          42.1%
One-to-many, median  48.1%                                          50.3%
This table summarizes each model's ability to discern AAER and non-AAER firm-year observations that the other model could not distinguish. Each AAER firm-year is matched to one or many non-AAER firm-years with similar predicted misstatement likelihoods. Panel A uses F-scores (from the F-score M1 model) as the misstatement likelihood and compares ABF-scores and AB-scores; Panel B uses AB-scores as the misstatement likelihood and compares ABF-scores and F-scores; Panel C uses ABF-scores as the misstatement likelihood and compares AB-scores and F-scores. To be considered a match, non-AAER firm-years must have odds ratios within 0.0005 of the AAER firm-year in question. When one AAER firm-year is matched to all of the non-AAER firm-years with odds ratios within 0.0005, the AAER firm-year odds ratio is compared to either the mean (one-to-many, mean) or the median (one-to-many, median) of all of the matched non-AAER firm-years.
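The matching procedure can be summarized in a few lines of code. The rows below are hypothetical; each is a pair of (primary-model odds ratio, tie-breaker odds ratio):

```python
from statistics import mean, median

def tie_break_success(aaers, non_aaers, tol=0.0005, how="one-to-many, mean"):
    """For each AAER firm-year, match the non-AAER firm-years whose primary-model
    odds ratio lies within tol, then test whether the tie-breaking model scores
    the AAER observation above its matched benchmark."""
    wins = total = 0
    for p_odds, t_odds in aaers:
        matched = [t for p, t in non_aaers if abs(p - p_odds) <= tol]
        if not matched:
            continue
        if how == "one-to-one":
            benchmark = matched[0]               # pair with a single match
        elif how == "one-to-many, mean":
            benchmark = mean(matched)
        else:                                    # "one-to-many, median"
            benchmark = median(matched)
        total += 1
        wins += t_odds > benchmark
    return 100.0 * wins / total if total else 0.0

aaers = [(1.0000, 2.0)]
non_aaers = [(1.0001, 1.5), (0.9999, 1.0), (1.2000, 9.9)]
print(tie_break_success(aaers, non_aaers))
```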
Table 11: High-profile cases of financial misconduct resulting in SEC AAERs
Panel A: AB-score, F-score, and ABF-score by AAER firm-year

AAER ID (year)   Company name   Description   Year   AB-score   F-score   ABF-score
1986  1.42  n/a   n/a
1987  0.96  1.88  1.46
1988  0.90  1.69  1.24
1989  0.90  1.26  0.81
1990  1.05  2.20  1.86
1991  1.02  2.26  1.87
1992  0.80  2.46  1.57
1993  1.00  1.65  1.33
1994  0.83  1.60  1.07
1995  1.19  1.88  1.90
1996  1.27  1.99  2.18
1997  1.51  n/a   n/a
1998  1.80  n/a   n/a

2000  0.89  1.02  0.73
2001  1.05  1.00  0.84

1998  1.94  1.32  2.17
1999  1.56  1.33  1.83
2000  2.70  2.47  5.88

1998  1.60  0.41  0.57
1999  1.91  0.34  0.55
2000  1.68  0.59  0.90

1998  2.09  1.58  2.74
1999  2.44  2.07  4.46
2000  2.15  1.69  3.04
2001  1.74  n/a   n/a
1272(2000)
1678 (2004)
1821(2003)
"[Two executives] granted themselves hundreds of millions of dollars in secret low interest and interest-free loans from the company that they used for personal expenses. They later caused Tyco to forgive tens of millions of dollars they owed"
Tyco International Ltd.
1852 (2003)
"For the last three fiscal years of the scheme, pre-tax income was artificially overstated by nearly one third, an aggregate misstatement of approximately one-half billion dollars"
Cendant Corporation (formerly CUC International)
"WorldCom materially overstated the income it reported on its financial statements by approximately $9 billion"
WorldCom, Inc.
"The fraudulent transactions included the "Raptor" sham hedges used by Enron to avoid earnings write-downs of over $1 billion, the fraudulent "sale" of an interest in Nigerian barges to Merrill Lynch, and "prepay" transactions, which were loans disguised as commodity sales contracts, used by Enron to overstate its cash flows by hundreds of millions of dollars."
Enron Corp.
Enron Oil and Gas Co.
Panel A of this table summarizes ten of the highest profile financial misconduct cases in the 1979-2011 period. AAER identifiers from the SEC are listed in the first column followed by the company name and a quote from the AAER or a related SEC legal release illustrating the magnitude of the transgression. To the right are years affected by the misstatement and odds ratios predicted by the AB-score, F-score (M1), and ABF-score models. Firm-years for which the AB-score, F-score, or ABF-score model cannot be estimated are marked n/a. Panel B summarizes how many firm-years each model is able to estimate and summarizes the success and failure rates of each model in this subset of high-profile fraud cases. Correctly identified % is calculated out of total number of fraud firm-years (57); Type II error is calculated out of number of firm-years estimated by each model.
1999  1.99  1.06  1.78
2000  2.01  1.04  1.75
2001  1.78  1.09  1.60
2002  2.57  0.61  1.19

1999  1.70  n/a   n/a
2000  1.61  n/a   n/a

1992  0.97  0.84  0.64
1993  2.00  0.82  1.41
1994  1.87  0.84  1.32
1995  1.99  0.82  1.39
1996  1.97  0.74  1.21
1997  1.88  0.71  1.11

1992  1.11  0.93  0.88
1993  1.59  0.73  1.01
1994  1.57  1.10  1.55
1995  1.46  1.13  1.54
1996  1.52  1.19  1.67
1997  1.58  1.04  1.46

1998  1.08  n/a   n/a
1999  1.01  n/a   n/a
2000  0.83  n/a   n/a
2001  1.06  n/a   n/a
2002  0.99  n/a   n/a
2003  1.05  n/a   n/a
2004  1.50  n/a   n/a

2000  2.39  2.08  4.61
2001  1.83  1.04  1.57
2002  2.01  0.36  0.57
1999  1.25  n/a   n/a

2000  0.84  n/a   n/a
2001  1.06  n/a   n/a
2002  1.21  n/a   n/a
"HRC systematically overstated its earnings by at least 1.4 billion"
HealthSouth Corp.
2082 (2004)
"Understated its subsidiary debt by $1.6 billion, overstated equity by at least $368 million"
Adelphia Communications Corp.
"The company misreported its net income in [2000, 2001 and 2002] by 30.5%, 23.9% and 42.9% respectively"
Federal Home Loan Mortgage Corporation
2728(2007)
"anticipated restatement of at least an $11 billion reduction of previously reported net income"
Federal National Mortgage Association
2433(2006)
"recognized approximately $3.8 billion of spurious revenue and fraudulently excluded $231 million in expenses"
Qwest Communications International
2613(2007)
2337(2005)
"used netting to eliminate approximately $490 million in current period operating expenses"
Waste Management, Inc.
Waste Management, Inc. Del
2313(2005)
Panel B: Summary of AB-score, F-score, and ABF-score model performance
Model AB-score F-score ABF-scoreABF-score with AB-score fill-in
Firm-years covered 57 41 41 57
# score > 1.0 46 26 31 45
# score < 1.0 11 15 10 12
Mean score 1.50 1.27 1.68 1.56
Median score 1.52 1.10 1.46 1.41
Correctly identified % 80.7% 45.6% 54.4% 78.9%
Type II error % 19.3% 36.6% 24.4% 21.1%
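The Panel B percentages follow directly from the counts. For example, using the AB-score column (57 firm-years covered, 46 above the 1.0 threshold) and the F-score column (41 covered, 26 above):

```python
def rates(covered, above, below, total_fraud_years=57):
    """Correctly identified % is taken over all 57 fraud firm-years; Type II
    error % is taken over only the firm-years the model can estimate."""
    correct_pct = 100.0 * above / total_fraud_years
    type_ii_pct = 100.0 * below / covered
    return round(correct_pct, 1), round(type_ii_pct, 1)

print(rates(57, 46, 11))  # AB-score column of Panel B
print(rates(41, 26, 15))  # F-score column of Panel B
```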
Figure 1: Leading digit frequency
This figure depicts the observed distribution of leading digits in the annual reports of all firms in the Compustat database 1979-2011 compared to the distribution predicted by Benford’s Law.
[Bar chart: frequency (y-axis, 0 to 0.35) of leading digits 1 through 9, comparing the Observed Distribution to Benford's distribution.]
Figure 2: Raw Benford score by number of inputs, industry, year, and firm
Panel A: B_Raw by number of financial statement items
Panel B: B_Raw by industry
This figure depicts the mean raw Benford score (B_Raw), which is the mean absolute deviation of the leading-digit distribution in financial statement numbers from the distribution predicted by Benford's Law, across four dimensions. Panel A shows the relationship between B_Raw and the number of inputs used to compute it. Panel B shows the mean B_Raw for each 2-digit SIC code. Panel C shows the relationship between B_Raw and the year in which the observation is drawn. Panel D shows mean B_Raw by firm.
[Panel A chart: median B_Raw (y-axis, 0 to 25) by number of financial statement items (x-axis, 1 to 241).]
[Panel B chart: median B_Raw (y-axis, 0 to 5) by 2-digit SIC code.]
Panel C: B_Raw by year
[Panel C chart: median B_Raw (y-axis, 2.95 to 3.40) by year, 1979 to 2011.]

Panel D: B_Raw by firm
Figure 3: Receiver operator characteristic (ROC) curves
Panel A: ABF-score versus F-score
The three charts present ROC curves for the AB-score, F-score, and ABF-score models. Each graph plots the True Positive Rate on the y-axis versus the False Positive Rate on the x-axis, for all possible thresholds. Panel A compares the ABF-score model to the F-score model for the firm-year observations for which both scores can be calculated; Panel B compares the ABF-score model to the AB-score model for the firm-year observations for which both scores can be calculated; and Panel C presents the ROC curve for the AB-score model for the firm-year observations for which the AB-score can be calculated but the F-score and ABF-score cannot.
Panel B: ABF-score versus AB-score
Panel C: AB-score for observations where F-score and ABF-score cannot be estimated
Figure 4: Distribution of scores for high-profile financial misconduct firm-years
This figure plots the frequency distribution of the firm-year AB-scores, F-Scores, and ABF-scores for the ten notorious misstatement cases, a total of 57 firm-year observations. The grouping labeled n/a reflects firm-year observations for which F-scores and ABF-scores could not be calculated.
[Grouped bar chart titled "Frequency Distribution of Firm-year Scores": number of firm-years (y-axis, 0 to 35) in each score range (< 0.7, 0.7 to 1.0, 1.0 to 1.3, > 1.3, n/a) for the AB-score, F-score, and ABF-score.]