Catch Me If You Can:
Improving the Scope and Accuracy of Fraud Prediction
Bidisha Chakrabarty, Pamela C. Moulton, Leo Pugachev, and Frank Wang*
July 24, 2018
*Chakrabarty ([email protected]) and Wang ([email protected]) are at Saint Louis University; Moulton ([email protected]) is at Cornell University; Pugachev ([email protected]) is at the University of Oklahoma. We thank Dan Amiram, Attila Balogh, Scott Duellman, Quentin Dupont, Jonathan Karpoff, Dave Michayluk, Mark Nigrini, Ethan Rouen, Rik Sen, Wing Wah Tham, Wayne Thomas, and seminar participants at University of Queensland and University of New South Wales for helpful comments.
Abstract
We propose a parsimonious metric – the Adjusted Benford score (AB-score) – to improve the
detection of financial misstatements. Based on Benford’s Law, which predicts the leading-digit
distribution of naturally occurring numbers, the AB-score estimates a given firm-year’s likelihood
of financial statement manipulation, compared to its peers and controlling for time-series trends.
The AB-score requires less data than the leading accounting-based misstatement metric (the F-
score) and can be computed for many more firm-years, including financial firms. For firm-years
with all data available, combining the AB-score and F-score variables in one model yields higher
accuracy in predicting misstatements in- and out-of-sample.
Keywords: Fraud, Accounting quality, Benford’s Law, F-score, AAERs, Earnings manipulation,
Earnings misstatement
JEL Classification: G20, G23, M41
1. Introduction
Financial fraud is difficult to predict because the perpetrators enjoy an informational advantage
over victims and investigators. Globally, organizations lose about 5% ($3.5 trillion) of their annual
revenues to fraud.1 Fraud victimizes shareholders (Karpoff, Lee, and Martin, 2008), affects lenders
(Fulghieri, Strobl, and Xia, 2014), damages the reputation of directors (Fich and Shivdasani, 2007)
and auditors (Skinner and Srinivasan, 2012), and ties up the resources of investigative agencies.
Given the wide reach of fraud, it is not surprising that considerable effort is devoted to fraud
detection and prediction.
We offer a new, parsimonious metric to detect financial reporting irregularities, such as
earnings management, manipulation, and/or misstatement.2 This metric is easy to compute and
requires fewer inputs than existing measures, so it can be computed for a wider range of firms,
including financial firms. It performs well in out-of-sample tests and, importantly, increases the
number of firm-years that can be examined by more than 50% compared to metrics that require
specific accounting variables. Our measure is based on the mathematical observation known as
Benford’s Law (Benford, 1938), which predicts the frequency of each leading digit in a naturally
occurring distribution of numbers (that is, what fraction of numbers should begin with each digit,
1 through 9). For example, in distributions that obey Benford’s Law, the number 1 appears as the
first digit (as in 19 or 168) about 30% of the time, while the number 9 appears as the first digit less
than 5% of the time. Amiram, Bozanic, and Rouen (2015) observe that restated financial
statements more closely adhere to Benford’s Law than the misstated versions in the same year and
that divergence from Benford’s Law can be used to predict material misstatements. Using the
Amiram et al. (2015) findings as a springboard, we conduct a comprehensive investigation into
1 This estimate comes from a survey of Certified Fraud Examiners, who investigated cases between January 2010 and December 2011 and arrived at the estimate by using the 2011 Gross World Product. The Association of Certified Fraud Examiners (ACFE) published the results of the survey in its 2012 Report to the Nations on Occupational Fraud & Abuse, available at http://www.acfe.com/press-release.aspx?id=4294973129. This estimate is comparable to the 3% organizational revenue loss from corporate fraud estimated for the U.S. by Dyck, Morse, and Zingales (2013).
2 We follow the literature in referring to these irregularities as earnings management, manipulation, misreporting, or misstatement (terms that are used interchangeably in the literature), rather than fraud per se. Although Securities and Exchange Commission (SEC) allegations often imply evidence of fraud, firms typically neither admit nor deny guilt when responding to them.
the cross-sectional and time-series properties of firms’ financial statement deviations from
Benford’s Law and propose two new metrics that can be used to identify firm-years with higher
likelihood of misreporting.
We call the first metric the Adjusted Benford score (AB-score), and we build it as follows.
First, we verify that, in aggregate, financial statement numbers in Compustat closely follow the
leading digit distribution predicted by Benford’s Law. We then move to a firm-year level of
analysis by constructing, for each firm-year, a raw score that measures how much the leading digit
distribution of financial statement numbers deviates from the distribution predicted by Benford’s
Law. Our raw score is akin to the Financial Statement Divergence Score of Amiram et al. (2015).
We study how the raw score varies across firms, financial statement length, industry grouping, and
time. Guided by the results of this investigation, we construct several standardized variants of the
raw score and include them in a selection model to predict known cases of financial misstatements,
as identified in the SEC’s Accounting and Auditing Enforcement Releases (AAERs). This
selection process flags one particular combination that has the best predictive ability; we call this
the AB-score model. This model produces an odds ratio, the AB-score, which expresses how likely
a firm's financial numbers are to be misstated in a given year relative to the unconditional
likelihood in the sample.
The main advantage of the AB-score is that it can be computed over every firm-year that has
any financial statement information available. As a result, it can be computed over a wider range
of firm-years than prediction models that require specific financial statement inputs such as
accruals. We choose as our benchmark one of the most comprehensive and popular measures of
earnings manipulation, the F-score of Dechow, Ge, Larson, and Sloan (2011; DGLS henceforth).
The F-score has been shown to be very useful in detecting financial misreporting and is widely
used in the accounting and finance literature.3 DGLS compute the F-score as the predicted
probability of a misstatement (an odds ratio) using fitted values from a model that includes balance
sheet items, nonfinancial measures, off-balance-sheet activities, and market-based measures.
DGLS use the SEC’s AAERs as their misstatement indicator, as do we. We compare the
3 Over 150 studies in the finance and accounting literature use the F-score as a metric of financial misstatement according to a Google Scholar search at the time of writing (February 2018).
applicability of the F-score and the AB-score over the original DGLS sample period (1979-2002)
and our full sample period (1979-2011). Specifically, we compare how many AAER and non-
AAER firm-years can be predicted by each measure based on their data requirements. This is
important because models that require a larger number of inputs to identify misstatements often
have a limited scope due to missing data.4 The AB-score can be calculated for about 61% more
firm-years than the F-score during the DGLS sample period. Similarly, the AB-score can be
calculated for about 58% more firm-years than the F-score during our full sample period. About
47% of the additional firm-years that can be estimated by the AB-score model are for financial
firms, and the remainder are non-financial firms that are missing necessary data in Compustat.
In addition to the AB-score, we create a model that includes the Benford-based variables along
with the variables from the F-score model. We call the output of this combined model the ABF-
score. While the ABF-score is limited in scope to the firm-years for which the F-score variables
are available, it offers the benefit of using leading-digit-based information as well as accounting
information to detect misstatements within the smaller sample.
To assess the benefits of using Benford-based variables in predicting misstatements, we first
examine how many AAER firm-years are correctly predicted in-sample by the AB-score, the F-
score, and the ABF-score, using an odds-ratio threshold of 1.0.5 Thanks to its broader sample
coverage, the AB-score correctly predicts the largest number of AAER firm-years, 973, while the
F-score and ABF-score correctly predict 678 and 683, respectively. As a percentage of the sample
to which each model can be applied, all three models perform well: the AB-score correctly predicts
about 73% of the AAER firm-years it can estimate, versus 69% for the F-score and 70% for the
ABF-score. The AB-score has a lower Type II (false negative) error rate than the other two models,
but it comes at the expense of a higher Type I (false positive) error rate. The ABF-score has lower
Type I and Type II error rates as well as a higher correct AAER firm-year prediction rate than the
4 Missing data is an issue because the convention in finance and accounting studies is to drop observations with missing data. For example, Brazel, Jones, and Zimbelman (2009) show that of the 268 AAERs in their 1998 to 2007 sample, 162 had missing or incomplete data to build accounting-based misstatement measures (see their Table 1).
5 Above the threshold of 1.0, a firm-year is more likely to be misstated than a randomly chosen observation from the sample; see Section 4.3.
F-score. The ABF-score’s advantage holds both with an odds-ratio threshold of 1.0 and when
considering all possible thresholds through receiver operating characteristic (ROC) curves.6 Taken
as a whole, the in-sample tests suggest that the AB-score and ABF-score models can improve the
prediction of earnings misstatements, both in terms of increasing the scope of firms that can be
analyzed (with the AB-score) and boosting the accuracy of the F-score accounting-based metric
(with the ABF-score).
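The ROC comparison sweeps a cutoff over each model's odds ratios and records true and false positive rates at every threshold. A minimal sketch of that sweep (illustrative only; variable names are our own, not the paper's code):

```python
def roc_points(scores, labels):
    """True/false positive rates at every cutoff, as in an ROC curve.
    scores: model odds ratios; labels: 1 for AAER firm-years, 0 otherwise."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 0)
        pts.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    return [(0.0, 0.0)] + pts  # the curve starts at the origin
```

A model whose curve rises toward the top-left corner separates AAER from non-AAER firm-years at more thresholds; the area under the curve summarizes accuracy without committing to any single cutoff such as 1.0.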
A common concern about explanatory models is that relationships found in-sample may not
hold out-of-sample. To address this concern, we assess each model’s ability to predict AAER firm-
years out-of-sample using 100 simulations in a random-holdout specification (randomly selecting
half of the observations to estimate the model and testing its predictive ability on the other half of
the observations). The out-of-sample tests confirm the in-sample findings: The AB-score correctly
identifies more AAER firm-years thanks to its broader sample coverage, while as a percentage of
the sample each model can be estimated over, the ABF-score performs best in terms of both correct
prediction rates and error rates at an odds-ratio threshold of 1.0.
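The random-holdout design described above can be sketched generically: in each simulation, half of the observations estimate the model and the other half are scored. In this sketch, `fit` and `score` are placeholders (our own names) for the actual logistic estimation and evaluation:

```python
import random

def random_holdout_eval(X, y, fit, score, n_sims=100, seed=0):
    """Run n_sims random-holdout simulations. fit(X_train, y_train) returns a
    fitted model; score(model, X_test, y_test) returns its out-of-sample
    performance. Returns the list of per-simulation scores."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    results = []
    for _ in range(n_sims):
        rng.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        results.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return results
```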
The AB-score and F-score models approach misstatement detection from very different
perspectives (with the ABF-score combining the two), one rooted in the prevalence of leading
digits and the other capturing specific accounting information. A natural question is whether,
despite these differences, they identify the same misstatement firm-years. To
answer this question, we examine the congruence of the AAER firm-year predictions of the AB-
score, F-score, and ABF-score models. Within the overlap sample, the AB-score (ABF-score)
correctly predicts about 84% (91%) of the AAER firm-years that are correctly predicted by the F-
score, while the F-score correctly predicts about 72% (90%) of the AAER firm-years that are
correctly predicted by the AB-score (ABF-score). Furthermore, we find that the ABF-score is more
successful at correctly identifying AAER firm-years that the F-score is unable to distinguish from
non-AAER firm-years than the F-score is at breaking ABF-score ties. Taken together, these
findings suggest that the ABF-score provides incremental benefit above the F-score.
6 The ROC curve is a diagnostic tool to evaluate the efficacy of a binary model. It plots the true positive rate against the false positive rate for all possible threshold cutoffs.
Finally, we adopt a case-study approach and investigate how the AB-score, F-score, and ABF-
score perform in detecting notorious cases of financial misstatement. We identify ten high-profile
financial misstatement cases during our sample period and calculate the AB-score, F-score, and
ABF-score for each case. Of the 57 AAER firm-years in this sample with Compustat data
available, the AB-score classifies 46 (81%) as likely to be misstated. The F-score and ABF-score
can be computed for only 41 firm-years, of which the F-score classifies 27 (66%) and the ABF-
score classifies 31 (76%) as likely misstated. Furthermore, the AB-score and ABF-score provide
stronger signals of financial misconduct: The average AB-score in this sample is 1.50 and the
average ABF-score is 1.56, implying that these firm-years are about 1.50 to 1.56 times as likely to
be misstated as the average observation. In contrast, the average F-score for these firm-years is
1.27. Overall, the AB-score and ABF-score provide sharper identification of likely misreporting
behavior in this sample of notorious cases.
Our primary contributions to the literature are the new AB-score and ABF-score models. The
AB-score provides a reliable metric for detecting potential misstatements in a much broader set of
firms than the leading F-score metric. Of particular note is the fact that unlike the F-score (and the
ABF-score), the AB-score can be applied to financial firms. In examining the relationship between
managerial compensation and risk in financial firms, Cheng, Hong, and Scheinkman (2015) note
that the conduct of financial firms such as Bear Stearns, Merrill Lynch, AIG, and Lehman Brothers
during the financial crisis underscores the importance of bringing greater scrutiny to their reporting
activities. The AB-score allows such scrutiny. Furthermore, much of the research in misstatement
prediction relies on databases of firms that have been caught. Karpoff, Koester, Lee, and Martin
(2017) find that databases of firms identified as having engaged in misstatement (i.e., ex-post
samples) have several systematic biases that are economically meaningful; predictive models such
as the AB-score and ABF-score alleviate such biases. Given the small sample characteristics and
biases in ex-post misconduct samples, Amiram, Bozanic, Cox, DuPont, Karpoff, and Sloan (2017)
highlight the need for “more robust, and possibly yet-to-be discovered, techniques and
methodologies” for fraud-related research. Our study directly addresses this call.
For non-financial firms with the necessary financial statement variables available, the ABF-
score encompasses both the financial intuition of the F-score variables and the leading-digit
detection of the AB-score, producing a metric with a higher correct classification rate and lower
error rates than the F-score and the AB-score, both in-sample and out-of-sample. Our
recommendation is that researchers use the ABF-score for firm-years with all the necessary
financial statement data available and use the AB-score for financial firms and any other firms
lacking the necessary data. To this end, we will share our programs for constructing the AB-score and
the ABF-score with interested researchers and make them publicly available at a later date.
The remainder of the paper is organized as follows. Section 2 discusses the related literature.
Section 3 describes our data and develops the AB-score model. Section 4 develops the ABF-score
model and tests the models’ abilities to detect misstatements in-sample and out-of-sample, and
Section 5 examines high-profile misstatement cases. Section 6 concludes.
2. Related literature
2.1 Benford’s Law and financial statement numbers
The original research establishing that there is a predictable frequency with which leading
digits occur in a natural distribution began with astronomer Simon Newcomb (1881), who noticed
that books of logarithm tables were generally more worn on the early pages than toward the back.
People seemed to look up numbers beginning with the digits 1 and 2 far more often than they
looked up numbers beginning with the digits 8 and 9. Newcomb later sketched a proof that
numbers beginning with 1 and 2 actually occur more often in nature than numbers beginning with
8 and 9. His proof shows that a randomly selected number should begin with the digit 1 about
log10(2) or 30.1% of the time, the frequency of numbers with leading digit 2 should be log10(3/2)
or about 18%, those with leading digit 3 should be log10(4/3) or about 12%, and so on until the
frequency of 8’s should be 5.1% and that of 9’s should be 4.6%. In general, the probability with
which the leading digit (d) should appear in a distribution of numbers is:
P(d) = log10(1 + 1/d),        (1)
where d = 1, 2, … , 9, and P is the probability associated with that number’s appearance in the
data. Fifty-seven years after Newcomb’s work, physicist Frank Benford rediscovered the property
and did extensive work to provide a more rigorous mathematical underpinning. Benford found
support in over 20,000 entries from 20 different sources, including data on river surface areas,
populations, specific heats of chemical compounds, American League baseball statistics, and
numbers obtained from newspaper and Reader’s Digest articles.7 Since Benford’s (1938) article
gained widespread attention while Newcomb’s (1881) work had been somewhat overlooked, the
law became known as Benford’s Law.
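The probabilities in Equation (1) are easy to verify numerically; a minimal sketch (function name our own):

```python
import math

def benford_prob(d: int) -> float:
    """Benford's Law: probability that d (1-9) is the leading digit, log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

probs = {d: benford_prob(d) for d in range(1, 10)}
# Digit 1 occurs about 30.1% of the time, digit 9 about 4.6%, and the nine
# probabilities sum to exactly 1 (the log10 of a telescoping product, 10/1).
```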
Benford’s Law has been used to investigate data-related irregularities in settings as disparate
as political elections (Klimek, Yegorov, Hanel, and Thurner, 2012), religious activity (Mir, 2014),
and volcanology (Geyer and Marti, 2012). Varian (1972) was an early champion for the use of
Benford’s Law in the social sciences. In an accounting context, Nigrini (1999) shows that
deviations of financial statement numbers and tax-related data from the prediction of Benford’s
Law can be useful to flag cases for further scrutiny. Durtschi, Hillison, and Pacini (2004) show the
value of Benford’s Law as a signaling device to identify accounts more likely to involve
misstatement, thus improving on the random selection process auditors employ to assess the
validity of a firm’s reported numbers. Deviations from Benford’s Law in financial statements
appear to vary over time: Wang (2011) finds an increase in deviations from 1960 to 2011.
In recent work related to financial misconduct, Amiram et al. (2015) use Benford’s Law to
create their Financial Statement Divergence (FSD) Score.8 Our study extends Amiram et al.’s
(2015) work in several ways. First, we show that the relationship they find in a sample of 73 AAER
firm-year observations holds for the universe of 1,336 observations. Second, we refine the FSD
score to account for time-series variation and variation caused by the number of inputs. While
Amiram et al. (2015) use only financial statements that include more than 100 inputs, our sample
contains the universe of Compustat observations with any non-missing financial statement data.
Most importantly, while Amiram et al. (2015) document the relationship between AAERs and
Benford’s Law, we launch an extensive investigation into how to best convert that relationship
7 Benford (1938), Table I.
8 Bowler (2017) and Boyle and Lewis-Western (2018) test the use of the FSD score in an audit setting.
into information that researchers and auditors can use to detect misstatement. Finally, we test our
measures both in- and out-of-sample alongside the leading accounting-based misstatement
measure and investigate the advantages of combining the two approaches.
2.2 The F-score and other metrics used in financial misconduct research
There is no single definition of what constitutes financial misconduct, so there are multiple
approaches used to gauge it. Some studies adopt the strictest definition of financial misconduct:
fraud. These studies are based on small, often hand-collected samples of firms that are sanctioned
for fraud. One such example is Brazel, Jones, and Zimbelman (2009), who begin with a sample of
AAERs and then go through each release to determine whether fraud is established, cross-checking
against other sources. Other studies use broader indications of financial misconduct, including
direct measures such as restatements that arise from U.S. GAAP violations (Burns and Kedia,
2006) and indirect measures such as total accruals (Bayley and Taylor, 2007), earnings
management (Beneish, 1999), and options back-dating (Bernile and Jarrell, 2009).
One of the most advanced and widely used measures to detect financial statement manipulation
is the F-score developed by DGLS (F is for “fudging,” according to one of the authors). DGLS
compile a database of financial misstatements by hand-collecting information in the SEC’s
AAERs, noting whether the firm or employees were named in the AAER and whether the
wrongdoing was related to overstated earnings (understatement of earnings is more likely to be an
unintentional mistake).9 Using this database, DGLS develop a prediction model which provides
the F-score, a scaled probability that can be used to estimate the likelihood of earnings
misstatement.
The F-score is generated from a model that analyzes financial statement data, combining
several accounting variables that have been used in previous studies to signal earnings
management or financial misreporting. DGLS present three such models in decreasing order of
parsimony. The first includes several measures of accruals quality and discretionary accruals. To
gauge whether diminishing firm performance prompts misreporting, it includes annual changes in
9 DGLS make their AAER data available to other researchers to promote research on earnings misstatements.
return on assets and cash sales. To capture financing activities, it includes debt or equity
issuance, and because soft assets may be easier to manipulate, the model includes the ratio of soft
assets to total assets. The second model adds to these variables the abnormal change in the number
of employees because firms may try to boost short-term earnings by cutting employee headcount.
It also adds operating lease activities because leases can be used to frontload earnings. The final
model adds current and lagged market-adjusted returns, because firms may misstate to compensate
for poor performance. DGLS demonstrate that by including financial statement and market
information beyond accruals, the F-score offers a robust approach to detecting misstatements.
A large number of studies use the DGLS F-score in misstatement-related research. For
example, Fang, Huang, and Karpoff (2016) use the F-score to document how short selling, or its
prospect, curbs earnings management. Jia, Van Lent, and Zeng (2014) use the F-score to examine
male CEOs’ facial masculinity and financial misreporting. Bradley, Gokkaya, Liu, and Xie (2017)
use the F-score to gauge the ability of analysts to detect firms engaging in financial misreporting
activities. To test how firm-initiated clawbacks reduce accounting manipulation, Chan, Chen, and
Chen (2013) use the F-score as a metric of financial statement manipulation. DeFond, Lim, and
Zang (2015) use the F-score to assess which client firms present greater engagement risk for
auditors.
Given the widespread use of the F-score in financial misconduct research, we believe it is the
most useful benchmark against which to test the AB-score. In a recent study, Perols, Bowen,
Zimmermann, and Samba (2017) propose other potential benchmarks. They investigate three data
analytic techniques and show that two of these outperform the F-score in detecting AAER firm-
years in a limited sample. We choose the F-score as our benchmark because the Perols et al. (2017)
models have been tested on only a small sample of AAER firm-years (51 out of nearly 1,400),
while the F-score model has been tested more broadly including out-of-sample.
3. Development of the AB-score model
3.1 Data and sample
We use two main data sources for this study. For our financial statement data, we use all
Compustat variables that appear in the balance sheet, income statement, and statement of cash
flows, as in Amiram et al. (2015).10 We obtain data on the SEC-issued AAERs from the Center for
Financial Reporting and Management (CFRM) at University of California, Berkeley.11 Our full
sample period is 1979 – 2011; because AAERs are issued with a lag relative to alleged
misstatement years, we use AAERs issued through 2014 to identify misstatements through 2011.
To facilitate comparisons with the results of DGLS, we also examine their sub-period of 1979 –
2002.
The AAER dataset documents firms that are issued accounting and auditing enforcements by
the SEC at the conclusion of an investigation against the firm, an auditor, or an officer for alleged
accounting and/or auditing misconduct. These releases provide details on the nature of the
misconduct, the individuals and entities involved, and its effect on the financial statements. We
begin with the 1,383 AAERs issued in our sample period, covering 1,909 firm-years. Because our
study focuses on financial misstatements, we filter out 403 actions that do not allege misstated
annual financials. We further eliminate 224 AAERs in which the recipient or misstatement year
cannot be precisely determined. We delete seven AAERs that allege earnings understatement
(instead of inflated earnings) to facilitate comparison between our prediction model and that of
DGLS.12 We lose 189 observations when merging with Compustat data, yielding a final sample of
1,336 distinct firm-years covering 578 AAERs issued to 577 firms.13
Each AAER alleges at least one year of misstated financials, and many identify multiple
consecutive misstated years per firm. Our sample contains AAERs that allege financial
misstatement ranging from one to 16 years.14 Table 1 summarizes the distribution, showing that
10 Because our goal is to predict as many firm-years as possible, we do not require that a firm-year have a minimum number of line items to be included. Amiram et al. (2015) point out that their results are robust to including firm-years with fewer than 100 line items, and our results are robust to excluding firm-years with fewer than 100 line items. 11 We thank Dechow, Ge, Larson, and Sloan for making these data available. The data collection procedure for the AAER dataset is described in detail in DGLS. 12 Including the seven understatement AAERs strengthens our results in further analyses. 13 In our sample period, Time Warner AOL receives AAERs related to two separate material accounting misstatements. The first alleges 1995-1996 financials to be misstated, and the second relates to 2000-2002 financials. 14 In 16 cases, a single AAER from the CFRM database alleges misstatement over a non-contiguous time horizon. We compute these AAERs’ durations as the difference between the first and last AAER-year rather than treating each as multiple AAERs with shorter durations.
the mean (median) AAER in our sample alleges 2.31 (2.00) years of financial misstatement.
[Table 1 here]
We establish the applicability of Benford’s Law to the Compustat universe by examining the
leading digit distribution of all non-missing financial statement variables in Compustat over our
sample period. Each firm-year must have at least one non-missing financial statement variable in
Compustat to be included.15 Table 2 presents the results.
[Table 2 here]
In the full Compustat sample, there are 10.73 million numbers that begin with the leading digit
1 and 1.56 million beginning with the leading digit 9. As a percentage of the total (34.88 million
numbers), the leading digit 1 appears with a frequency of 30.76% (10.73/34.88) while numbers
with a leading digit of 9 appear with a frequency of 4.47% (1.56/34.88). The comparable
predictions from Benford’s Law in Equation (1) are 30.10% for leading digit 1 and 4.58% for
leading digit 9. The mean absolute deviation of the observed distribution from the predicted
distribution of leading digits is 0.1580%. We scale this by 100 and arrive at the raw Benford score
(B_Raw score) of 0.1580. Figure 1 shows the close fit of the aggregate leading digit distribution
in our sample to the distribution predicted by Benford’s Law.
[Figure 1 here]
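Our reading of the B_Raw construction (mean absolute deviation of observed leading-digit frequencies from the Benford prediction, scaled by 100) can be sketched as follows; the helper names are our own, and zeros and missing values are dropped since they have no leading digit:

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number, e.g. 168 -> 1, 0.05 -> 5."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation puts the digit first

def b_raw(values):
    """Mean absolute deviation of observed leading-digit frequencies from
    Benford's Law, scaled by 100 (our reading of the paper's B_Raw score)."""
    digits = [leading_digit(v) for v in values if v is not None and v != 0]
    if not digits:
        raise ValueError("no nonzero financial statement numbers")
    counts = Counter(digits)
    n = len(digits)
    return 100 * sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in BENFORD) / 9
```

Under this reading of the scaling, applying the same formula to the aggregate Compustat frequencies above yields the 0.1580 magnitude reported in the text.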
3.2 Adjusting the Raw Benford Score
Although the overall distribution of leading digits in financial statement numbers in Compustat
closely follows the distribution predicted by Benford’s Law, at the firm-year level there is
significant variation (Amiram et al., 2015). To examine this variation, we calculate the B_Raw
score for each firm-year.
Our goal is to assess how the B_Raw score behaves in the cross-section and over time so that
we can fine-tune its usefulness as a predictor of earnings misstatement. We first consider financial
statement length. Bowler (2017) shows that the Benford score is vulnerable to continuity frictions
when a smaller pool of numbers is used to compute it. For example, Benford's Law states that the
15 Restricting the sample to firm-years with at least 100 variables, as in Amiram et al. (2015), yields identical inference.
leading digit nine should appear approximately 4.6% of the time. For a firm-year with 50
Compustat numbers, the expected count is 2.3 occurrences, so a mechanistic deviation arises
whether the observed count is two or three. Furthermore, individual line items' leading-digit
deviations constitute a larger percentage of the total when there are fewer line items. Thus we expect a
mechanistic, negative relationship between the number of line items and the B_Raw score for a
firm. In Panel A of Figure 2 we plot the average B_Raw score for each firm-year against the number
of inputs (line items) used to compute the score; the graph shows a clear negative slope. This result
suggests that a simple comparison of the B_Raw scores of two firms to proxy for relative
likelihoods of financial statement manipulation may be misleading if the firms’ financial
statements are of different lengths.
[Figure 2 here]
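The mechanistic link between statement length and the raw score can be illustrated by simulation: even digits drawn exactly from Benford's Law show larger deviations in smaller samples. A sketch (our own illustration, not the paper's analysis):

```python
import math
import random

BENFORD_P = [math.log10(1 + 1 / d) for d in range(1, 10)]

def simulated_b_raw(n_items: int, rng: random.Random) -> float:
    """B_Raw-style score for n_items leading digits drawn from Benford's Law."""
    counts = [0] * 9
    for _ in range(n_items):
        d = int(10 ** rng.random())  # 10**u for u ~ U[0,1) has Benford leading digits
        counts[d - 1] += 1
    return 100 * sum(abs(c / n_items - p) for c, p in zip(counts, BENFORD_P)) / 9

rng = random.Random(0)
avg_short = sum(simulated_b_raw(50, rng) for _ in range(200)) / 200   # ~50 line items
avg_long = sum(simulated_b_raw(500, rng) for _ in range(200)) / 200   # ~500 line items
# avg_short exceeds avg_long even though nothing is manipulated, which is
# why B_Raw must be adjusted for the number of inputs.
```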
We next examine whether the B_Raw score varies across industries, motivated by the fact that
research on earnings management, discretionary accruals, and financial reporting quality generally
controls for industry classification (e.g., Bergstresser and Philippon, 2006). Panel B of Figure 2
shows that the B_Raw scores in our sample exhibit moderate heterogeneity across two-digit SIC
industries. The third dimension we examine is how the B_Raw score behaves over time, since
academic research and the popular press report that financial misconduct is more concentrated in
certain periods. For example, at the turn of this century, the dot-com bust was followed by the
revelation of several financial scandals including Enron, Tyco, and WorldCom, prompting the
expansive Sarbanes-Oxley Act to strengthen existing financial disclosure rules and mandate new
ones. Panel C of Figure 2 shows that the B_Raw score varies over time, with a sharp peak around
the dot-com bubble. Finally, we compute a firm-level B_Raw score to examine how much firms’
average B_Raw scores vary from one another. Panel D of Figure 2 shows there is significant firm-
level heterogeneity in B_Raw scores.
The results of this examination suggest that the B_Raw score for a firm-year should be adjusted
to account for predictable cross-sectional and time-series variations if it is to be compared across
firms and over time. We calculate four such adjusted measures, where each adjusts for baseline
differences in one of the four dimensions examined above (number of inputs, year, industry, and
firm). The four adjusted Benford score measures are:
B_Input adjusts the B_Raw score for the number of inputs used in its computation.
Within each year, observations are sorted into 20 bins by how many non-missing
financial statement numbers they contain.16 We compute the average B_Raw score for
each bin and the standard deviation of B_Raw score within that bin. For each
observation, we subtract the average B_Raw score of its bin and divide by that bin’s
standard deviation.
B_Industry subtracts from each firm-year’s B_Raw score that industry-year’s mean
B_Raw score and divides by the industry-year’s standard deviation of B_Raw score.
B_Year subtracts from each firm-year’s B_Raw score that year’s mean B_Raw score
and divides by the standard deviation of B_Raw scores, calculated across all firms in
that year.
B_Firm subtracts from each firm-year’s B_Raw score that firm’s cumulative (prior to
that year) mean B_Raw score and divides by the firm’s cumulative standard deviation
of B_Raw scores.17
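All four adjustments share the same recipe: subtract a group mean and divide by that group's standard deviation. A minimal pandas sketch, assuming hypothetical column names (`B_Raw`, `year`, `sic2`, `n_items`):

```python
import pandas as pd

def group_standardize(df, score_col, group_cols):
    """Subtract the group mean of the score and divide by the group
    standard deviation -- the common recipe behind B_Input, B_Year,
    and B_Industry (column names here are illustrative)."""
    g = df.groupby(group_cols)[score_col]
    return (df[score_col] - g.transform("mean")) / g.transform("std")

# Illustrative usage on hypothetical columns:
# df["B_Year"]     = group_standardize(df, "B_Raw", ["year"])
# df["B_Industry"] = group_standardize(df, "B_Raw", ["sic2", "year"])
# For B_Input, first sort observations into 20 bins of the line-item
# count within each year, then standardize within year-bin:
# df["input_bin"] = df.groupby("year")["n_items"].transform(
#     lambda s: pd.qcut(s, 20, labels=False, duplicates="drop"))
# df["B_Input"]   = group_standardize(df, "B_Raw", ["year", "input_bin"])
```

B_Firm differs only in that the mean and standard deviation are cumulative, computed over each firm's prior years rather than over a contemporaneous group.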
3.3 Building the AB-score model
We test whether the B_Raw score and the four adjusted measures (B_Input, B_Year,
B_Industry, and B_Firm) can be used to predict material financial misstatements as proxied by
AAER firm-years. We estimate a logistic regression as in Shumway (2001), where the dependent
variable, AAERi,t, is an indicator that assumes the value one if the SEC released an AAER alleging
that firm i’s financials in year t are misstated, zero otherwise. We estimate a model with all five
measures together and a model with measures chosen via a backward elimination technique,
beginning with all of the variables and then using the computational algorithm of Lawless and
16 By normalizing by bins within each year, we implicitly adjust for any trends in the number of line items reported in financial statements over time. Bloomfield (2012) suggests that firm disclosures have been increasing over time because of regulatory requirements.
17 We require data over the prior two years to compute a firm’s cumulative standard deviation of B_Raw scores.
Singhal (1978) as a basis for removing variables, as in DGLS.18 The logistic regressions take the
following form:
\[ AAER_{i,t} = \alpha + \sum_{j=1}^{k} \beta_j \, Benford\_Measure_{j,i,t} + \epsilon_{i,t} \qquad (2) \]
where k = the number of variables included and Benford_Measurei,t is B_Raw score, B_Input,
B_Year, B_Industry, or B_Firm.
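Equation (2) is a pooled logit. A sketch of estimating it with simulated data and scikit-learn (the data-generating process and variable roles are illustrative, not the paper's actual data or software):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-ins: columns of X play the roles of Benford measures
# (e.g., B_Raw, B_Input, B_Year); y flags AAER firm-years.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 1 / (1 + np.exp(-(X[:, 0] - 2)))).astype(int)

# Pooled logit as in equation (2); a large C approximates unpenalized MLE.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
print(model.intercept_, model.coef_)
```

Backward elimination would then drop the least informative measures one at a time, refitting at each step, until only significant predictors remain.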
[Table 3 here]
Specification (1) in Table 3 includes all five variables, while specification (2) includes only
the three explanatory variables B_Raw, B_Input, and B_Year, which are chosen via backward
elimination. Specification (2) in Table 3 has the higher predictive power. In this model the
individual coefficients are not of primary interest, since they suffer from multicollinearity; our
interest going forward is the model’s ability to predict AAER firm-years. The estimates in
specification (2) indicate that B_Input and B_Year are significant and have incremental power over
B_Raw in explaining AAERs. We call this specification the Adjusted Benford score (AB-score)
model. In the following sections we examine the scope, accuracy, and predictive power of the AB-
score.
4. Testing the performance of the AB-score
In this section we first replicate the DGLS F-score models and develop an additional model,
the ABF-score model, which combines the AB-score and F-score variables into a single model.
Second, we examine the scope of each model, comparing how many AAER and non-AAER firm-
years each model can be applied to given its data requirements. Third, we compare the in-sample
performance of the three models, followed by formal out-of-sample tests for model evaluation.
Finally, we examine the overlap between the three models’ predictions.
4.1 Replication of F-score model and development of ABF-score model
Testing the efficacy of any metric requires a benchmark: What are we comparing it to? To
18 Using forward or stepwise methods, instead of backward, yields the same set of variables for inclusion.
examine the performance of the AB-score model, we choose the F-score model of DGLS as the
benchmark because of its prominence in the literature. The F-score integrates disparate warning
signals of financial misreporting into a comprehensive measure, the odds that a firm is “cooking
the books.”
DGLS predict the issuance of AAERs using three models in decreasing order of parsimony.
First, they use backward selection to build a baseline model with seven predictors: (1) change in
noncash net operating assets (RSST_Accruals), (2) change in receivables (Chg_Rcv), (3) change
in inventory (Chg_Invt), (4) percent soft assets (Pct_SoftA), (5) change in cash sales
(Chg_CashSales), (6) change in return on assets (Chg_ROA), and (7) an indicator equal to 1 if the
firm issued debt or equity during that year, 0 otherwise (Issue). Their second specification adds
(8) abnormal change in employees (Abn_Chg_Emp) and (9) an indicator equal to 1 if the company
has operating leases, 0 otherwise (OL). Their third model adds (10) market-adjusted stock returns
(MASR) and (11) one-year-lagged market-adjusted stock returns (Lag_MASR). We refer to these
three models as F-score M1, F-score M2, and F-score M3, respectively. Our goal is to test how the
AB-score model performs at AAER and non-AAER firm-year prediction in comparison to the
DGLS F-score models. We first carefully replicate the DGLS estimation and then examine the
incremental explanatory power of the AB-score. To that end, we compute the variables that enter
the DGLS estimation, both over their sample period and over our full sample period, and present
the results alongside the ones reported in the DGLS study.19
[Table 4 here]
Table 4 provides descriptive statistics for the AB-score and F-score variables for the entire
sample (Panel A) and for AAER firm-years (Panel B). All continuous variables are winsorized at
19 We calculate the variables following the description in DGLS (pp. 35-38). Because bank and insurance company financial statements substantially differ from industrials in accrual variables, DGLS exclude the two industries. Following suit, we drop observations with 2-digit SIC codes from 60 to 69 when running their models. However, prediction using our Benford variables makes no distinction by industry. Therefore, we retain all observations when running the AB-score model. Note that we force our AAER sample to match the one used in DGLS. This is important because some AAERs to which DGLS did not have access allege misstatement within their 1979-2002 sample period. DGLS also follow Richardson, Sloan, Soliman, and Tuna (2005) in setting missing Compustat data items 9, 32, 34, 130, and 193 to zero. We follow this approach when computing the F-score but note that not doing so materially reduces the number of observations over which F-score can be computed.
1% and 99%, as in DGLS. For the full sample, mean and median B_Raw scores are slightly over
3, similar to the 2.96 (percent) FSD scores reported in Amiram et al. (2015) for their smaller
sample. Comparing across rows, the means and medians of the F-score variables are fairly closely
replicated in our sample and the DGLS sample. Comparing across the two panels of Table 4, we
find that the B_Raw score is lower in the AAER firm-years (Panel B), consistent with our results
in Table 3 and those reported in Amiram et al. (2015).
To lay the groundwork for comparing the AB-score and F-score models, we next present each
model’s estimated coefficients and compare our replication to the original F-score coefficients in
DGLS. For each model, we run logistic regressions to predict AAER firm-years, where the
dependent variable is a dummy that equals 1 if the observation is an AAER firm-year, 0 otherwise.
We estimate five models. Panel A presents the three F-score models in decreasing order of
parsimony. Panel B presents coefficients from the AB-score model and the ABF-score model,
which includes the three variables from the AB-score model and the seven variables from F-score
M1.20 We estimate each of these models over two time periods, the DGLS sample period (1979-
2002) and our full sample period (1979-2011).
[Table 5 here]
Table 5 presents the results. Comparing the “Reported” column of each of the three F-score
models with the estimates we obtain (under the column labeled “1979-2002”) shows that our
regressions closely reproduce the coefficient estimates reported in each of the three F-score
models.21 The coefficient estimates for 1979-2011, which includes nine additional years, diverge
somewhat from the 1979-2002 estimates but are still similar. Finally, the coefficient estimates on
the F-score variables in the ABF-score model are fairly close to those estimated in our F-score
20 We use F-score M1 in the ABF-score model because it is the most parsimonious of the F-score models, requiring the fewest inputs. Using M2 or M3 would further reduce the ABF-score model’s coverage relative to the AB-score model’s.
21 One likely reason that our replication exercise in Table 5 does not produce perfect matches for the DGLS coefficients is that Compustat backfills historical data (Cohen, Polk, and Vuolteenaho, 2003). The DGLS authors downloaded their data from Compustat sometime before 2011 (their paper’s publication date), and we downloaded our data from Compustat in 2017. In ongoing work, we are repeating the replication exercise using data that were downloaded from Compustat in 2012 to estimate the effects of such backfilling.
model replication as well as the ones reported in the DGLS paper. This exercise supports the
validity of our replication method, paving the way for us to use the F-score as a benchmark for
analyzing the accuracy and effectiveness of the AB-score and ABF-score.
4.2 Comparison of model scope
DGLS show that as input requirements increase, the number of AAER firm-years over which
their models can be predicted decreases monotonically, making a case for model parsimony. In the
same spirit, we begin by examining how many AAER and non-AAER firm-years each of the three
F-score models, the AB-score model, and the ABF-score model can be applied to. Table 6 presents
the results.
[Table 6 here]
The first row shows that for the most parsimonious version of their model, F-score M1, DGLS
report that 494 AAER and 132,967 non-AAER firm-years’ F-scores can be estimated (columns
labeled “DGLS 1979-2002”). We find similar numbers of observations in our replication of their
sample period (columns labeled “Replication 1979-2002”): 492 compared to the 494 AAER firm-
years that DGLS estimate (99.6%) and 132,139 compared to the 132,967 non-AAER firm-years
that DGLS estimate (99.4%). The number of observations that can be estimated drops for F-score
M2 and M3, shown in the second and third rows, as each successive F-score model requires more
inputs. The AB-score model, which has less demanding data requirements, can be estimated for
697 AAER firm-years (41.7% more than F-score M1) and 212,902 non-AAER firm-years (61.1%
more than F-score M1) over the 1979-2002 period; a similar sample expansion occurs over the
longer 1979-2011 period. The ABF-score model has the same observational counts as F-score M1
because both have the same binding input requirements from Compustat.
4.3 In-sample comparisons
The main goal of this paper is to improve the detection of financial misstatements. To that end,
in this subsection we examine each candidate model’s ability to correctly predict AAER and non-
AAER firm-years within-sample; in the following subsection we perform out-of-sample tests. For
comparison we use F-score M1 in this and all subsequent analyses because it requires the fewest
inputs, which gives the F-score model the broadest sample coverage. Each model is
applied to predict AAER and non-AAER firm-year observations, and each observation’s odds of
being a misstated firm-year are determined from that model’s coefficients.
We estimate an observation’s odds of being an AAER firm-year under each model following
the methodology of DGLS. First, using the full sample period 1979-2011, we compute the
unconditional probability that an observation is an AAER firm-year by dividing the number of
AAER firm-years by the number of total firm-years. Next, we obtain the predicted value for the
dependent variable by multiplying the independent variable matrix by the coefficient matrix. We
then determine the conditional probability of an observation being an AAER firm-year by
exponentiating the predicted value (using base e) and dividing by one plus that amount. Finally,
we determine an observation’s odds of being misstated relative to a random observation by
dividing the conditional probability by the unconditional probability.22 The average firm has an
odds ratio of 1; the higher a firm-year’s odds ratio, the higher its probability of misstatements.
We compare each observation’s odds of being an AAER firm-year against a threshold.
Observations with odds greater than or equal to (less than) the threshold are classified as likely
AAER firm-years (non-AAER firm-years). A threshold of 1.0 has an intuitive interpretation: At
odds of 1.0, an observation is as likely to be an AAER firm-year as a random observation pulled
from the sample. Those with odds above (below) are more (less) likely.
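The odds computation and threshold classification described above can be sketched as follows (a minimal illustration of the DGLS recipe; array names are hypothetical):

```python
import numpy as np

def aaer_odds(X, coefs, intercept, y_full):
    """DGLS recipe: logistic (conditional) probability from the fitted
    coefficients, divided by the unconditional AAER rate in the full
    sample, gives each observation's odds of being a misstated
    firm-year relative to a random observation."""
    unconditional = y_full.mean()                # AAER firm-years / total
    xb = intercept + X @ coefs                   # predicted value
    conditional = np.exp(xb) / (1 + np.exp(xb))  # logistic transform
    return conditional / unconditional

def classify(odds, threshold=1.0):
    """Flag observations at or above the odds threshold as likely AAERs."""
    return (odds >= threshold).astype(int)
```

At the default threshold of 1.0, an observation is flagged exactly when the model deems it at least as likely to be misstated as a random draw from the sample.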
Table 7 reports each model’s sample coverage and accuracy. The first column reports the
number of firm-year observations that can be estimated by each model (of the 296,645 firm-year
observations from Compustat in the 1979-2011 period). Observations correctly identified as
misstated firm-years are counted under the Correct AAER firm-years column; those correctly
identified as not misstated, under Correct Non-AAER firm-years; those erroneously identified as
misstated, under Type I Error; and those erroneously identified as not misstated, under Type II
Error. The last column reports how many of the Compustat firm-year observations cannot be
classified by each model (Unclassified).
[Table 7 here]
22 DGLS illustrate this procedure on p. 61.
Panel A presents the results for all three models over the samples for which each can be
estimated, using an odds-ratio threshold of 1.0. The AB-score model correctly identifies 72.8% of
the AAER firm-years, which compares favorably with the 69.4% of AAER firm-years correctly
identified by the F-score model. In terms of the number of AAER firm-years correctly identified,
the AB-score does considerably better (973 versus 678 AAER firm-years for F-score) because of
its broader sample coverage. In addition to correctly identifying true AAER firm-years, we also
care about minimizing the number of false positives (Type I errors), i.e., non-AAER observations
that are erroneously flagged by the model as likely to be misstated, and false negatives (Type II
errors). The F-score model has a lower Type I error rate than the AB-score model, while the AB-
score model has a lower Type II error rate. The Type II error rate, which captures observations
that are mistakenly identified as not misstated, is generally of greater concern to auditors than the
Type I error rate (Carcello, Vanstraelen, and Willenborg, 2009) because auditors are more likely
to be sued for failure to detect misstatements (Bonner, Palmrose, and Young, 1998). Auditors
would suffer more if they gave a green light to misstated financial statements than if they treated
correct financial statements as suspect (in the latter case, in the process of trying to detect the non-
existent errors the auditors would likely discover that the statements were correct). Finally, Panel
A shows that the ABF-score model performs slightly better than the F-score model in-sample, with
a few more AAER firm-years correctly identified (though not as many as the AB-score) and lower
Type I and Type II error rates.
Panel B of Table 7 presents a closer look at the 109,233 firm-year observations that cannot be
predicted by the F-score and ABF-score models. The AB-score model performs well in this subset
overall, and about equally well for the 47% of observations that are financial firms and the 53%
that are non-financial firms missing some data in Compustat. As in the full sample (Panel A), we find
correct AAER prediction rates of over 70% (Type II error rates below 30%) for both subsamples,
suggesting that the AB-score is a good metric for identifying possible misstatements in firms that
cannot be estimated by the F-score model.23
Table 7 applies the intuitive threshold of 1.0, but the relative performance of the models can
vary with the odds ratio threshold chosen. We next construct ROC curves as a more formal test of
the models' predictive power. The ROC curve plots a model’s true positive rate against its false
positive rate across every possible threshold; a higher area under the curve (AUC) indicates that a
model is more effective at distinguishing between positive and negative outcomes when all
thresholds are considered. The AUC can range from 50% (purely random prediction) to 100%
(perfect prediction). An AUC of 60% is generally considered desirable in low-information
environments, while an AUC of 70% is desirable in information-rich environments (Berg, Burg,
Gombovic, and Puri, 2018; Iyer, Khwaja, Luttmer, and Shue, 2016).
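The ROC/AUC machinery can be sketched with scikit-learn on simulated scores (the data below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Simulated scores: the "model" assigns higher scores to true AAERs,
# plus noise, so the AUC should land between 0.5 and 1.0.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(scale=1.5, size=500)

auc = roc_auc_score(y, scores)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y, scores)  # one point per threshold
print(f"AUC = {auc:.3f}")                    # 0.5 = random, 1.0 = perfect

# Iyer et al. (2016)-style comparison of two models' AUCs:
# improvement = (auc_new - 0.5) / (auc_old - 0.5)
```

Because the ROC curve sweeps every threshold, the AUC summarizes a model's discrimination without committing to any single odds-ratio cutoff.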
[Figure 3 here]
Panel A of Figure 3 compares the ROC curves of the F-score and the ABF-score models for
the firm-year observations over which both models can be estimated. The AUC for the F-score is
70%, while the AUC for the ABF-score is 72.42%. Recall that a completely uninformative model
would have an AUC of 50%; a 1% increase in AUC is considered a noteworthy gain (Iyer et al.,
2016). By the AUC metric, the ABF-score predicts AAER firm-years with 12.1% greater accuracy
than the F-score model.24 Panel B compares the ROC curves for the AB-score and the ABF-score
for the firm-year observations over which they can both be calculated. The AB-score’s AUC is a
respectable 63.77%, but the ABF-score dominates the AB-score at every threshold, with 62.82%
greater accuracy than the AB-score.25 Finally, Panel C presents the AB-score ROC curve for the
firm-year observations that only the AB-score model can estimate (because they are financial firms
or are missing accounting variables required to calculate the F-score and ABF-score). The AB-
score’s AUC in this non-overlapping subsample is 66.32%, better than the AB-score’s AUC in the
23 There is no evidence that misstatements are more common among the firm-years for which F-score cannot be calculated. The actual AAER rate in firm-years in the non-overlapping sample is 0.33%, compared to 0.52% in the overlapping sample.
24 We follow Iyer et al. (2016) in computing the percentage improvement as (0.7242 – 0.5)/(0.7000 – 0.5) = 1.121, where 0.5 (the AUC under a non-informative random model) is subtracted from both AUCs.
25 As above, the percentage improvement is calculated as (0.7242 – 0.5)/(0.6377 – 0.5) = 1.6282.
overlapping sample (63.77% in Panel B) and considerably better than chance (50%).
Taken as a whole, the in-sample tests suggest that models based on the Benford score can
improve the prediction of earnings misstatements, both in terms of increasing the scope of firms
that can be analyzed (with the AB-score) and boosting the accuracy of the accounting-based metric
(with the ABF-score).
4.4 Out-of-sample comparisons
Prediction models are useful when they not only show a good in-sample fit but also perform
well out-of-sample. Thus we next test how well the AB-score, F-score, and ABF-score predict
AAER firm-years out-of-sample. We do so by estimating each model over half of our data and
using the other half for prediction, using a random holdout specification. We randomly select half
of the firm-year observations from the full sample to calibrate the model and use the estimated
coefficients to obtain predicted values in the other half. The random holdout approach has two
advantages over a simple partitioning into early and late subsamples, namely (i) preserving the full
time period span in both the calibration and prediction subsamples, and (ii) allowing multiple
simulations, which together yield more stable, representative relationships between misstatement
predictors and observed instances of misstatement.26 We repeat the random holdout procedure 100
times. We report the mean number and percentage of correctly predicted AAER and non-AAER
firm-years and their associated Type I and Type II errors in Table 8.
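The random holdout procedure can be sketched as follows (simulated data; the 50/50 split, 100 repetitions, and odds threshold of 1.0 mirror the design described above, while the model and data-generating process are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated panel: one informative predictor, rare positive outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1 / (1 + np.exp(-(X[:, 0] - 2)))).astype(int)

hit_rates = []
for sim in range(100):                                   # 100 random holdouts
    idx = rng.permutation(len(y))
    train, test = idx[:len(y) // 2], idx[len(y) // 2:]   # 50/50 split
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X[train], y[train])
    p = model.predict_proba(X[test])[:, 1]
    odds = p / y[train].mean()           # odds vs. the calibration base rate
    flagged = odds >= 1.0                # threshold of 1.0, as in the text
    if y[test].sum() > 0:
        hit_rates.append(flagged[y[test] == 1].mean())   # correct AAER rate

print(f"mean correct AAER rate: {np.mean(hit_rates):.3f}")
```

Averaging across the 100 draws smooths out the sampling noise that a single early/late split would leave in the reported error rates.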
[Table 8 here]
Panel A presents the results for all three models using an odds-ratio threshold of 1.0. The AB-
score model correctly identifies 72.5% of the AAER firm-years, while the F-score and ABF-score
models correctly identify 73.5% and 73.6%, respectively, similar to the in-sample rates (72.8%,
69.4%, and 69.9%, respectively, in Table 7). The out-of-sample prediction rates suggest that the
models do not suffer from overfitting. The AB-score continues to outperform in terms of the
number of AAER firm-years correctly predicted, reflecting the AB-score’s broader coverage,
while the ABF-score again delivers the lowest Type I and Type II error rates.
26 Using a simple partition into early and late subsamples yields similar results (see Internet Appendix).
The ideal model would minimize both Type I and Type II errors, but in practice the two are
traded off against each other. If regulators or auditors are not constrained in how many
investigations they can undertake, they may prefer a model that over-identifies firm-years as likely
misstated but captures more true misstatements. Such a model minimizes Type II error at the cost
of Type I error. However, if resources are limited, the investigator may prefer a less conservative
model which fails to identify more misstatement years but reduces the total number of observations
that require follow-up. In this case, Type I error is minimized. In Panels B and C we repeat our
out-of-sample simulations using threshold odds ratios of 0.7 and 1.3, respectively. As expected, a
lower threshold identifies more correct AAER firm-years but also leads to more false positives
(lower Type II and higher Type I errors), while the higher threshold does the opposite. Across all
three thresholds, the AB-score continues to identify the largest number of correct AAER firm-
years, thanks to its broader sample coverage, but the ABF-score has the highest correct AAER
firm-year prediction rate at the 1.3 threshold. While ultimately it is the investigator who must
decide how to set the odds threshold, the results in this table provide useful insight into the trade-
off between Type I and Type II errors in each model.
4.5 Model Overlap
The AB-score and F-score models approach misstatement detection from very different
perspectives (with the ABF-score combining the two), one rooted in the prevalence of leading
digits and the other capturing specific accounting information. A natural question is whether the
models identify the same misstated firm-years or whether they provide incremental predictive
power relative to each other. We address this question first by examining the overlap between the
models’ predictions and then by assessing each model’s ability to discriminate between AAERs
and non-AAERs that another model cannot.
Table 9 reports the percentage of correctly identified observations from each model (in the
overlapping sample) that is correctly identified by the other models, using the threshold odds ratio
of 1.0.
[Table 9 here]
In general, the AB-score is more successful at identifying AAERs correctly identified by the
F-score and the ABF-score than the other way around. The AB-score correctly identifies 84.4% of
the AAER firm-years correctly identified by the F-score and 91.8% of those correctly identified
by the ABF-score, while the F-score (ABF-score) correctly identifies only 72.4% (79.4%) of the
AAERs correctly identified by the AB-score. The ABF-score and F-score are more closely
correlated, with each able to predict about 90% of the other’s correctly predicted AAER firm-
years; this high correlation is not surprising given that the ABF-score includes all the F-score
variables. The high correlation between F-score and ABF-score carries over into identifying non-
AAER firm-years. In contrast, the F-score and ABF-score predict far more of the non-AAER firm-
years correctly predicted by the AB-score than the other way around (69.5% and 86.5% versus
43.2% and 52.5%). Correlations between the scores can also be used to assess model overlap. The
Pearson correlation coefficient between AB-score and F-score is 0.13; between AB-score and
ABF-score it is 0.51; and between F-score and ABF-score it is 0.85. These correlations, together
with the classification overlap results, suggest that the AB-score and F-score provide distinct
information.
To more precisely assess the incremental value of each model, we examine how each model
performs in cases where another model is inconclusive. In particular, we ask how well each model
does at distinguishing misstated versus non-misstated firm-years that have similar scores from
another model. We match each AAER firm-year with non-AAER firm-years that have scores
within 0.0005 of the AAER firm-year’s score. This technique resembles propensity score
matching. We consider both one-to-one matching and one-to-many matching, in which we
compare the AAER firm-year’s score against the mean and median score of its matched
observations. Table 10 reports the results.
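The tie-matching exercise can be sketched as follows (a hypothetical implementation of the one-to-many, mean-comparison variant; the 0.0005 tolerance follows the text):

```python
import numpy as np

def tie_break_success(score_a, score_b, is_aaer, tol=0.0005):
    """For each AAER firm-year, collect non-AAER firm-years whose score
    under model A lies within tol (a 'tie'), then count how often model B
    scores the AAER observation above the mean score of its matches.
    A hypothetical sketch; the paper also uses one-to-one and median
    variants of the comparison."""
    aaer_idx = np.flatnonzero(is_aaer)
    clean_idx = np.flatnonzero(~is_aaer)
    wins, total = 0, 0
    for i in aaer_idx:
        matches = clean_idx[np.abs(score_a[clean_idx] - score_a[i]) <= tol]
        if len(matches) == 0:
            continue                     # no tie to break for this AAER
        total += 1
        if score_b[i] > score_b[matches].mean():
            wins += 1
    return wins / total if total else float("nan")
```

A success rate near 50% would mean model B is no better than a coin flip at ranking an AAER above its tied non-AAER matches; rates well above 50% indicate incremental information.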
[Table 10 here]
Panel A shows that for F-score ties, both the ABF-score and the AB-score assign a higher
misstatement likelihood to the AAER firm-year in over 60% of the matches under all three
matching methods, suggesting that the AB-score and ABF-score add incremental information
above the F-score. Similarly, Panel B shows that both the ABF-score and the F-score are successful
at breaking AB-score ties more than 66% of the time, suggesting that the ABF-score and F-score
add incremental value above the AB-score. In contrast, Panel C shows that neither the AB-score
nor the F-score is much more successful at breaking ABF-score ties than a coin flip (with success
rates at or below 50.3%). Taken together, these results imply that the Benford-related variables (in
the AB-score and ABF-score) and the F-score variables capture distinct sets of information, rather
than capturing the same information through different channels. The ability of the ABF-score to
accurately detect misstatements when the F-score is tied suggests that it can be a valuable tool for
resource-constrained regulators and fraud examiners.
5. Testing the AB-score, F-score, and ABF-score on well-known cases of financial misconduct
As a final examination of the three measures, we assess their performance at detecting the most
notorious misstatement cases during our sample period. We identify ten high-profile cases
perpetrated by publicly traded U.S. firms during our 1979-2011 sample period by conducting
internet searches using keywords such as “financial fraud” and “largest fraud cases.” These cases
resulted in AAERs alleging misstatements in 57 firm-years. The firms involved (ordered by
primary AAER number) are Cendant Corporation (formerly CUC International), WorldCom Inc.,
Enron Corp., Tyco International, HealthSouth Corp., Adelphia Communications Corp., Waste
Management, Inc., Federal National Mortgage Association (Fannie Mae), Qwest Communications
International, and Federal Home Loan Mortgage Corporation (Freddie Mac). Figure 4 presents the
frequency distribution for each of the three scores across all 57 firm-years. Overall, the AB-score
gives the strongest signal, with no firm-year readings below 0.7 and more observations in each of
the higher ranges than the F-score or ABF-score.
[Figure 4 here]
For a closer look at these notorious cases, Table 11 presents the detailed results of applying
the AB-score, F-score, and ABF-score to the specific AAER firm-years.27
[Table 11 here]
Panel A of Table 11 details the ten misstatement cases and shows the AB-score, F-score, and
27 Both Enron and Waste Management are associated with two distinct GVKEYs in Compustat during the years their financials were materially misstated. In both cases, we include firm-year observations for both reporting entities.
ABF-score for each year that an AAER alleges misstatement. For example, for Enron Corp. in
1998 (the first year covered by its AAER), the AB-score is 1.94, the F-score is 1.32, and the ABF-
score is 2.17. All three metrics exceed the odds ratio threshold of 1.0, suggesting likely financial
misstatement for Enron Corp. in 1998. Panel B summarizes the results across the ten prominent
misstatement cases. The AB-score has both greater sample coverage and a higher success rate at
predicting misstatement within the firm-years covered. Of the 57 misstated firm-years in this
sample, the F-score and ABF-score can be computed for only 41 firm-years (69% of the
observations) because they cannot be computed for financial firms (e.g., Fannie Mae and Freddie
Mac) and require specific data not available for other firm-years (e.g., Adelphia Communications
in both years). The AB-score can be computed for all 57 firm-years. The AB-score predicts that
46 firm-years in this misstatement subset (80.7%) have above-average likelihoods of being
misstated (i.e., the AB-score exceeds the 1.0 threshold). In contrast, the F-score predicts that only
26 firm-years in this misstatement subset (45.6% of the firm-years overall) have above-average
likelihood of being misstated. Although it faces the same data limitations as the F-score, the ABF-
score performs better than the F-score, correctly identifying 31 of the firm-years (54.4% of the
firm-years overall). Notably, the ABF-score has the highest mean, hinting at the benefits of
including both the financial statement items from F-score and the numerical patterns from AB-
score in the same model. In this sample of notorious misreporting cases, supplementing the ABF-
score with the AB-score for firm-years when the ABF-score cannot be calculated improves upon
the ABF-score’s predictive ability (see last column in Panel B). Overall, it is reassuring that the
AB-score and ABF-score prediction success rates are higher here than the success rates reported
in Tables 7 and 8, as these are the most egregious cases of misstatement and one would expect
good models to detect more of them, or to detect them with greater ease.
6. Conclusion
As a condition for raising money in public capital markets, firms agree to periodically
communicate their financial health by filing financial statements. While the majority of firms
discharge this duty honestly, some willfully manipulate their financial statements to suggest better
financial health. Since it is not easy for firm outsiders to directly identify which firms manipulate
their financial statements, research in earnings management and financial misconduct uses indirect
metrics that correlate with observed instances of such behavior (i.e., ex-post misstatement).
In this study we offer two new metrics to measure the likelihood of manipulation in a firm’s
financial statements. These metrics are based on Benford’s Law, which predicts the frequency with
which leading digits should appear in naturally occurring distributions of numbers. In aggregate,
financial statement numbers closely follow Benford’s Law, but at the firm-year level there are
several systematic deviations. Controlling for these deviations in backward selection regressions,
we construct a prediction metric we call the Adjusted Benford score (AB-score). A key advantage
of the AB-score is that it can be computed for a larger sample of firm-years than the leading
accounting-based misstatement prediction metric, the F-score (which requires specific accounting
numbers and cannot be computed for financial firms). For firms with the necessary data available
to compute the F-score, we find that including the AB-score and F-score variables together in a
combined model (the ABF-score model) improves predictive ability. We find that the AB-score
performs well at detecting misstatements overall, and the ABF-score provides incremental
prediction value above the F-score.
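The building block behind these metrics, a comparison of observed leading-digit frequencies with those predicted by Benford's Law, is straightforward to compute. The sketch below is purely illustrative (the helper names and sample values are ours, not drawn from the paper's data):

```python
import math

def leading_digit(x):
    """Most significant digit of a nonzero number (e.g. 1250.0 -> 1, 0.0047 -> 4)."""
    digits = str(abs(x)).lstrip("0.").replace(".", "")
    return int(digits[0])

def benford_expected(d):
    """Benford's Law probability that the leading digit equals d (d = 1..9)."""
    return math.log10(1 + 1 / d)

# Illustrative sample of financial-statement numbers
values = [1250.0, 1834.2, 905.1, 47.8, 23.0, 3120.5, 118.9, 2.7, 160.4, 19.9]
counts = {d: 0 for d in range(1, 10)}
for v in values:
    counts[leading_digit(v)] += 1

# Observed share of each leading digit versus the Benford prediction
for d in range(1, 10):
    print(d, counts[d] / len(values), round(benford_expected(d), 4))
```

In practice the comparison is run over all reported financial statement line items for a firm-year, and the deviation from the predicted shares feeds the scores described above.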
In a survey article on the current state of financial reporting misconduct research, Amiram et
al. (2017) point to the gap in our understanding of the estimation errors involved in financial fraud-
related research. While researchers have used several measures to gauge the likelihood, extent, and
damages from financial reporting misconduct, there has been less focus on assessing the
performance of the metrics used in such research. We do extensive testing of the AB-score, F-
score, and ABF-score to validate them as indicators of the likelihood of financial misreporting.
Our bottom-line advice is that researchers interested in misstatement detection should use the
ABF-score for firm-years when the required data are available and the AB-score otherwise.
References
Amiram, Dan, Zahn Bozanic, and Ethan Rouen. 2015. Financial statement errors: evidence from
the distributional properties of financial statement numbers. Review of Accounting
Studies 20(4): 1540-1593.
Amiram, Dan, Zahn Bozanic, James Cox, Quentin Dupont, Jonathan Karpoff, and Richard Sloan.
2017. Financial reporting fraud and other forms of misconduct: A multidisciplinary review of
the literature. Review of Accounting Studies, forthcoming.
Bayley, Luke, and Stephen L. Taylor. 2007. Identifying earnings overstatements: a practical test.
Working paper.
Beneish, Messod. 1999. The detection of earnings manipulation. Financial Analysts Journal 55(5):
24-36.
Benford, Frank. 1938. The law of anomalous numbers. Proceedings of the American Philosophical
Society 78(4): 551-572.
Berg, Tobias, Valentin Burg, Ana Gombovic, and Manju Puri. 2018. On the rise and fall of
FinTechs – credit scoring using digital footprints. Working paper.
Bergstresser, Daniel, and Thomas Philippon. 2006. CEO incentives and earnings
management. Journal of Financial Economics 80(3): 511-529.
Bernile, Gennaro, and Gregg A. Jarrell. 2009. The impact of the options backdating scandal on
shareholders. Journal of Accounting and Economics 47(1-2): 2-26.
Bloomfield, Robert J. 2012. A pragmatic approach to more efficient corporate disclosure.
Accounting Horizons 26(2): 357-370.
Bonner, Sarah E., Zoe-Vonna Palmrose, and Susan M. Young. 1998. Fraud type and auditor
litigation: an analysis of SEC accounting and auditing enforcement releases. The Accounting
Review 73(4): 503-532.
Bowler, Blake D. 2017. Are going concern opinions associated with lower audit impact? Working
paper.
Boyle, Erik S., and Melissa F. Lewis-Western. 2018. The impact of audits on financial statement
error in the presence of incentive and opportunity. Working paper.
Bradley, Daniel, Sinan Gokkaya, Xi Liu, and Fei Xie. 2017. Are all analysts created equal?
Industry expertise and monitoring effectiveness of financial analysts. Journal of Accounting
and Economics 63(2): 179-206.
Brazel, Joseph, Keith Jones, and Mark Zimbelman. 2009. Using nonfinancial measures to assess
fraud risk. Journal of Accounting Research 47(5): 1135-1166.
Burns, Natasha, and Simi Kedia. 2006. The impact of performance-based compensation on
misreporting. Journal of Financial Economics 79(1): 35-67.
Carcello, Joseph V., Ann Vanstraelen, and Michael Willenborg. 2009. Rules rather than discretion
in audit standards: going-concern opinions in Belgium. The Accounting Review 84(5): 1395-
1428.
Chan, Lilian, Kevin Chen, and Tai-Yuan Chen. 2013. The effects of firm-initiated clawback
provisions on bank loan contracting. Journal of Financial Economics 110(3): 659-679.
Cheng, Ing-Haw, Harrison Hong, and Jose Scheinkman. 2015. Yesterday's heroes: compensation
and risk at financial firms. Journal of Finance 70(2): 839-879.
Cohen, Randolph B., Christopher Polk, and Tuomo Vuolteenaho. 2003. The value spread. Journal
of Finance 58(2): 609-641.
Dechow, Patricia, Weili Ge, Chad Larson, and Richard Sloan. 2011. Predicting material
accounting misstatements. Contemporary Accounting Research 28(1): 17-82.
DeFond, Mark, Chee Yeow Lim, and Yoonseok Zang. 2015. Client conservatism and auditor-
client contracting. The Accounting Review 91(1): 69-98.
Durtschi, Cindy, William Hillison, and Carl Pacini. 2004. The effective use of Benford’s law to
assist in detecting fraud in accounting data. Journal of Forensic Accounting 5(1): 17-34.
Dyck, Alexander, Adair Morse, and Luigi Zingales. 2013. How pervasive is corporate fraud?
Working paper.
Fang, Vivian, Allen Huang, and Jonathan Karpoff. 2016. Short selling and earnings management:
A controlled experiment. Journal of Finance 71(3): 1251-1294.
Fich, Eliezer, and Anil Shivdasani. 2007. Financial fraud, director reputation, and shareholder
wealth. Journal of Financial Economics 86(2): 306-336.
Fulghieri, Paolo, Günter Strobl, and Han Xia. 2013. The economics of solicited and unsolicited
credit ratings. Review of Financial Studies 27(2): 484-518.
Geyer, Adelina, and Joan Marti. 2012. Applying Benford's law to volcanology. Geology 40(4):
327-330.
Iyer, Rajkamal, Asim Ijaz Khwaja, Erzo F. P. Luttmer, and Kelly Shue. 2016. Screening peers
softly: inferring the quality of small borrowers. Management Science 62(6): 1554-1577.
Jia, Yuping, Lawrence Van Lent, and Yachang Zeng. 2014. Masculinity, testosterone, and
financial misreporting. Journal of Accounting Research 52(5): 1195-1246.
Karpoff, Jonathan, D. Scott Lee, and Gerald Martin. 2008. The cost to firms of cooking the
books. Journal of Financial and Quantitative Analysis 43(3): 581-611.
Karpoff, Jonathan, Allison Koester, D. Scott Lee, and Gerald Martin. 2017. Proxies and databases
in financial misconduct research. The Accounting Review, forthcoming.
Klimek, Peter, Yuri Yegorov, Rudolf Hanel, and Stefan Thurner. 2012. Statistical detection of
systematic election irregularities. Proceedings of the National Academy of
Sciences 109(41): 16469-16473.
Lawless, Jerald, and Kishore Singhal. 1978. Efficient screening of non-normal regression
models. Biometrics 34(2): 318-327.
Mir, Tariq. 2014. The Benford law behavior of the religious activity data. Physica A: Statistical
Mechanics and its Applications 408(1): 1-9.
Newcomb, Simon. 1881. Note on the frequency of use of the different digits in natural
numbers. American Journal of Mathematics 4(1): 39-40.
Nigrini, Mark. 1999. I've got your number: How a mathematical phenomenon can help CPAs
uncover fraud and other irregularities. Journal of Accountancy 187(5): 79-83.
Perols, Johan, Robert Bowen, Carsten Zimmermann, and Basamba Samba. 2017. Finding needles
in a haystack: Using data analytics to improve fraud prediction. The Accounting Review 92(2):
221-245.
Richardson, Scott, Richard Sloan, Mark Soliman, and A. Irem Tuna. 2005. Accrual reliability,
earnings persistence and stock prices. Journal of Accounting and Economics 39(3): 437-485.
Shumway, Tyler. 2001. Forecasting bankruptcy more accurately: A simple hazard model. Journal
of Business 74(1): 101-124.
Skinner, Douglas, and Suraj Srinivasan. 2012. Audit quality and auditor reputation: Evidence from
Japan. The Accounting Review 87(5): 1737-1765.
Varian, Hal. 1972. Benford's Law (Letters to the Editor). The American Statistician 26(3): 62-66.
Wang, Jialin. 2011. Benford's law and the decreasing reliability of accounting data. Economist's
View blog post, October 12, 2011, http://economistsview.typepad.com/economistsview
/2011/10/benfords-law-and-the-decreasing-reliability-of-accounting-data.html.
Table 1: Distribution of AAERs by number of years covered

Total AAERs: 578    Mean duration: 2.31 years    Median duration: 2.00 years

[Table: number of AAERs by duration, from 1 to 16 years; single-year AAERs are the most common (260 of 578).]

This table summarizes the distribution of the duration of AAERs in our sample. Duration is defined as the number of consecutive years for which a firm's financials are alleged to be misstated by the AAER.
Table 2: Adherence of Compustat numbers to Benford’s Law
Leading Digit 1 2 3 4 5 6 7 8 9 Total
Count (millions) 10.73 6.16 4.33 3.33 2.75 2.29 1.99 1.74 1.56 34.88
Percent of total 30.76% 17.66% 12.40% 9.55% 7.89% 6.57% 5.69% 5.00% 4.47%
Benford Prediction 30.10% 17.61% 12.49% 9.69% 7.92% 6.69% 5.80% 5.12% 4.58%
Deviation 0.66% 0.05% -0.09% -0.14% -0.03% -0.12% -0.10% -0.12% -0.10%
Abs(Deviation) 0.66% 0.05% 0.09% 0.14% 0.03% 0.12% 0.10% 0.12% 0.10%
Mean Abs Deviation 0.1580%
B_Raw score (=100*Mean Abs Dev.) 0.1580
This table illustrates how closely the financial statement numbers reported in the Compustat database follow the leading digit distribution predicted by Benford's Law. Numbers are drawn from balance sheets, income statements, and cash flow statements for U.S. firms in the Compustat database from 1979 to 2011. The first row lists the leading digits 1 through 9. The second (third) row reports the count (percentage) of numbers with the corresponding leading digit. The fourth row reports the proportion with which each digit is expected to appear under Benford's Law. The fifth row reports the deviation from Benford's Law (row three minus row four), and the sixth row presents the absolute value of that difference. Absolute values are averaged across all nine digits in the seventh row to compute the mean absolute deviation (comparable to Amiram et al.'s (2015) FSD Score). Finally, the mean absolute deviation is scaled up by a factor of 100 to create the B_Raw score.
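The bottom rows of Table 2 follow mechanically from the digit shares above. As an illustration, the snippet below recomputes the score from the table's rounded percentages (so it reproduces the reported B_Raw of 0.1580 only approximately):

```python
import math

# Observed leading-digit shares from Table 2 (percent of total)
observed = [30.76, 17.66, 12.40, 9.55, 7.89, 6.57, 5.69, 5.00, 4.47]
# Benford's predicted shares, also in percent: 100 * log10(1 + 1/d)
benford = [100 * math.log10(1 + 1 / d) for d in range(1, 10)]

# Mean absolute deviation across the nine digits, in percentage points;
# expressed this way it equals the B_Raw score (100 times the deviation as a fraction)
b_raw = sum(abs(o - b) for o, b in zip(observed, benford)) / 9
print(round(b_raw, 4))
```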
Table 3: Predicting AAERs using Benford score variables

Dependent variable = 1 for AAER firm-year, 0 for non-AAER firm-year

Variable               (1) All Variables    (2) Selected Variables
B_Raw                  1.7695***            1.4727***
                       (0.000)              (0.000)
B_Input                0.7004***            0.6286***
                       (0.000)              (0.000)
B_Year                 -3.6381***           -3.2933***
                       (0.000)              (0.000)
B_Industry             -0.2024*
                       (0.055)
B_Firm                 -0.00971
                       (0.676)
Intercept              -11.7223***          -10.7153***
                       (0.000)              (0.000)
#Obs                   239,714              296,645
#AAER firm-years       1,196                1,336
#non-AAER firm-years   238,518              295,309

This table estimates the relationship between the SEC's issuance of AAERs and the Raw Benford Score (B_Raw) and four adjustments applied to B_Raw, using logistic regressions. The dependent variable is an indicator that equals 1 for an AAER, 0 otherwise. The independent variables are B_Raw and four standardized adjustments to B_Raw: B_Input accounts for heterogeneity in the number of inputs, B_Year accounts for year differences, B_Industry accounts for industry differences, and B_Firm accounts for firm baseline differences in B_Raw. In specification (1) all variables are included. In specification (2), variables are selected via backward elimination using the computational algorithm of Lawless and Singhal (1978). P-values are in parentheses below coefficient estimates. ***, **, * denote statistical significance at the 1, 5, and 10 percent levels, respectively.
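The backward elimination used for specification (2) follows a general recipe: fit the full model, then repeatedly drop the variable whose removal most improves an information criterion, stopping when no removal helps. The sketch below illustrates that recipe with an AIC criterion and an ordinary least squares fit standing in for the paper's logistic regressions; the data are synthetic and the function names are ours:

```python
import math

def fit_rss(X, y, cols):
    """OLS on the selected columns via the normal equations; returns the
    residual sum of squares (Gaussian elimination is fine for a few columns)."""
    n, k = len(y), len(cols)
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in cols] for a in cols]
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in cols]
    for c in range(k):                      # forward elimination with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        v[c], v[p] = v[p], v[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            v[r] -= f * v[c]
    beta = [0.0] * k
    for c in range(k - 1, -1, -1):          # back substitution
        beta[c] = (v[c] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return sum((y[i] - sum(X[i][col] * beta[j] for j, col in enumerate(cols))) ** 2
               for i in range(n))

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def backward_eliminate(X, y):
    """Drop, one at a time, the variable whose removal most improves the AIC;
    stop when no single removal helps."""
    current = list(range(len(X[0])))
    best = aic(fit_rss(X, y, current), len(y), len(current))
    while len(current) > 1:
        trials = [(aic(fit_rss(X, y, [c for c in current if c != d]),
                       len(y), len(current) - 1), d) for d in current]
        score, drop = min(trials)
        if score >= best:
            break
        best = score
        current = [c for c in current if c != drop]
    return current

# Synthetic data: y depends on columns 0-2; column 3 is irrelevant noise.
X = [[1.0, i / 10.0, ((i * 7) % 13) / 6.0, math.cos(3.0 * i)] for i in range(30)]
y = [1.0 + 2.0 * row[1] - row[2] + 0.1 * math.sin(i) for i, row in enumerate(X)]
print(backward_eliminate(X, y))
```

The Lawless and Singhal (1978) algorithm the paper uses makes this search computationally efficient for logistic models; the brute-force refitting above conveys only the selection logic.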
Table 4: Descriptive statistics
Panel A: All Compustat firm-years
Variable # Obs Mean Median Std Dev # Obs Mean Median Std Dev # Obs Mean Median Std Dev
Inputs 296,645 117.588 117.000 31.900 177,452 116.305 117.000 24.807 -- -- -- --
B_Raw 296,645 3.444 3.200 1.337 177,452 3.293 3.102 1.195 -- -- -- --
B_Input 296,645 -0.003 -0.081 0.979 177,452 0.014 -0.056 0.985 -- -- -- --
B_Year 296,645 -0.014 -0.181 0.918 177,452 -0.119 -0.253 0.837 -- -- -- --
B_Industry 296,604 -0.011 -0.148 0.934 177,418 -0.010 -0.139 0.937 -- -- -- --
B_Firm 239,742 -0.048 -0.150 1.800 138,553 -0.018 -0.125 1.852 -- -- -- --
RSST_Accruals 228,026 0.025 0.026 0.366 156,167 0.033 0.028 0.334 151,862 0.032 0.026 --
Chg_Rcv 263,885 0.019 0.007 0.091 159,302 0.018 0.008 0.091 151,928 0.017 0.008 --
Chg_Invt 265,837 0.008 0.000 0.061 160,206 0.011 0.000 0.068 152,741 0.011 0.000 --
Chg_CashSales 248,722 0.126 0.061 1.440 150,593 0.172 0.070 1.164 135,333 0.208 0.079 --
Pct_SoftA 266,719 0.542 0.570 0.283 161,578 0.505 0.530 0.258 167,982 0.509 0.535 --
Chg_ROA 242,945 -0.007 -0.001 0.267 142,082 -0.011 -0.002 0.244 140,380 -0.010 -0.002 --
Issue 278,660 0.824 1.000 0.381 174,828 0.826 1.000 0.380 166,712 0.826 1.000 --
Abn_Chg_Emp 221,149 -0.092 -0.046 0.545 136,946 -0.095 -0.050 0.545 134,837 -0.093 -0.049 --
OL 296,645 0.651 1.000 0.477 177,452 0.699 1.000 0.459 168,481 0.710 1.000 --
MASR 191,680 0.065 -0.070 0.842 118,385 0.051 -0.106 0.904 110,303 0.008 -0.114 --
Lag_MASR 190,664 0.107 -0.061 1.044 117,684 0.093 -0.098 1.111 99,197 0.030 -0.099 --
This table displays the number of observations, mean, median, and standard deviation for the number of inputs into the Benford variables and the variables used in the DGLS models for the full sample period (1979-2011) and the sample period used in DGLS (1979-2002). B_Raw is the raw Benford Score; B_Input is the raw Benford Score adjusted for the number of inputs used to compute B_Raw; B_Year is the time-series adjusted Benford Score; B_Industry and B_Firm are the industry- and firm-adjusted Benford scores, respectively; RSST_Accruals is change in noncash net operating assets; Chg_Rcv is change in receivables; Chg_Invt is change in inventory; Pct_SoftA is percent soft assets; Chg_CashSales is change in cash sales; Chg_ROA is change in return on assets; Issue is an indicator equal to 1 if the firm issued debt or equity during that year, 0 otherwise; Abn_Chg_Emp is abnormal change in employees; OL is an indicator equal to 1 if the company has operating leases, 0 otherwise; MASR is market-adjusted stock returns; and Lag_MASR is one-year-lagged market-adjusted stock returns. The last three columns report the mean and median values and the number of observations reported in DGLS Table 6 for comparison. Panel A summarizes the full sample of U.S. firm-years in Compustat between 1979 and 2011; Panel B presents the subset containing only those firm-years for which an AAER alleges overstatement of income.
Column groups, left to right: Full Sample (1979-2011); DGLS Sample (1979-2002); DGLS Table 6.
Panel B: AAER firm-years

Column groups, left to right: Full Sample (1979-2011); DGLS Sample (1979-2002); DGLS Table 6.

Variable        # Obs   Mean     Median   Std Dev   # Obs  Mean     Median   Std Dev   # Obs  Mean    Median   Std Dev
Inputs          1,336   133.160  131.000  30.406    624    124.607  122.000  25.784    --     --      --       --
B_Raw           1,336   3.015    2.871    1.016     624    3.054    2.914    1.026     --     --      --       --
B_Input         1,336   -0.050   -0.133   0.948     624    -0.042   -0.125   0.955     --     --      --       --
B_Year          1,336   -0.319   -0.421   0.699     624    -0.298   -0.409   0.716     --     --      --       --
B_Industry      1,336   -0.255   -0.357   0.785     624    -0.172   -0.302   0.854     --     --      --       --
B_Firm          1,196   -0.176   -0.269   1.795     523    -0.110   -0.159   1.958     --     --      --       --
RSST_Accr       1,102   0.111    0.062    0.317     556    0.115    0.061    0.359     557    0.126   0.074    --
Chg_Rcv         1,261   0.048    0.025    0.102     581    0.059    0.031    0.116     561    0.061   0.036    --
Chg_Invt        1,239   0.025    0.001    0.072     572    0.040    0.007    0.089     557    0.039   0.008    --
Chg_CashSales   1,214   0.364    0.155    1.403     546    0.467    0.182    1.443     501    0.492   0.217    --
Pct_SoftA       1,236   0.646    0.694    0.228     577    0.644    0.678    0.213     604    0.642   0.682    --
Chg_ROA         1,200   -0.011   -0.004   0.198     528    -0.030   -0.013   0.240     506    -0.024  -0.012   --
Issue           1,297   0.948    1.000    0.223     618    0.930    1.000    0.255     599    0.932   1.000    --
Abn_Chg_Emp     1,142   -0.173   -0.067   0.749     506    -0.221   -0.090   0.862     489    -0.223  -0.103   --
OL              1,336   0.828    1.000    0.378     624    0.838    1.000    0.369     604    0.821   1.000    --
MASR            1,181   0.172    -0.025   0.951     535    0.188    -0.088   1.107     463    0.193   -0.113   --
Lag_MASR        1,168   0.233    0.017    1.059     526    0.261    0.006    1.178     393    0.332   0.031    --
Table 5: Estimated model coefficients
Panel A: F-Score Models Replication
Reported 1979-2002 1979-2011 Reported 1979-2002 1979-2011 Reported 1979-2002 1979-2011
RSST_Accr 0.79 0.66 0.62 0.67 0.62 0.62 0.91 0.49 0.60
Chg_Rcv 2.52 2.21 1.98 2.46 2.27 1.91 1.73 2.59 2.01
Chg_Invt 1.19 1.84 0.89 1.39 1.84 0.81 1.45 1.64 0.50
Pct_SoftA 1.98 2.23 1.89 2.01 2.09 1.76 2.27 2.17 1.87
Chg_CashSales 0.17 0.13 0.09 0.16 0.11 0.09 0.16 0.12 0.09
Chg_ROA -0.93 -0.94 -0.44 -1.03 -1.05 -0.54 -1.46 -1.20 -0.73
Issue 1.03 0.98 1.39 0.98 0.98 1.35 0.65 0.76 1.12
Abn_Chg_Emp -0.15 -0.14 -0.09 -0.12 -0.15 -0.11
OL 0.42 0.61 0.73 0.35 0.46 0.58
MASR 0.08 0.00 0.00
Lag_MASR 0.10 0.00 0.00
Intercept -7.89 -7.96 -7.72 -8.25 -8.39 -8.23 -7.97 -8.01 -7.80
Panel A of this table reports coefficients from the three F-score models (M1, M2, and M3). The column labeled Reported presents the coefficients from the corresponding F-score models, as reported in Table 7 of DGLS. For each model, the table also presents coefficients estimated over the DGLS sample period (1979-2002) and over the full sample period (1979-2011). Panel B presents estimated coefficients from the AB-score model and the ABF-score model, which includes all variables from AB-score and F-Score M1 models. No measure of statistical significance is reported since individual variables’ predictive abilities are not of interest in this study.
Column groups, left to right: F-score M1; F-score M2; F-score M3 (each with Reported, 1979-2002, and 1979-2011 columns).
Panel B: AB-score and ABF-score Models
1979-2002 1979-2011 1979-2002 1979-2011
RSST_Accr 0.70 0.74
Chg_Rcv 2.35 2.27
Chg_Invt 1.83 0.72
Pct_SoftA 2.24 1.81
Chg_CashSales 0.13 0.11
Chg_ROA -1.07 -0.58
Issue 0.76 1.08
B_Raw 1.03 1.47 1.23 1.71
B_Input 0.40 0.63 0.49 0.66
B_Year -2.19 -3.29 -2.60 -3.70
Intercept -9.38 -10.72 -12.31 -13.75
Column groups, left to right: AB-score (1979-2002, 1979-2011); ABF-score (1979-2002, 1979-2011).
Table 6: Comparison of model scope
Model         AAER firm-years   non-AAER firm-years   AAER firm-years   non-AAER firm-years   AAER firm-years   non-AAER firm-years   Total firm-years
F-score M1 494 132,967 492 132,139 977 186,435 187,412
F-score M2 449 122,366 450 117,714 899 164,886 165,785
F-score M3 353 88,032 419 93,055 843 126,232 127,075
AB-score -- -- 697 212,902 1,336 295,309 296,645
ABF-score -- -- 492 132,139 977 186,435 187,412
This table reports how many AAER and non-AAER firm-years each of the models can be estimated over given data availability from Compustat and CRSP databases. The columns labeled DGLS 1979-2002 report the number of observations included in the F-score analyses in Table 7 of DGLS. The columns labeled Replication 1979-2002 report the number of observations included when we re-estimate the F-score models over the DGLS sample period, following the DGLS sample selection procedure and variable definitions as closely as possible. The columns labeled 1979-2011 report the number of observations included when we estimate each model over the full sample period.
Column groups, left to right: DGLS 1979-2002; Replication 1979-2002; 1979-2011.
Table 7: In-sample comparisons of model accuracy
Panel A: All models, full sample
Model        Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error   Unclassified
AB-score # Observations 296,645 973 149,269 146,040 363 0
% Sample 72.8% 50.6% 49.5% 27.2% 0%
F-score # Observations 187,412 678 111,128 75,307 299 109,233
% Sample 69.4% 59.6% 40.4% 30.6% 36.8%
ABF-score # Observations 187,412 683 113,751 72,684 294 109,233
% Sample 69.9% 61.0% 39.0% 30.1% 36.8%
Panel B: AB-score model, observations not estimated by F-score and ABF-score
Sample                       Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error
All non-overlapping # Observations 109,233 264 55,550 53,324 95
% Sample 73.5% 51.0% 49.0% 26.5%
Financial non-overlapping # Observations 51,373 123 28,354 22,844 52 47.03%
% Sample 70.3% 55.4% 44.6% 29.7%
Non-financial non-overlapping # Observations 57,860 135 28,220 29,456 49 52.97%
% Sample 73.4% 48.9% 51.1% 26.6%
This table summarizes model accuracy for the AB-score, F-score (M1), and ABF-score models. Each observation’s odds of being a misstated firm-year are determined from model coefficients, and an observation is predicted to be misstated if its odds ratio exceeds the 1.0 threshold. Panel A reports results for the entire sample over which each model can be estimated. Panel B reports results for the firm-year observations which the AB-score model can estimate but the F-score and ABF-score models cannot because the firms are financial firms or required firm/year data are missing in Compustat. For each model and sample, observations are tallied in four bins. The total number of firm-year observations that can be estimated by each model is reported in the column labeled Sample firm-years . Observations correctly identified as misstated firm-years are tallied under the Correct AAER firm-years column; those correctly identified as not misstated, under the Correct non-AAER firm-years column; those mistakenly identified as misstated (false positives), under Type I Error ; those mistakenly identified as not misstated (false negatives), under Type II Error ; and those Compustat observations which cannot be estimated by the model, under Unclassified . % Sample divides the number of observations within a bin by the number of observations in that category over which each model is estimated (or by number of firm-year observations in Compustat, in the case of Unclassified ).
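The classification rule in this table reduces to a threshold comparison on each firm-year's odds ratio. A minimal sketch (the odds ratios and labels below are made up; in the paper they come from the fitted AB-score, F-score, or ABF-score models):

```python
def tally(odds_ratios, is_aaer, threshold=1.0):
    """Predict 'misstated' when the odds ratio exceeds the threshold, and tally
    the four outcome bins used in Tables 7 and 8."""
    bins = {"correct_aaer": 0, "correct_non_aaer": 0, "type_i": 0, "type_ii": 0}
    for odds, actual in zip(odds_ratios, is_aaer):
        predicted = odds > threshold
        if predicted and actual:
            bins["correct_aaer"] += 1        # true positive
        elif not predicted and not actual:
            bins["correct_non_aaer"] += 1    # true negative
        elif predicted:
            bins["type_i"] += 1              # false positive
        else:
            bins["type_ii"] += 1             # false negative
    return bins

print(tally([1.42, 0.96, 2.20, 0.80], [True, True, False, False]))
```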
Table 8: Out-of-sample comparisons of model accuracy
Panel A: Random out-of-sample simulations, Threshold = 1.0

Model       Sample firm-years   Correct AAER firm-years   Correct non-AAER firm-years   Type I Error   Type II Error   Unclassified
AB-score   # Observations  148,322  485  74,703  72,949  184  0
           % Sample  72.5%  50.6%  49.4%  27.5%  0.0%
F-score    # Observations  93,701  359  50,903  42,310  129  54,621
           % Sample  73.5%  54.6%  45.4%  26.5%  36.8%
ABF-score  # Observations  93,701  359  52,771  40,442  129  54,621
           % Sample  73.6%  56.6%  43.4%  26.4%  36.8%

Panel B: Random out-of-sample simulations, Threshold = 0.7

AB-score   # Observations  148,322  590  42,594  105,058  79  0
           % Sample  88.2%  28.9%  71.2%  11.8%  0.0%
F-score    # Observations  93,701  425  35,442  57,770  63  54,621
           % Sample  87.2%  38.0%  62.0%  12.8%  36.8%
ABF-score  # Observations  93,701  415  38,644  54,568  73  54,621
           % Sample  85.0%  41.5%  58.5%  15.1%  36.8%

Panel C: Random out-of-sample simulations, Threshold = 1.3

AB-score   # Observations  148,322  316  107,787  39,866  353  0
           % Sample  47.3%  73.0%  27.0%  52.7%  0.0%
F-score    # Observations  93,701  285  64,192  29,021  203  54,621
           % Sample  58.5%  68.9%  31.1%  41.5%  36.8%
ABF-score  # Observations  93,701  306  64,034  29,179  182  54,621
           % Sample  62.7%  68.7%  31.3%  37.3%  36.8%
This table summarizes the results of out-of-sample tests of the AB-score, F-score, and ABF-score models. In each panel, half of the observations are randomly selected into the estimation period and the other half constitute the prediction period; the procedure is repeated 100 times, and mean counts and percentages are reported in the panel. Panel A uses an odds ratio threshold of 1.0; panels B and C use thresholds of 0.7 and 1.3. For each model and sample, observations are tallied in four bins. The total number of firm-year observations that can be estimated by each model is reported in the column labeled Sample firm-years . Observations correctly identified as misstated firm-years are tallied under the Correct AAER firm-years column; those correctly identified as not misstated, under the Correct non-AAER firm-years column; those mistakenly identified as misstated (false positives), under Type I Error ; those mistakenly identified as not misstated (false negatives), under Type II Error ; and those Compustat observations which cannot be estimated by the model, under Unclassified . % Sample divides the number of observations within a bin by the number of observations in that category over which each model is estimated (or by number of firm-year observations in Compustat, in the case of Unclassified).
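The split-half design can be sketched as follows. This is a simplified stand-in: instead of re-estimating the model on each estimation half, the sketch reuses a precomputed odds ratio per firm-year, and the rows are hypothetical:

```python
import random

def split_half_rate(rows, n_trials=100, threshold=1.0, seed=7):
    """Average out-of-sample share of AAER firm-years flagged as misstated
    across random half-sample splits. Each row is a dict with 'is_aaer' and a
    precomputed 'odds' ratio (a placeholder for refitting on the other half)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        prediction = shuffled[len(shuffled) // 2:]   # held-out half
        aaer = [r for r in prediction if r["is_aaer"]]
        if aaer:
            rates.append(sum(r["odds"] > threshold for r in aaer) / len(aaer))
    return sum(rates) / len(rates)

rows = ([{"is_aaer": True, "odds": 1.5}] * 4 +
        [{"is_aaer": False, "odds": 0.5}] * 4)
print(split_half_rate(rows))
```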
Table 9: Model prediction overlap
Model AB-score F-score ABF-score
AAER firm-years AB-score - 84.4% 91.8%
F-score 72.4% - 90.0%
ABF-score 79.4% 90.7% -
non-AAER firm-years AB-score - 43.2% 52.5%
F-score 69.5% - 87.3%
ABF-score 86.5% 89.4% -
This table reports the percentage of observations correctly classified as AAER firm-years or non-AAER firm-years by the model (AB-score, F-score, or ABF-score) in each column that are also correctly identified by the model in each row. Only firm-year observations that can be estimated by all three models are included in this analysis. The odds ratio threshold is set at 1.0.
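Each cell of Table 9 is a set computation over correctly classified firm-years. A sketch with hypothetical firm-year identifiers:

```python
def overlap_pct(correct_by_column_model, correct_by_row_model):
    """Percent of firm-years correctly classified by the column model that the
    row model also classifies correctly (arguments are sets of firm-year ids)."""
    if not correct_by_column_model:
        return 0.0
    shared = correct_by_column_model & correct_by_row_model
    return 100.0 * len(shared) / len(correct_by_column_model)

print(overlap_pct({"A-1999", "B-2000", "C-2001", "D-2002"},
                  {"B-2000", "C-2001", "D-2002", "E-2003"}))
```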
Table 10: Ability of models to break each other's ties
Panel A: Breaking F-score ties

                     Tie-breaker: ABF-score                         Tie-breaker: AB-score
Matching method      % AAER ABF-score > non-AAER ABF-score          % AAER AB-score > non-AAER AB-score
One-to-one 61.7% 61.4%
One-to-many, mean 65.6% 64.4%
One-to-many, median 66.7% 65.1%
Panel B: Breaking AB-score ties

                     Tie-breaker: ABF-score                         Tie-breaker: F-score
Matching method      % AAER ABF-score > non-AAER ABF-score          % AAER F-score > non-AAER F-score
One-to-one 69.8% 68.0%
One-to-many, mean 66.6% 66.9%
One-to-many, median 73.9% 73.8%
Panel C: Breaking ABF-score ties

                     Tie-breaker: AB-score                          Tie-breaker: F-score
Matching method      % AAER AB-score > non-AAER AB-score            % AAER F-score > non-AAER F-score
One-to-one           49.9%                                          48.5%
One-to-many, mean    47.6%                                          42.1%
One-to-many, median  48.1%                                          50.3%
This table summarizes each model's ability to discern AAER and non-AAER firm-year observations that the other model could not distinguish. Each AAER firm-year is matched to one or many non-AAER firm-years with similar predicted misstatement likelihoods. Panel A uses F-scores (from the F-score M1 model) as the misstatement likelihood and compares ABF-scores and AB-scores; Panel B uses AB-scores as the misstatement likelihood and compares ABF-scores and F-scores; Panel C uses ABF-scores as the misstatement likelihood and compares AB-scores and F-scores. To be considered a match, non-AAER firm-years must have odds ratios within 0.0005 of the AAER firm-year in question. When one AAER firm-year is matched to all of the non-AAER firm-years with odds ratios within 0.0005, the AAER firm-year odds ratio is compared to either the mean (one-to-many, mean) or the median (one-to-many, median) of all of the matched non-AAER firm-years.
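The matching procedure can be summarized in a few lines of code. The rows below are hypothetical; each is a pair of (primary-model odds ratio, tie-breaker odds ratio):

```python
from statistics import mean, median

def tie_break_success(aaers, non_aaers, tol=0.0005, how="one-to-many, mean"):
    """For each AAER firm-year, match the non-AAER firm-years whose primary-model
    odds ratio lies within tol, then test whether the tie-breaking model scores
    the AAER observation above its matched benchmark."""
    wins = total = 0
    for p_odds, t_odds in aaers:
        matched = [t for p, t in non_aaers if abs(p - p_odds) <= tol]
        if not matched:
            continue
        if how == "one-to-one":
            benchmark = matched[0]               # pair with a single match
        elif how == "one-to-many, mean":
            benchmark = mean(matched)
        else:                                    # "one-to-many, median"
            benchmark = median(matched)
        total += 1
        wins += t_odds > benchmark
    return 100.0 * wins / total if total else 0.0

aaers = [(1.0000, 2.0)]
non_aaers = [(1.0001, 1.5), (0.9999, 1.0), (1.2000, 9.9)]
print(tie_break_success(aaers, non_aaers))
```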
Table 11: High-profile cases of financial misconduct resulting in SEC AAERs
Panel A: AB-score, F-score, and ABF-score by AAER firm-year

AAER ID (year)   Company name   Description   Year   AB-score   F-score   ABF-score
1986  1.42  n/a   n/a
1987  0.96  1.88  1.46
1988  0.90  1.69  1.24
1989  0.90  1.26  0.81
1990  1.05  2.20  1.86
1991  1.02  2.26  1.87
1992  0.80  2.46  1.57
1993  1.00  1.65  1.33
1994  0.83  1.60  1.07
1995  1.19  1.88  1.90
1996  1.27  1.99  2.18
1997  1.51  n/a   n/a
1998  1.80  n/a   n/a

2000  0.89  1.02  0.73
2001  1.05  1.00  0.84

1998  1.94  1.32  2.17
1999  1.56  1.33  1.83
2000  2.70  2.47  5.88

1998  1.60  0.41  0.57
1999  1.91  0.34  0.55
2000  1.68  0.59  0.90

1998  2.09  1.58  2.74
1999  2.44  2.07  4.46
2000  2.15  1.69  3.04
2001  1.74  n/a   n/a
1272(2000)
1678 (2004)
1821(2003)
"[Two executives] granted themselves hundreds of millions of dollars in secret low interest and interest-free loans from the company that they used for personal expenses. They later caused Tyco to forgive tens of millions of dollars they owed"
Tyco International Ltd.
1852 (2003)
"For the last three fiscal years of the scheme, pre-tax income was artificially overstated by nearly one third, an aggregate misstatement of approximately one-half billion dollars"
Cendant Corporation (formerly CUC International)
"WorldCom materially overstated the income it reported on its financial statements by approximately $9 billion"
WorldCom, Inc.
"The fraudulent transactions included the "Raptor" sham hedges used by Enron to avoid earnings write-downs of over $1 billion, the fraudulent "sale" of an interest in Nigerian barges to Merrill Lynch, and "prepay" transactions, which were loans disguised as commodity sales contracts, used by Enron to overstate its cash flows by hundreds of millions of dollars."
Enron Corp.
Enron Oil and Gas Co.
Panel A of this table summarizes ten of the highest profile financial misconduct cases in the 1979-2011 period. AAER identifiers from the SEC are listed in the first column followed by the company name and a quote from the AAER or a related SEC legal release illustrating the magnitude of the transgression. To the right are years affected by the misstatement and odds ratios predicted by the AB-score, F-score (M1), and ABF-score models. Firm-years for which the AB-score, F-score, or ABF-score model cannot be estimated are marked n/a. Panel B summarizes how many firm-years each model is able to estimate and summarizes the success and failure rates of each model in this subset of high-profile fraud cases. Correctly identified % is calculated out of total number of fraud firm-years (57); Type II error is calculated out of number of firm-years estimated by each model.
1999  1.99  1.06  1.78
2000  2.01  1.04  1.75
2001  1.78  1.09  1.60
2002  2.57  0.61  1.19

1999  1.70  n/a   n/a
2000  1.61  n/a   n/a

1992  0.97  0.84  0.64
1993  2.00  0.82  1.41
1994  1.87  0.84  1.32
1995  1.99  0.82  1.39
1996  1.97  0.74  1.21
1997  1.88  0.71  1.11

1992  1.11  0.93  0.88
1993  1.59  0.73  1.01
1994  1.57  1.10  1.55
1995  1.46  1.13  1.54
1996  1.52  1.19  1.67
1997  1.58  1.04  1.46

1998  1.08  n/a   n/a
1999  1.01  n/a   n/a
2000  0.83  n/a   n/a
2001  1.06  n/a   n/a
2002  0.99  n/a   n/a
2003  1.05  n/a   n/a
2004  1.50  n/a   n/a

2000  2.39  2.08  4.61
2001  1.83  1.04  1.57
2002  2.01  0.36  0.57
1999  1.25  n/a   n/a

2000  0.84  n/a   n/a
2001  1.06  n/a   n/a
2002  1.21  n/a   n/a
"HRC systematically overstated its earnings by at least 1.4 billion"
HealthSouth Corp.
2082 (2004)
"Understated its subsidiary debt by $1.6 billion, overstated equity by at least $368 million"
Adelphia Communications Corp.
"The company misreported its net income in [2000, 2001 and 2002] by 30.5%, 23.9% and 42.9% respectively"
Federal Home Loan Mortgage Corporation
2728(2007)
"anticipated restatement of at least an $11 billion reduction of previously reported net income"
Federal National Mortgage Association
2433(2006)
"recognized approximately $3.8 billion of spurious revenue and fraudulently excluded $231 million in expenses"
Qwest Communications International
2613(2007)
2337(2005)
"used netting to eliminate approximately $490 million in current period operating expenses"
Waste Management, Inc.
Waste Management, Inc. Del
2313(2005)
Panel B: Summary of AB-score, F-score, and ABF-score model performance
Model AB-score F-score ABF-scoreABF-score with AB-score fill-in
Firm-years covered 57 41 41 57
# score > 1.0 46 26 31 45
# score < 1.0 11 15 10 12
Mean score 1.50 1.27 1.68 1.56
Median score 1.52 1.10 1.46 1.41
Correctly identified % 80.7% 45.6% 54.4% 78.9%
Type II error % 19.3% 36.6% 24.4% 21.1%
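The Panel B percentages follow directly from the counts. For example, using the AB-score column (57 firm-years covered, 46 above the 1.0 threshold) and the F-score column (41 covered, 26 above):

```python
def rates(covered, above, below, total_fraud_years=57):
    """Correctly identified % is taken over all 57 fraud firm-years; Type II
    error % is taken over only the firm-years the model can estimate."""
    correct_pct = 100.0 * above / total_fraud_years
    type_ii_pct = 100.0 * below / covered
    return round(correct_pct, 1), round(type_ii_pct, 1)

print(rates(57, 46, 11))  # AB-score column of Panel B
print(rates(41, 26, 15))  # F-score column of Panel B
```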
Figure 1: Leading digit frequency
This figure depicts the observed distribution of leading digits in the annual reports of all firms in the Compustat database 1979-2011 compared to the distribution predicted by Benford’s Law.
[Bar chart: frequency (y-axis, 0 to 0.35) of leading digits 1 through 9, comparing the Observed Distribution to Benford's distribution.]
Figure 2: Raw Benford score by number of inputs, industry, year, and firm
Panel A: B_Raw by number of financial statement items
Panel B: B_Raw by industry
This figure depicts the mean raw Benford score (B_Raw), which is the mean absolute deviation of the leading-digit distribution in financial statement numbers from the distribution predicted by Benford's Law, across four dimensions. Panel A shows the relationship between B_Raw and the number of inputs used to compute it. Panel B shows the mean B_Raw for each 2-digit SIC code. Panel C shows the relationship between B_Raw and the year in which the observation is drawn. Panel D shows mean B_Raw by firm.
[Panel A chart: median B_Raw (y-axis, 0 to 25) by number of financial statement items (x-axis, 1 to 241).]
[Panel B chart: median B_Raw (y-axis, 0 to 5) by 2-digit SIC code.]
Panel C: B_Raw by year
[Panel C chart: median B_Raw (y-axis, 2.95 to 3.40) by year, 1979 to 2011.]

Panel D: B_Raw by firm
Figure 3: Receiver operator characteristic (ROC) curves
Panel A: ABF-score versus F-score
The three charts present ROC curves for the AB-score, F-score, and ABF-score models. Each graph plots the True Positive Rate on the y-axis versus the False Positive Rate on the x-axis, for all possible thresholds. Panel A compares the ABF-score model to the F-score model for the firm-year observations for which both scores can be calculated; Panel B compares the ABF-score model to the AB-score model for the firm-year observations for which both scores can be calculated; and Panel C presents the ROC curve for the AB-score model for the firm-year observations for which the AB-score can be calculated but the F-score and ABF-score cannot.
Panel B: ABF-score versus AB-score
Panel C: AB-score for observations where F-score and ABF-score cannot be estimated
Figure 4: Distribution of scores for high-profile financial misconduct firm-years
This figure plots the frequency distribution of the firm-year AB-scores, F-Scores, and ABF-scores for the ten notorious misstatement cases, a total of 57 firm-year observations. The grouping labeled n/a reflects firm-year observations for which F-scores and ABF-scores could not be calculated.
[Grouped bar chart titled "Frequency Distribution of Firm-year Scores": number of firm-years (y-axis, 0 to 35) in each score range (< 0.7, 0.7 to 1.0, 1.0 to 1.3, > 1.3, n/a) for the AB-score, F-score, and ABF-score.]