How to Spot Bad Data: Benford’s Law
TRANSCRIPT
How to Spot Bad Data
Peter O’Reilly
Theory & application
Simon Newcomb
Frank Benford
“mere curious observation”
Chuck Zlotnick/Warner Brothers Pictures
$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$
First non-zero digit, d    Probability according to Benford’s Law, P(d)
1                          0.3010
2                          0.1761
3                          0.1249
4                          0.0969
5                          0.0792
6                          0.0669
7                          0.0580
8                          0.0512
9                          0.0458
Total                      1.0000
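The probabilities in this table follow directly from the formula P(d) = log10(1 + 1/d). A minimal Python sketch that recomputes them; the function name benford_probability is illustrative and not part of the original slides:

```python
import math


def benford_probability(d: int) -> float:
    """Expected share of values whose first significant digit is d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("d must be a digit from 1 to 9")
    return math.log10(1 + 1 / d)


# Expected probability for each possible first digit.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}  {p:.4f}")
print(f"Total  {sum(expected.values()):.4f}")
```

The nine probabilities sum to 1 because the terms (d + 1)/d multiply out to 10 across d = 1 to 9, and log10(10) = 1.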
First Significant Digit Law
a.k.a. Benford's Law
SIGNIFICANT DIGIT
• All non-zero digits are significant:
1, 2, 3, 4, 5, 6, 7, 8, 9
• Zero digits between non-zero digits
are significant:
305, 6002, 70008
• Leading zeros are never significant:
0.01, 0.000424
• In a number with a decimal point, trailing
zeros are significant:
1.01000, 2.200, 36.5400
Red digits are significant
4210
505
2190.30
0.09
0.23
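For Benford’s Law only the first significant digit matters, so the rules above reduce to "skip the sign, the decimal point, and any leading zeros". A small Python sketch of that step, applied to the example values on this slide; the helper name first_significant_digit is an assumption made for illustration:

```python
from decimal import Decimal


def first_significant_digit(value) -> int:
    """Return the first non-zero digit of a number, ignoring sign and leading zeros."""
    # Decimal keeps the digits exactly as written, e.g. "0.09" -> (9,)
    digits = Decimal(str(value)).as_tuple().digits
    for digit in digits:
        if digit != 0:
            return digit
    raise ValueError("no significant digit found (value is zero or not a number)")


for example in ["4210", "505", "2190.30", "0.09", "0.23"]:
    print(example, "->", first_significant_digit(example))
```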
Best applied to data with
• Random sampling
• Large sample size
• Sufficient variability
• No bounded maximum value
• Counting- or measurement-based
numbers
No–Go for
• Sequentially assigned numbers: e.g.
check numbers, invoice numbers,
purchase order numbers
• Where numbers are influenced by
human thought: e.g. psychological
price setting thresholds ($9.99)
• Accounts with a large number of
firm-specific numbers: e.g.
accounts set up to record $10
refunds
• Accounts with a minimum or maximum
=LEFT(text,[num_chars])
LEFT returns the first
character or characters in a
text string, based on the
number of characters you
specify.
=COUNTIF(range, criteria)
COUNTIF function counts the
number of cells within a
range that meet a single
criterion that you specify.
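The same two steps can be sketched outside Excel. The Python below mirrors =LEFT() and =COUNTIF() on a short list of amounts. The amounts are hypothetical, and the cell references in the comments (A2, B:B) are only an assumed worksheet layout. Like LEFT on a cell, taking the first character only works as a first-digit test when every value is positive and at least 1.

```python
from collections import Counter

# Hypothetical amounts standing in for a worksheet column (not from the slides).
amounts = [1250.00, 310.75, 18.20, 942.10, 127.00, 2675.40, 1.99, 450.00]

# Step 1, like =LEFT(A2,1): take the first character of each value as text.
leading = [str(amount)[0] for amount in amounts]

# Step 2, like =COUNTIF(B:B,"1") through =COUNTIF(B:B,"9"): tally each digit.
counts = Counter(leading)

for digit in "123456789":
    print(digit, counts[digit])
```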
Digit    Count    Actual Frequency    Expected Frequency
1        1,402    29.40%              30.10%
2        909      19.06%              17.61%
3        587      12.31%              12.49%
4        459      9.63%               9.69%
5        382      8.01%               7.92%
6        285      5.98%               6.69%
7        281      5.89%               5.80%
8        258      5.41%               5.12%
9        205      4.30%               4.58%
Totals   4,768    100%                100%
values generated using Excel’s RAND() function
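The Actual and Expected Frequency columns can be reproduced from the counts alone. A short Python sketch using the digit counts from the table above; the extra "gap" column (actual minus expected) is an addition for illustration:

```python
import math

# Digit counts copied from the table above.
counts = {1: 1402, 2: 909, 3: 587, 4: 459, 5: 382, 6: 285, 7: 281, 8: 258, 9: 205}
total = sum(counts.values())

rows = []
for d, n in counts.items():
    actual = n / total                   # observed share of this digit
    expected = math.log10(1 + 1 / d)     # Benford's Law share
    rows.append((d, n, actual, expected, actual - expected))

for d, n, actual, expected, gap in rows:
    print(f"{d}  {n:>5,}  {actual:7.2%}  {expected:7.2%}  {gap:+.2%}")
```

Plotting the actual and expected columns side by side as a bar chart makes any digit that stands out easy to see.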
SQL Example (database query)
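-- Count how many deposits start with each digit.
-- LEFT() on a numeric column relies on implicit conversion to text;
-- some databases need an explicit cast first.
-- Amounts that are negative or below 1 return '-' or '0' instead of a digit.
SELECT
    LEFT(deposit_amount, 1) AS Digit,
    COUNT(LEFT(deposit_amount, 1)) AS Digit_Count
FROM
    revenue_tax_collection
GROUP BY
    LEFT(deposit_amount, 1)
ORDER BY 1;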
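To try the query without a database server, it can be run against an in-memory SQLite table from Python. This is only a sketch under stated assumptions: the deposit values are hypothetical, and because SQLite has no LEFT() function the query swaps in substr() with an explicit cast to text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue_tax_collection (deposit_amount REAL)")

# Hypothetical deposits for illustration only (not from the slides).
conn.executemany(
    "INSERT INTO revenue_tax_collection (deposit_amount) VALUES (?)",
    [(1250.00,), (310.75,), (18.20,), (942.10,), (127.00,), (2675.40,)],
)

# substr(text, 1, 1) plays the role of LEFT(text, 1) in SQLite.
query = """
    SELECT substr(CAST(deposit_amount AS TEXT), 1, 1) AS Digit,
           COUNT(*) AS Digit_Count
    FROM revenue_tax_collection
    GROUP BY Digit
    ORDER BY 1
"""
rows = conn.execute(query).fetchall()
for digit, digit_count in rows:
    print(digit, digit_count)
conn.close()
```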
Recap
1. =LEFT()
2. =COUNTIF()
3. Plot bar chart
Further considerations
• 2nd significant digit
• Chi-square test (χ²)
• Not absolute proof
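As a rough illustration of the chi-square test, the sketch below compares the digit counts from the frequency table earlier in the deck with the counts Benford’s Law would predict. The 5% significance level is an assumption chosen for illustration; 15.507 is the standard chi-square critical value for 8 degrees of freedom (nine digits minus one) at that level.

```python
import math

# Observed first-digit counts from the frequency table earlier in the deck.
observed = {1: 1402, 2: 909, 3: 587, 4: 459, 5: 382, 6: 285, 7: 281, 8: 258, 9: 205}
n = sum(observed.values())

# Chi-square statistic: sum of (observed - expected)^2 / expected over the nine digits.
chi_square = 0.0
for d, count in observed.items():
    expected_count = n * math.log10(1 + 1 / d)
    chi_square += (count - expected_count) ** 2 / expected_count

CRITICAL_VALUE = 15.507  # chi-square, 8 degrees of freedom, 5% significance (assumed level)

print(f"chi-square = {chi_square:.2f}")
if chi_square > CRITICAL_VALUE:
    print("Digit pattern departs from Benford's Law at the 5% level")
else:
    print("No significant departure from Benford's Law at the 5% level")
```

Either outcome is a prompt for further review, not a verdict: a large statistic flags data worth a closer look, and a small one does not prove the data are clean.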
Peter O’Reilly, MBA, CMFO, CTC, QPA
Red Bank CFO, former Jersey City
Treasurer, Pension Actuary, Finance I.T.
Definitive Guide to Local Public Finance in
New Jersey, 2019 publication, available at:
njcmfo.com
Credits for copyrighted and public-domain images, listed to
comply with their respective terms of use:
{source, image description (slide deck page)}
• pixabay.com
• “FAKE” (2), Fraud (2), file cabinets (4), thumbs up
(17, 25), thumbs down (18, 26), curved arrow (28), left
arrow (19), finger counting (21)
• wikipedia.com
• Islamic Republic of Iran flags and presidential
candidates (8), Simon Newcomb (6)
• s9.com
• Frank Benford (7)
• Microsoft.com (https://www.microsoft.com/en-us/legal/intellectualproperty/permissions/default.aspx)
• Microsoft Excel logo (5, 19, 21)
• Chuck Zlotnick/Warner Brothers Pictures, https://www.thewrap.com/accountant-adds-up-real-review-ben-affleck/
• The Accountant movie screen shot (11)