How to Spot Bad Data: Benford’s Law

Post on 21-Apr-2022


How to Spot Bad Data

Peter O’Reilly

Theory & application

Simon Newcomb

Frank Benford

“mere curious observation”

Chuck Zlotnick/Warner Brothers Pictures

$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$

First non-zero digit, d    Probability according to Benford’s Law, P(d)
1                          0.3010
2                          0.1761
3                          0.1249
4                          0.0969
5                          0.0792
6                          0.0669
7                          0.0580
8                          0.0512
9                          0.0458
Total                      1.0000
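The table can be reproduced directly from the formula P(d) = log10(1 + 1/d). The following is a minimal sketch; the function and variable names are illustrative and do not come from the slides.

```python
import math


def benford_first_digit_probability(d: int) -> float:
    """Expected share of values whose first significant digit is d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Expected distribution for digits 1 through 9 (illustrative name).
BENFORD_EXPECTED = {d: benford_first_digit_probability(d) for d in range(1, 10)}

if __name__ == "__main__":
    for digit, probability in BENFORD_EXPECTED.items():
        print(f"{digit}  {probability:.4f}")
    print(f"Total  {sum(BENFORD_EXPECTED.values()):.4f}")
```

The nine probabilities sum to 1 because the terms telescope: the sum of log10((d + 1)/d) for d = 1 to 9 equals log10(10).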

First Significant Digit Law, a.k.a. Benford's Law

SIGNIFICANT DIGIT

• All non-zero digits are significant: 1, 2, 3, 4, 5, 6, 7, 8, 9

• Zero digits between non-zero digits are significant: 305, 6002, 70008

• Leading zeros are never significant: 0.01, 0.000424

• In a number with a decimal point, trailing zeros are significant: 1.01000, 2.200, 36.5400

Red digits are significant

4210

505

2190.30

0.09

0.23
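For Benford's Law only the first significant digit matters, so leading zeros and the sign have to be skipped before the digit is read. One possible way to do that is sketched below; the function name is hypothetical, and going through `Decimal(str(value))` is a design choice of this sketch, not something the slides prescribe.

```python
from decimal import Decimal


def first_significant_digit(value) -> int:
    """Return the first non-zero digit of a number, ignoring sign and leading zeros."""
    # Decimal stores the digits of the coefficient without the sign or the
    # decimal point, so 0.09 becomes (9,) and 4210 becomes (4, 2, 1, 0).
    digits = Decimal(str(value)).as_tuple().digits
    for digit in digits:
        if digit != 0:
            return digit
    raise ValueError("zero has no significant digit")


if __name__ == "__main__":
    for example in (4210, 505, 2190.30, 0.09, 0.23):
        print(example, "->", first_significant_digit(example))
```

Passing amounts as strings (for example "0.000424") avoids any floating-point rounding before the digit is extracted.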

Data: best application for

• Random sampling

• Large sample size

• Sufficient variability

• No bounded maximum value

• Counting or measuring based numbers

No-go for

• Sequentially assigned numbers: e.g. check numbers, invoice numbers, purchase order numbers

• Where numbers are influenced by human thought: e.g. psychological price setting thresholds ($9.99)

• Accounts with a large number of firm-specific numbers: e.g. accounts set up to record $10 refunds

• Accounts with a minimum or maximum

=LEFT(text,[num_chars])

LEFT returns the first character or characters in a text string, based on the number of characters you specify.

=COUNTIF(range, criteria)

COUNTIF counts the number of cells within a range that meet a single criterion that you specify.
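The same two steps can be mimicked outside a spreadsheet. The sketch below uses rough Python analogues of LEFT and COUNTIF; the helper names and the sample amounts are hypothetical and serve only to illustrate the tally.

```python
def leading_characters(text: str, num_chars: int = 1) -> str:
    """Rough analogue of Excel's LEFT(text, [num_chars])."""
    return text[:num_chars]


def count_if(values, criterion) -> int:
    """Rough analogue of COUNTIF(range, criteria) for an exact-match criterion."""
    return sum(1 for value in values if value == criterion)


# Hypothetical amounts stored as text; all are positive and at least 1,
# so the first character is also the first significant digit.
amounts = ["1250.00", "312.75", "1875.10", "94.20", "2210.00", "118.40"]

first_digits = [leading_characters(amount) for amount in amounts]
digit_counts = {str(d): count_if(first_digits, str(d)) for d in range(1, 10)}

if __name__ == "__main__":
    for digit, count in digit_counts.items():
        print(digit, count)
```

Taking the leftmost character only works when no value starts with a zero, a minus sign, or a currency symbol; amounts below 1 such as 0.09 would need their leading zeros skipped first.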


Digit    Count    Actual Frequency    Expected Frequency
1        1,402    29.40%              30.10%
2        909      19.06%              17.61%
3        587      12.31%              12.49%
4        459      9.63%               9.69%
5        382      8.01%               7.92%
6        285      5.98%               6.69%
7        281      5.89%               5.80%
8        258      5.41%               5.12%
9        205      4.30%               4.58%
Totals   4,768    100%                100%

values generated using Excel’s RAND() function
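The Actual Frequency column is each digit's count divided by the total, and the Expected Frequency column is log10(1 + 1/d). A short sketch that recomputes both from the counts in the table follows; the variable names are illustrative.

```python
import math

# Counts per first digit, copied from the table.
observed_counts = {
    1: 1402, 2: 909, 3: 587, 4: 459, 5: 382,
    6: 285, 7: 281, 8: 258, 9: 205,
}

total = sum(observed_counts.values())

# One row per digit: (digit, count, actual share, expected share, gap).
rows = []
for digit, count in observed_counts.items():
    actual = count / total
    expected = math.log10(1 + 1 / digit)
    rows.append((digit, count, actual, expected, actual - expected))

if __name__ == "__main__":
    for digit, count, actual, expected, gap in rows:
        print(f"{digit}  {count:>5,}  {actual:7.2%}  {expected:7.2%}  {gap:+.2%}")
    print(f"Totals  {total:,}")
```

Sorting the rows by the absolute gap is a quick way to see which digits deviate most before plotting the bar chart.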

SQL Example (database query)

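-- Count how often each leading digit occurs.
-- Assumes every deposit_amount is positive and at least 1, so that its
-- first character is also its first significant digit.
SELECT
    LEFT(deposit_amount, 1) AS Digit,
    COUNT(LEFT(deposit_amount, 1)) AS Digit_Count
FROM revenue_tax_collection
GROUP BY LEFT(deposit_amount, 1)
ORDER BY 1;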
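To try the query without a database server, it can be run against an in-memory SQLite database from Python. This is only a sketch under stated assumptions: the rows are made up, the column is assumed to be numeric, and because SQLite has no LEFT() function the query uses substr() on a text cast instead.

```python
import sqlite3

# Hypothetical sample rows; the table and column names follow the slide.
sample_deposits = [(1250.00,), (312.75,), (1875.10,), (94.20,), (2210.00,), (118.40,)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue_tax_collection (deposit_amount REAL)")
conn.executemany("INSERT INTO revenue_tax_collection VALUES (?)", sample_deposits)

# substr(text, 1, 1) plays the role of LEFT(text, 1) in SQLite.
query = """
SELECT
    substr(CAST(deposit_amount AS TEXT), 1, 1) AS Digit,
    COUNT(*) AS Digit_Count
FROM revenue_tax_collection
GROUP BY 1
ORDER BY 1;
"""

digit_counts = conn.execute(query).fetchall()
conn.close()

if __name__ == "__main__":
    for digit, count in digit_counts:
        print(digit, count)
```

As with the spreadsheet approach, amounts below 1 or negative amounts would need extra handling, since their first character is "0" or "-".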

Recap

1. =LEFT()

2. =COUNTIF()

3. Plot bar chart

Further considerations

• 2nd significant digit

• Chi-square (χ²) test

• Not absolute proof
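The first two considerations can be sketched in a few lines. The second-digit probabilities below use the standard generalization of Benford's Law, and the chi-square statistic is the usual goodness-of-fit sum over the nine first digits. The function names and the choice of a 5% significance level are assumptions of this sketch; the counts are the ones from the frequency table earlier in the deck.

```python
import math


def first_digit_expected(d: int) -> float:
    """Benford probability that the first significant digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)


def second_digit_expected(d: int) -> float:
    """Benford probability that the second significant digit is d (0 to 9)."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))


def chi_square_statistic(observed_counts: dict) -> float:
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for digit, observed in observed_counts.items():
        expected = total * first_digit_expected(digit)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Chi-square critical value for 8 degrees of freedom (9 digits minus 1)
# at the 5% significance level.
CRITICAL_VALUE_5PCT_DF8 = 15.507

observed = {
    1: 1402, 2: 909, 3: 587, 4: 459, 5: 382,
    6: 285, 7: 281, 8: 258, 9: 205,
}
statistic = chi_square_statistic(observed)

if __name__ == "__main__":
    print(f"chi-square statistic: {statistic:.2f}")
    print("deviates from Benford at 5%:", statistic > CRITICAL_VALUE_5PCT_DF8)
```

A statistic below the critical value means the data show no significant departure from Benford's Law at that level. A statistic above it flags the data for closer review, which, as the last bullet says, is not absolute proof of a problem.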

Peter O’Reilly, MBA, CMFO, CTC, QPA

Red Bank CFO, former Jersey City Treasurer, Pension Actuary, Finance I.T.

Definitive Guide to Local Public Finance in New Jersey, 2019 publication, available at: njcmfo.com

peter@njcmfo.com

References for copyrighted and public domain images, to comply with the respective terms of public use:

{source, image description (slide deck page)}

• pixabay.com

• “FAKE” (2), fraud (2), file cabinets (4), thumbs up (17, 25), thumbs down (18, 26), curved arrow (28), left arrow (19), finger counting (21)

• wikipedia.com

• Islamic Republic of Iran flags and presidential candidates (8), Simon Newcomb (6)

• s9.com

• Frank Benford (7)

• Microsoft.com (https://www.microsoft.com/en-us/legal/intellectualproperty/permissions/default.aspx)

• Microsoft Excel logo (5, 19, 21)

• Chuck Zlotnick/Warner Brothers Pictures, https://www.thewrap.com/accountant-adds-up-real-review-ben-affleck/

• The Accountant movie screen shot (11)
