Universality of Benford’s law

Universality of second order test.
benford
statistics
forensics
python
Published

April 2, 2024

By Nigrini’s definition, the second order test looks at relationships and patterns in data and is based on the digits of the differences between amounts that have been sorted from smallest to largest. The digit patterns of the differences are expected to closely approximate the digit frequencies of Benford’s Law. The second order test gives few (if any) false positives: if the results are not as expected (close to Benford), then the data do indeed have some characteristic that is rare and unusual, abnormal or irregular.

As described in the previous article, a mixture of approximate geometric sequences will produce a Benford Set.

What is a Benford Set?

A set of numbers that conforms closely to Benford’s Law is called a Benford Set.
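For reference, Benford’s Law puts the expected proportion of numbers whose leading digits are \(d\) at \(\log_{10}(1 + 1/d)\). The sketch below tabulates these proportions for the first digit and for the first two digits. The function name `benford_probability` is hypothetical and is only a stand-in for illustration; it is not the local BenfordExpectedProbability class used later in this article.

```python
import math


def benford_probability(d: int) -> float:
    """Expected proportion of numbers whose leading digits are d (1..9 or 10..99)."""
    return math.log10(1 + 1 / d)


# Expected proportions for the first digit (1..9) and the first two digits (10..99).
first_digit = {d: benford_probability(d) for d in range(1, 10)}
first_two = {d: benford_probability(d) for d in range(10, 100)}

for d, p in first_digit.items():
    print(f"{d}: {p:.4f}")
```

Both tables sum to one, which is a quick sanity check that the digit ranges are complete.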

A geometric sequence can be described as follows:

\[ S_n = ar^{n-1} \]

Here \(a\) is the first term, \(r\) is the common ratio, and \(n = 1, 2, \dots, N\) is the position of the term in the sequence.

The second order test is based on the differences \(D_n\) between successive elements of a geometric sequence:

\[ D_n = ar^{n} - ar^{n - 1} = a(r - 1)r^{n - 1} \]

Since the elements of this new sequence again form a geometric sequence (first term \(a(r - 1)\), common ratio \(r\)), their digits will also conform to Benford’s Law, and the \(N - 1\) differences will form a Benford Set.
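The argument above can be checked numerically. The sketch below builds a geometric sequence, takes its successive differences, confirms they match the closed form \(a(r - 1)r^{n - 1}\), and compares their first-digit frequencies with Benford’s Law. The values of `a`, `r` and `n` are arbitrary assumptions chosen for illustration, and `first_digit` is a hypothetical helper.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
a, r, n = 3.0, 1.01, 10_000

k = np.arange(n)
S = a * r**k          # S_1 .. S_N, i.e. a * r^(n-1) for n = 1..N
D = np.diff(S)        # the N - 1 differences between successive elements

# The differences match the closed form a(r - 1)r^(n-1).
closed_form = a * (r - 1) * r ** k[:-1]
matches_closed_form = bool(np.allclose(D, closed_form))


def first_digit(value: float) -> int:
    """Leading significant digit, read from scientific notation."""
    return int(f"{value:.15e}"[0])


digits = np.array([first_digit(v) for v in D])
observed = np.array([(digits == d).mean() for d in range(1, 10)])
expected = np.log10(1 + 1 / np.arange(1, 10))
max_deviation = float(np.abs(observed - expected).max())

print("closed form matches:", matches_closed_form)
print("largest first-digit deviation from Benford:", round(max_deviation, 4))
```

With a ratio this close to 1 the sequence spans dozens of orders of magnitude, which is why the observed proportions land so close to the expected ones.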

Nigrini makes the following statement:

If the data is made up of nondiscrete random variables drawn from any continuous distribution with a smooth density function (Uniform, Triangular, Normal or Gamma distributions), then the digit patterns of the \(N - 1\) differences between the ordered elements will be Almost Benford (meaning that the digit pattern will conform closely, but not exactly, to Benford’s Law).

Funnily enough, this also applies to data drawn from most of the continuous distributions encountered in practice.

Let’s check this out.

1 Normal distribution

Let’s import our libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

sns.set_style("darkgrid")

from src.BenfordExpectedProbability import BenfordExpectedProbability
from src.BenfordAnalysis import BenfordAnalysis

np.random.seed(13)

BenfordExpectedProbability and BenfordAnalysis are local classes that I used while writing the previous article.

Let’s draw from a normal distribution and plot the sample:

x = np.random.normal(100_000, 10_000, 100_000)

fig, ax = plt.subplots()

sns.histplot(x, ax=ax)
ax.set(title="Histogram of random samples from normal distribution")

plt.tight_layout()
plt.show()

Calculating first order differences through pandas is very easy:

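x_diff = (
    pd.Series(x)
    .sort_values()                     # order the sample from smallest to largest
    .diff()                            # differences between neighbouring values
    .dropna()                          # the smallest value has no predecessor
    .pipe(lambda a: a[a >= 0.0001])    # drop zero and near-zero differences
)

x_diff *= 100_000  # rescale so that every remaining difference is at least 10
x_diff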
6916     2.981607e+06
76912    7.386981e+08
91236    6.330016e+07
92269    6.214962e+06
76806    1.827014e+07
             ...     
75749    1.262661e+07
5266     2.617066e+07
6455     1.278739e+08
37773    7.745682e+07
12258    5.277675e+08
Length: 99973, dtype: float64

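Note that we multiply by a power of ten so that every difference has at least two digits before the decimal point: after dropping differences below 0.0001, multiplying by 100,000 makes each remaining value at least 10. (Nigrini multiplies by 100; I chose a greater number.)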
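A minimal sketch of how the first two digits can be read off such values. The helper `first_two_digits` is hypothetical; it is not part of the local BenfordAnalysis class, whose code is not shown here. It formats each number in scientific notation and picks the two leading significant digits, so it does not depend on which power of ten was used for rescaling.

```python
import pandas as pd


def first_two_digits(values: pd.Series) -> pd.Series:
    """Leading two significant digits (10..99) of each positive value."""
    digits = []
    for v in values:
        text = f"{v:.15e}"          # e.g. 2981607.0 -> "2.981607000000000e+06"
        digits.append(int(text[0] + text[2]))
    return pd.Series(digits, index=values.index)


# Three of the differences printed above.
example = pd.Series([2.981607e+06, 7.386981e+08, 6.330016e+07])
print(first_two_digits(example).tolist())  # [29, 73, 63]
```

Reading digits from the formatted string avoids the edge cases that `floor(log10(x))` arithmetic can hit for values sitting right next to a power of ten.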

After this, we can compare the first-two-digit frequencies of the differences with Benford’s Law:

The differences conform almost perfectly to Benford’s Law. The red columns (marking digit groups that do not conform to Benford’s Law) can be disregarded, since the deviation there is very small.

2 Uniform distribution

Same methodology, uniform distribution:

x = np.random.uniform(10_000, 100_000, 100_000)

A picture says a thousand words™:

3 Triangular distribution

Let’s run this test on a triangular distribution:

x = np.random.triangular(10_000, 50_000, 100_000, 100_000)

And we get this plot:

4 Gamma distribution

Finally, let’s check the gamma distribution:

x = np.random.gamma(10_000, 1_000, 100_000)

Will conformity fail?

No.
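The four experiments above can also be summarised with a single number per distribution. The sketch below repeats the sort, diff and filter steps for each sample and reports the mean absolute deviation (MAD) between the observed first-two-digit proportions and Benford’s expected ones. The function `second_order_mad` is hypothetical and is not the local BenfordAnalysis class. Because the random draws are consumed in a different order here, the samples will not be identical to the ones plotted above.

```python
import numpy as np
import pandas as pd

np.random.seed(13)
N = 100_000

# Same distribution parameters as in the sections above.
samples = {
    "normal": np.random.normal(100_000, 10_000, N),
    "uniform": np.random.uniform(10_000, 100_000, N),
    "triangular": np.random.triangular(10_000, 50_000, 100_000, N),
    "gamma": np.random.gamma(10_000, 1_000, N),
}

# Benford's expected proportions for the first two digits 10..99.
expected = pd.Series({d: np.log10(1 + 1 / d) for d in range(10, 100)})


def first_two(value: float) -> int:
    """Leading two significant digits, read from scientific notation."""
    text = f"{value:.15e}"
    return int(text[0] + text[2])


def second_order_mad(x: np.ndarray) -> float:
    """MAD between first-two-digit proportions of sorted differences and Benford."""
    diffs = pd.Series(x).sort_values().diff().dropna()
    diffs = diffs[diffs >= 0.0001]          # same threshold as in the text
    observed = (
        diffs.map(first_two)
        .value_counts(normalize=True)
        .reindex(expected.index, fill_value=0.0)
    )
    return float((observed - expected).abs().mean())


mad = {name: second_order_mad(x) for name, x in samples.items()}
for name, value in mad.items():
    print(f"{name:>10}: MAD = {value:.5f}")
```

No rescaling by 100,000 is needed in this version, because the digits are taken from scientific notation rather than from the integer part.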

5 Second order test on real world data

We can use Nigrini’s invoices to see how the test behaves with real-world data.

            ID  SUPPLIER        DATE       INVOICE  AMOUNT
0            1      2001  2010-01-01       4242J10   25.19
1            2      2001  2010-01-01       7899J10   25.86
2            3      2001  2010-01-01       3830J10   26.57
3            4      2001  2010-01-01       9514J10   27.83
4            5      2001  2010-01-01       6296J10   28.09
...        ...       ...         ...           ...     ...
189465  189466     52935  2010-07-01  270221266736   33.46
189466  189467     52936  2010-07-01  270348386110   61.52
189467  189468     52937  2010-02-01  271253401514   12.36
189468  189469     52938  2010-02-01  261715090450    8.02
189469  189470     52939  2010-02-01  270241460335   16.30

189470 rows × 5 columns

After applying the identical methodology to the AMOUNT column, we can plot conformity with Benford’s Law.

We can see that the analyst should check all invoices where the first two digits of the differences between sorted invoice amounts are 10, 19, 20, 29, 30 … and so on (all red columns). The most critical case is 99.
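A sketch of how such invoices could be pulled out for review. It uses only the ten AMOUNT values visible in the table above as a stand-in for the full data set, and the names `first_two_digits` and `suspect_digits` are hypothetical. One caveat, offered as an assumption rather than a finding about this data: amounts are recorded in cents, so their differences are multiples of 0.01, and binary floating point can turn a difference such as 0.20 into 0.19999…, which moves a count from 20 to 19. The sketch therefore works in integer cents, which is a swapped-in variation and not the rescaling by 100,000 used earlier.

```python
import pandas as pd


def first_two_digits(value: float) -> int:
    """Leading two significant digits, read from scientific notation."""
    text = f"{value:.15e}"
    return int(text[0] + text[2])


# Stand-in sample: the ten AMOUNT values shown in the table above.
amounts = pd.Series(
    [25.19, 25.86, 26.57, 27.83, 28.09, 33.46, 61.52, 12.36, 8.02, 16.30],
    name="AMOUNT",
)

# Convert to integer cents so float artifacts cannot shift the digits.
cents = (amounts * 100).round().astype("int64")
diff_cents = cents.sort_values().diff().dropna()
diff_cents = diff_cents[diff_cents > 0]      # ignore ties between equal amounts
digits = diff_cents.map(first_two_digits)

# Only the digit groups named explicitly in the text; the full list of red
# columns is not given there, so this set is deliberately incomplete.
suspect_digits = {10, 19, 20, 29, 30, 99}

# Each difference keeps the index of the larger amount in its pair.
flagged = amounts.loc[digits[digits.isin(suspect_digits)].index]

print(digits.tolist())
print(flagged)
```

On this tiny sample nothing is flagged; on the full invoice table the same filter would return the rows behind the red columns named in the set.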