import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
sns.set_style("darkgrid")
from src.BenfordExpectedProbability import BenfordExpectedProbability
from src.BenfordAnalysis import BenfordAnalysis
np.random.seed(13)
By Nigrini’s definition, the second-order test looks at relationships and patterns in data and is based on the digits of the differences between amounts that have been sorted from smallest to largest. The digit patterns of the differences are expected to closely approximate the digit frequencies of Benford’s Law. The second-order test gives few (if any) false positives: if the results are not as expected (close to Benford), then the data do indeed have some characteristic that is rare, unusual, abnormal or irregular.
As described in the previous article, a mixture of approximate geometric sequences will produce a Benford Set.
A set of numbers that conforms closely to Benford’s Law is called a Benford Set.
A geometric sequence can be described as follows:
\[ S_n = ar^{n-1} \]
The symbols have the following meaning:
- \(S_n\) => the \(n\)th member of the geometric sequence.
- \(a\) => the first term of the geometric sequence.
- \(r\) => the common ratio, i.e. the \((n + 1)^{st}\) element divided by the \(n\)th element.
The second-order test is based on the differences \(D_n\) between successive elements of a geometric sequence:
\[ D_n = ar^{n} - ar^{n - 1} = a(r - 1)r^{n - 1} \]
Since the elements of this new sequence also form a geometric sequence (first term \(a(r - 1)\), common ratio \(r\)), their digits will also conform to Benford’s Law and the \(N - 1\) differences will form a Benford Set.
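To make this step concrete, here is a minimal numerical sketch. The values of a, r and n_terms are hypothetical and chosen only for illustration; the point is simply to check that the differences of a geometric sequence again share the common ratio r and start at a(r - 1).

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
a, r, n_terms = 3.0, 1.05, 200

n = np.arange(1, n_terms + 1)
s = a * r ** (n - 1)   # S_n = a * r^(n-1)
d = np.diff(s)         # D_n = S_(n+1) - S_n, one element fewer than s

# Ratio of successive differences; for a geometric sequence this is r again.
ratios = d[1:] / d[:-1]

print(np.allclose(ratios, r))          # do the differences share the common ratio r?
print(np.isclose(d[0], a * (r - 1)))   # is the first difference a * (r - 1)?
```

Both checks rely only on the formula for \(D_n\) given above, so they should hold for any positive a and any r other than 1.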
Nigrini makes the following statement:
If the data is made up of nondiscrete random variables drawn from any continuous distribution with a smooth density function (Uniform, Triangular, Normal or Gamma distributions), then the digit patterns of the \(N - 1\) differences between the ordered elements will be Almost Benford (meaning that the digit pattern will conform closely, but not exactly, to Benford’s Law).
Funnily enough, this also applies to data drawn from most of the continuous distributions encountered in practice.
Let’s check this out.
1 Normal distribution
Let’s import our libraries. BenfordExpectedProbability and BenfordAnalysis are local classes that I used while writing the previous article.
Let’s draw from a normal distribution and plot the sample:
x = np.random.normal(100_000, 10_000, 100_000)

fig, ax = plt.subplots()
sns.histplot(x, ax=ax)
ax.set(title="Histogram of random samples from normal distribution")
plt.tight_layout()
plt.show()
Calculating first-order differences with pandas is very easy:
x_diff = (
    pd.Series(x)
    .sort_values()
    .diff()
    .dropna()
    .pipe(lambda a: a[a >= 0.0001])
)
x_diff *= 100_000
x_diff
6916 2.981607e+06
76912 7.386981e+08
91236 6.330016e+07
92269 6.214962e+06
76806 1.827014e+07
...
75749 1.262661e+07
5266 2.617066e+07
6455 1.278739e+08
37773 7.745682e+07
12258 5.277675e+08
Length: 99973, dtype: float64
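Note that we multiply by a multiple of 100 so that we can read off the first two digits (Nigrini multiplies by 100; I chose a larger factor, 100,000). Because differences smaller than 0.0001 were filtered out, every scaled difference is at least 10.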
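The digit extraction itself is handled by the local BenfordAnalysis class, which is not shown here. As a rough stand-in, the sketch below shows one way to get the first two digits and the expected Benford proportions. The function names are hypothetical and are not part of that class. The expected proportion for the two-digit group d is the standard \(\log_{10}(1 + 1/d)\).

```python
import numpy as np
import pandas as pd


def first_two_digits(values):
    """Return the first two significant digits (10..99) of positive numbers."""

    def two_digits(value):
        # Scientific notation looks like "d.ddd...e+XX": characters 0 and 2
        # are the first two significant digits.
        text = f"{value:.16e}"
        return int(text[0] + text[2])

    return pd.Series(values).map(two_digits)


def benford_first_two_expected():
    """Expected Benford proportions for the first-two-digit groups 10..99."""
    digits = np.arange(10, 100)
    return pd.Series(np.log10(1 + 1 / digits), index=digits)


def first_two_digit_frequencies(values):
    """Observed proportion of each first-two-digit group 10..99."""
    counts = first_two_digits(values).value_counts()
    return counts.reindex(np.arange(10, 100), fill_value=0) / len(values)
```

Calling first_two_digit_frequencies(x_diff) and subtracting benford_first_two_expected() would give the per-group deviations that a conformity plot visualises. Because the helper works on significant digits, it does not strictly need the scaling step, which is kept in the article to follow Nigrini's procedure.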
After this, we can see that these differences conform almost perfectly to Benford’s Law. The red columns (which mark digit groups that do not conform to Benford’s Law) can be disregarded, since the deviations are very small.
2 Uniform distribution
Same methodology, uniform distribution:
x = np.random.uniform(10_000, 100_000, 100_000)
A picture says a thousand words™:
3 Triangular distribution
Let’s run this test on a triangular distribution:
x = np.random.triangular(10_000, 50_000, 100_000, 100_000)
And we get this plot:
4 Gamma distribution
Finally, let’s check the gamma distribution:
x = np.random.gamma(10_000, 1_000, 100_000)
Will conformity fail?
No.
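For readers who want a number rather than a plot, the four experiments can be repeated in one loop. This is only a sketch under stated assumptions: it uses np.random.default_rng instead of the seeded global generator, so the draws differ from the ones plotted above. It also summarises conformity with the mean absolute deviation (MAD) between observed and expected first-two-digit proportions, a statistic this article does not compute. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def first_two_digits(value):
    """First two significant digits of a positive number, read from scientific notation."""
    text = f"{value:.16e}"
    return int(text[0] + text[2])


def benford_mad(sample, min_diff=0.0001):
    """MAD between first-two-digit proportions of sorted differences and Benford's Law."""
    diffs = pd.Series(sample).sort_values().diff().dropna()
    diffs = diffs[diffs >= min_diff]  # same filter as in the article
    digits = np.arange(10, 100)
    observed = (
        diffs.map(first_two_digits)
        .value_counts(normalize=True)
        .reindex(digits, fill_value=0.0)
    )
    expected = np.log10(1 + 1 / digits)
    return float(np.abs(observed.to_numpy() - expected).mean())


rng = np.random.default_rng(13)  # assumption: a separate generator, not np.random.seed
n = 100_000
samples = {
    "normal": rng.normal(100_000, 10_000, n),
    "uniform": rng.uniform(10_000, 100_000, n),
    "triangular": rng.triangular(10_000, 50_000, 100_000, n),
    "gamma": rng.gamma(10_000, 1_000, n),
}
mad_by_distribution = {name: benford_mad(sample) for name, sample in samples.items()}
print(mad_by_distribution)
```

A small MAD for every distribution would be consistent with the Almost Benford behaviour seen in the plots.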
5 Second order test on real world data
We can use Nigrini’s invoices to see how the test behaves with real-world data.
| | ID | SUPPLIER | DATE | INVOICE | AMOUNT |
|---|---|---|---|---|---|
| 0 | 1 | 2001 | 2010-01-01 | 4242J10 | 25.19 |
| 1 | 2 | 2001 | 2010-01-01 | 7899J10 | 25.86 |
| 2 | 3 | 2001 | 2010-01-01 | 3830J10 | 26.57 |
| 3 | 4 | 2001 | 2010-01-01 | 9514J10 | 27.83 |
| 4 | 5 | 2001 | 2010-01-01 | 6296J10 | 28.09 |
| ... | ... | ... | ... | ... | ... |
| 189465 | 189466 | 52935 | 2010-07-01 | 270221266736 | 33.46 |
| 189466 | 189467 | 52936 | 2010-07-01 | 270348386110 | 61.52 |
| 189467 | 189468 | 52937 | 2010-02-01 | 271253401514 | 12.36 |
| 189468 | 189469 | 52938 | 2010-02-01 | 261715090450 | 8.02 |
| 189469 | 189470 | 52939 | 2010-02-01 | 270241460335 | 16.30 |
189470 rows × 5 columns
After applying the identical methodology to the AMOUNT column, we can plot conformity with Benford’s Law.
We can see that the analyst should check all invoices where the first two digits of the differences between sorted invoice amounts are 10, 19, 20, 29, 30 … and so on (all red columns). The most critical case is 99.
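To follow up on a flagged digit group, the analyst needs a way back from digits to invoices. The sketch below is one possible approach, not the code used for the plot. It assumes amounts with two decimals, scales by 100 as Nigrini does, and rounds after scaling so that floating-point noise (for example 0.67 * 100 evaluating to 66.99...) does not shift the digits. The function name and digits_to_review are hypothetical, and the toy frame reuses only the first five rows of the table above.

```python
import pandas as pd


def second_order_digits(amounts, scale=100, min_diff=0.0001):
    """First two digits of the scaled differences between sorted amounts."""
    diffs = pd.Series(amounts).sort_values().diff().dropna()
    diffs = diffs[diffs >= min_diff]
    # Assumption: amounts have two decimals, so scaled differences are whole numbers.
    scaled = (diffs * scale).round()

    def two_digits(value):
        text = f"{value:.16e}"  # "d.ddd...e+XX"
        return int(text[0] + text[2])

    return scaled.map(two_digits)


# Toy data: the first five rows of the invoice table shown above.
invoices = pd.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "AMOUNT": [25.19, 25.86, 26.57, 27.83, 28.09]}
)

digits = second_order_digits(invoices["AMOUNT"])

# Hypothetical list of digit groups the analyst wants to inspect.
digits_to_review = [12]
flagged = invoices.loc[digits[digits.isin(digits_to_review)].index]
print(flagged)
```

Each flagged row is the larger invoice of a sorted pair, so in practice the analyst would also look at the invoice immediately below it in sorted order.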