Efficient Samples for Control and Audit

Spreadsheet Formulae

For error rates in streams of transactions, a very simple type of mathematical curve can represent our views about the error rate, and it is easy to calculate how that curve changes as sample items are tested. The curve is called a Beta distribution and is conveniently built into Microsoft Excel as BETADIST( ). It is also available in other commonly used software.

BETADIST( ) gives you the probability that the error rate is below a given level for given values of two parameters that are set by your sample results. For example, BETADIST(0.05, 3+1, 97+1) gives you the probability that the actual error rate is less than 5 percent once you have tested 100 and found three errors. It's just under 75 percent.
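For readers working outside a spreadsheet, the same figure can be reproduced with a few lines of Python. This is only a sketch: the helper name beta_cdf is made up for illustration, and it relies on a standard identity that holds when both parameters are whole numbers (as they are when they come from counts of errors and clean items).

```python
from math import comb

def beta_cdf(x, a, b):
    """P(error rate <= x) for a Beta(a, b) curve; a and b must be whole numbers."""
    n = a + b - 1
    # Identity: the Beta(a, b) CDF at x equals P(Binomial(n, x) >= a),
    # computed here as 1 minus the lower tail (k = 0 .. a-1).
    return 1 - sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a))

errors, clean = 3, 97  # 100 items tested, three errors found
probability = beta_cdf(0.05, errors + 1, clean + 1)
print(f"{probability:.3f}")  # just under 0.75, as in the text
```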

To create graphs such as the ones above, you need to use BETADIST lots of times to convert the cumulative probabilities given by this function into approximate probability densities. For example, to find the data point at an error rate of 5 percent, you could use the following formula.

(BETADIST(0.05, 4, 98) - BETADIST(0.049, 4, 98) ) / 0.001
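As a rough check on this differencing trick, the sketch below repeats it in Python and compares the result with the exact density of the Beta curve. The function names and the 0.001 step are illustrative choices, and the helper assumes whole-number parameters.

```python
from math import comb

def beta_cdf(x, a, b):
    """Cumulative Beta(a, b) probability at x, for whole-number a and b."""
    n = a + b - 1
    return 1 - sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a))

def approx_density(x, a, b, step=0.001):
    """Difference of two cumulative values divided by the step, as in the formula above."""
    return (beta_cdf(x, a, b) - beta_cdf(x - step, a, b)) / step

def exact_density(x, a, b):
    """Exact Beta(a, b) density for whole-number a and b."""
    return a * comb(a + b - 1, a) * x**(a - 1) * (1 - x)**(b - 1)

# One data point per 0.1 percent of error rate, from 0.1 percent to 15 percent.
points = [(i / 1000, approx_density(i / 1000, 4, 98)) for i in range(1, 151)]

print(approx_density(0.05, 4, 98), exact_density(0.05, 4, 98))
```

The two values differ slightly because the difference really estimates the density midway between 4.9 and 5 percent. A smaller step narrows the gap.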

The formulae given above assume you started out thinking all error rates were equally likely. If this was not your initial view, then the technique is to act almost as if you have already tested a sample, starting from the "all equally likely" assumption.
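One way to picture this is to add the imagined earlier sample to the real one before calculating. In the sketch below, the prior_errors and prior_clean parameters and the figures of 1 error in 50 items are purely hypothetical, and the helper assumes whole-number counts (a spreadsheet function would also accept fractional values).

```python
from math import comb

def beta_cdf(x, a, b):
    """Cumulative Beta(a, b) probability at x, for whole-number a and b."""
    n = a + b - 1
    return 1 - sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a))

def prob_rate_below(limit, errors, clean, prior_errors=0, prior_clean=0):
    """P(error rate < limit), treating the prior view as an imagined earlier sample."""
    return beta_cdf(limit, 1 + prior_errors + errors, 1 + prior_clean + clean)

# Starting from "all equally likely": 100 tested, three errors.
flat = prob_rate_below(0.05, errors=3, clean=97)

# Hypothetical prior view worth 50 earlier items with 1 error.
informed = prob_rate_below(0.05, errors=3, clean=97, prior_errors=1, prior_clean=49)

print(f"flat start: {flat:.3f}, with prior view: {informed:.3f}")
```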

Cutting Sample Sizes

Using this Bayesian approach there are two ways to cut sample sizes. The less important method is to take advantage of information about error rates acquired before the sample test takes place. Perhaps you have information from sample tests in previous periods or from some other type of test such as an overall analytical comparison. Or perhaps it is simply that you would be out of business if the error rate were anything other than small.

The big opportunity to cut sample sizes comes from avoiding wasted sample items. The logic of the usual approach is to guess an error rate, then work out a sample size that will deliver the target confidence. What if the actual error rate is higher? In this case, you normally end up doing additional sample items. What if the actual error rate is lower? In this case, you normally end up doing the number of items originally planned, even though that is more than you really need.

As you can see, whenever the actual error rate is lower than the rate you assumed to calculate the sample size, you will end up doing unnecessary sample items. Using the Bayes's Rule approach and continuously updating the analysis as each result comes in, it is possible to stop work the moment the required confidence is reached. This is because the distribution represents all your beliefs about the true error rate, even taking into account what you believe you might find from additional testing.
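A stopping rule of this kind might be sketched as follows. The 5 percent tolerable rate, the 95 percent target confidence, and the function names are assumptions made for illustration, and the starting view is "all equally likely".

```python
from math import comb

def beta_cdf(x, a, b):
    """Cumulative Beta(a, b) probability at x, for whole-number a and b."""
    n = a + b - 1
    return 1 - sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a))

def items_needed(results, max_rate=0.05, confidence=0.95):
    """Update after each result; return the item count at which testing can stop.

    results is a sequence of booleans (True means the item was in error).
    Returns None if the target confidence is never reached.
    """
    errors = clean = 0
    for tested, is_error in enumerate(results, start=1):
        if is_error:
            errors += 1
        else:
            clean += 1
        if beta_cdf(max_rate, errors + 1, clean + 1) >= confidence:
            return tested
    return None

print(items_needed([False] * 200))           # no errors at all: stops at 58
print(items_needed([True] + [False] * 200))  # one early error: needs more items
```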

On average, the number of sample items you test will be lower. Simple simulations can establish exactly how much lower in any given situation, and they can be used to test rules of thumb for extending samples in batches if that is more efficient.
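A simulation of this kind could look something like the sketch below. Every number in it is an assumption chosen for illustration: a 5 percent tolerable rate, 95 percent target confidence, a guessed error rate of 2 percent for planning, and a true rate of 1 percent. The planned_size rule is only one plausible reading of "the usual approach", and the comparison ignores any extra items the usual approach would need when a planned sample falls short.

```python
import random
from functools import lru_cache
from math import ceil, comb

MAX_RATE = 0.05    # highest tolerable error rate (assumed)
CONFIDENCE = 0.95  # target confidence (assumed)

@lru_cache(maxsize=None)
def confidence_reached(errors, clean):
    """True if P(error rate < MAX_RATE) >= CONFIDENCE from an 'all equally likely' start."""
    a, n = errors + 1, errors + clean + 1
    # Beta CDF for whole-number parameters via the binomial lower tail.
    tail = sum(comb(n, k) * MAX_RATE**k * (1 - MAX_RATE)**(n - k) for k in range(a))
    return 1 - tail >= CONFIDENCE

def planned_size(guessed_rate, limit=2000):
    """Smallest fixed sample that reaches the target if errors match the guess."""
    for n in range(1, limit + 1):
        expected_errors = ceil(guessed_rate * n)
        if confidence_reached(expected_errors, n - expected_errors):
            return n
    return None

def sequential_size(true_rate, rng, limit=2000):
    """Test one item at a time and stop as soon as the target is reached."""
    errors = clean = 0
    while errors + clean < limit:
        if rng.random() < true_rate:
            errors += 1
        else:
            clean += 1
        if confidence_reached(errors, clean):
            break
    return errors + clean

rng = random.Random(1)  # fixed seed so the run is repeatable
fixed = planned_size(guessed_rate=0.02)
runs = [sequential_size(true_rate=0.01, rng=rng) for _ in range(2000)]
average = sum(runs) / len(runs)
print(f"planned: {fixed}, sequential average: {average:.1f}")
```

Changing true_rate, or stepping through items in batches rather than one at a time, is a simple way to try out rules of thumb before relying on them.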

Conclusion

The logic and calculations required for the Bayes's Rule approach are surprisingly simple and intuitive. I find the way the graphs change as data comes in mesmerizing, almost beautiful. It's also a practical approach. In a demonstration for a major telecommunications company, I showed that they could expect to cut average sample sizes by over 25 percent just by avoiding wasted sample items, despite some awkward industry regulations.
