Time To Put Numbers on Internal Controls

August 2005

There's a difference in culture and approach between two of the major groups involved in "risk management" in its widest sense. On one side are the insurance specialists, busy putting numbers on risk. Their high priesthood, the actuaries, like mathematics and thirst for empirical data. On the other side are the accountants and auditors, busy putting words on risk. Their high priesthood, the audit partners in big external audit firms, rely on judgment and seek "comfort."

by Matthew Leitch *

Why this difference when, in effect, external audit is a form of insurance too? The differences are that external audit firms very rarely receive "claims" (i.e., get taken to court), and that any claim is decided by a court after extensive analysis of the auditor's actions. An insurance company, by contrast, receives many claims and simply pays out if the terms of the policy are met.

Consequently, the auditor's strategy is based on leaving no evidence that a hostile lawyer could exploit. Although the big firms have occasionally flirted with quantitative methods for setting audit sample sizes, for example, they have shied away from them, preferring to obscure their decisions under the general heading of "professional judgment."

Risks are described as "high," assurance is often described as "reasonable" (or "high" if necessary), and control weaknesses may be deemed "significant," "serious," or even "material." None of these key terms has any quantitative basis.

The Problem with "Professional Judgment"

As far as I know, "professional judgment" as exercised by auditors and controls specialists from that background (including me) has never been systematically tested. If it were, what might be found?

Judgments of probability have been studied in a few other professions. By far the best data comes from studies of weather forecasters. These have shown that people who routinely make probability judgments and get feedback on their accuracy can become what is called "well calibrated." This means that, for example, if you take all the instances when the forecaster has said the probability of rain is 50 percent, then in fact there was rain on about 50 percent of those instances.

(You might think that makes their probabilities good ones. Not necessarily. There is rain over Cambridge on about 50 percent of days. If a forecaster says the chance of rain is 50 percent every day, he or she will be perfectly calibrated, but how useful is that?)
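The calibration check described above, and the point about usefulness, can be sketched in a few lines of Python. This is only an illustration: the outcome and forecast lists are invented, the function names are hypothetical, and the Brier score is a standard scoring rule brought in here as one possible way to measure usefulness, not something the studies mentioned above are claimed to have used.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability and compare with what happened.

    forecasts: stated probabilities (0.5 means "50 percent chance of rain")
    outcomes:  1 if it rained, 0 if it did not
    Returns {stated_probability: (number_of_forecasts, observed_frequency)}.
    """
    groups = defaultdict(list)
    for stated, happened in zip(forecasts, outcomes):
        groups[stated].append(happened)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(groups.items())}

def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probability and outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Invented data: rain on 5 of 10 days.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# Forecaster A says 50 percent every day.
always_half = [0.5] * 10

# Forecaster B says 80 percent on the first five days and 20 percent on the rest.
sharp = [0.8] * 5 + [0.2] * 5

print(calibration_table(always_half, outcomes))  # one group: stated 0.5, observed 0.5
print(calibration_table(sharp, outcomes))        # two groups, each matching its stated value
print(brier_score(always_half, outcomes))        # 0.25
print(brier_score(sharp, outcomes))              # about 0.16
```

Both invented forecasters are well calibrated on this data, but the second one discriminates between wet and dry days and so earns the lower (better) score.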

The key point is that this good calibration is rare and the result of long practice and quantitative feedback. Controls specialists get a lot of practice but next to no quantitative feedback, so it seems unlikely that they will be well calibrated.

Indeed, research into human judgment generally paints a bleak picture. We are biased in favor of confirming what we already believe. We cannot weigh more than two or three factors simultaneously without being inconsistent.

The Need for Empirical Support

For "professional judgment," the precedents are discouraging, and we should consider the controls specialists guilty until proven innocent. For all anybody knows, organizations across the world could be spending millions on audit work and controls they do not need, while doing nothing on controls that matter more than anyone realizes.

If I were a lawyer building a case against an audit firm, I would argue that the firm had been negligent in not collecting and making use of empirical data to support its judgments. The fact that other auditors have also failed to do this is no defense because other risk management professionals, such as those dealing with insurance, health, safety, and project management, have all made efforts to gather and use empirical data—as any educated person would expect.

A Lot Can Be Done

The data we need is all around us, collected already. Virtually all large organizations collect data on processing errors, fraud, safety incidents, and so on. Usability testing generates thousands of statistics a year about error rates from different kinds of computer interface in different situations and tasks. Manufacturers test their computer systems and peripherals, providing extensive information about their reliability. Telecom companies continually monitor their network availability. Internal and external auditors perform hundreds of thousands of audits a year, gathering information about millions of internal controls, and investigating hundreds of thousands of errors and a smaller number of frauds.

There's no shortage of data. The trick is simply to pull some of it together into a usable form. I have two suggestions.

Suggestion 1: Controls Designers' Risk Tables

Imagine you are tasked with designing controls over some process or system, and you estimate that one of the controls you want would cost $100,000 to implement but would save money if your assumptions about the number of errors it will have to handle are correct. You are challenged on the need for spending $100,000.

Wouldn't it be helpful to be able to refer to tables of error rates for different types of work and system derived from research? It may be that the tables do not cover the exact situation you are looking at, but even having something slightly similar would give you a starting point. Let's imagine the tables say 1 in 50 invoices will be wrong without the control you are thinking of, but in your case, the risk factors are slightly worse than those in the table. At least you have a starting point, and that's a lot better than nothing, which is what we have today.
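To show how a table figure could feed into that challenge, here is a back-of-envelope sketch in Python. Only the $100,000 cost and the 1-in-50 error rate come from the example above. The invoice volume, the cost of each undetected error, and the share of errors the control would catch are invented assumptions, and the function names are hypothetical.

```python
def expected_annual_saving(items_per_year, error_rate, cost_per_error, catch_rate):
    """Expected yearly loss avoided by a control that catches a share of errors."""
    return items_per_year * error_rate * cost_per_error * catch_rate

def break_even_error_rate(control_cost, items_per_year, cost_per_error, catch_rate):
    """Error rate at which one year of savings just covers the cost of the control."""
    return control_cost / (items_per_year * cost_per_error * catch_rate)

CONTROL_COST = 100_000        # from the example in the text
TABLE_ERROR_RATE = 1 / 50     # from the example in the text

INVOICES_PER_YEAR = 60_000    # assumption, for illustration only
COST_PER_ERROR = 150.0        # assumption: average loss per undetected wrong invoice
CATCH_RATE = 0.8              # assumption: share of errors the control would catch

saving = expected_annual_saving(
    INVOICES_PER_YEAR, TABLE_ERROR_RATE, COST_PER_ERROR, CATCH_RATE
)
threshold = break_even_error_rate(
    CONTROL_COST, INVOICES_PER_YEAR, COST_PER_ERROR, CATCH_RATE
)

print(f"Expected first-year saving: ${saving:,.0f}")
print(f"Break-even error rate: about 1 in {1 / threshold:.0f}")
print("Pays for itself within a year:", TABLE_ERROR_RATE > threshold)
```

Under these assumed figures, the control would pay for itself within a year at any error rate worse than about 1 in 72, so a tabulated rate of 1 in 50 would support the spending. With different volumes or costs, the same arithmetic could just as easily argue against it.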

Suggestion 2: Research Risk Factors in Your Organization

Consider the benefits of understanding and quantifying how risk works in your organization. For example, you may have the view that staff turnover leads to more mistakes and bigger backlogs of work. It drives risk and productivity. But how much? Does it matter who changes their job? Is there an interaction with the complexity of work? When weighing decisions about staff, how important are these effects?

Another example concerns the effectiveness of attempts to improve controls. Did past attempts actually improve control, or did they just encourage people to change their priorities and let quality problems show somewhere else? For example, after being pushed for greater accuracy, people often slow down, and this increases the incidence of lateness. Taking all things into consideration (including changes in workload, customer- and supplier-originated errors, new developments, staffing changes, and so on), did our attempts to improve controls actually work?

Or consider workload. If workload is related to mistakes and backlogs, can the impact on process performance of taking on new business be quantified? Is it possible to say how much effort might be needed on controls development to meet new requirements?

One final example concerns money lost through undetected billing errors. In telecom this is called "revenue leakage," and it has been searched for extensively over the last several years. A big part of solving revenue leakage problems is deciding where to search for them. Common sense says that some risk factors make leakage more likely, but the problem is that nobody knows for sure which factors are most important, and how much weight to put on each factor. Research is needed.

Already, leading banks seeking to measure Operational Risk accurately are gathering and analyzing data in an attempt to quantify the sort of problems that internal control has traditionally tackled. Although the objectives of this work are narrow and concerned with compliance, perhaps we will see some exciting progress nonetheless.

Useful Techniques

To get the most out of our data (for either of my suggestions), we need to use multivariate statistics to tease out the impact of different factors on error and other risks. Fortunately, there are several great software packages that offer a range of techniques for visualizing and quantifying these multiple relationships.

Although the mathematics and algorithms underlying these packages are often complex and hard to understand, we don't have to understand everything to use it effectively. (Just like you don't have to understand how a car works to drive safely from A to B.)

The techniques available go far beyond fitting straight lines to scatterplots, as most of us did at school. The model doesn't have to be a straight line. It can rely on a potentially large number of variables, some of which are numbers while others are categories. This is an exciting area of development with new ideas being tried all the time.

Neural networks are still an important area. Decision trees (e.g., C4.5, CART) use splitting criteria such as information gain (from information theory) or Gini impurity to identify the most important variables. Statistical learning can also be done using kernel machines (not actually physical machines) such as Support Vector Machines. Many of these tools work best when given a large amount of data to learn from (e.g., hundreds or thousands of examples). However, there are good reasons to consider using mathematical methods even where the amount of data available is low.
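As a rough illustration of how a decision tree ranks variables, the sketch below computes information gain, the entropy-based measure associated with the C4.5 family (CART implementations more commonly use Gini impurity). The eight invoice records, the field names, and the function names are all invented for the example. A real package would do this across many variables and then split the data repeatedly.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` from splitting the rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Invented records: how the invoice was entered, whether the clerk was new,
# and whether the invoice turned out to be wrong.
records = [
    {"entry": "manual",    "new_staff": "yes", "error": "yes"},
    {"entry": "manual",    "new_staff": "no",  "error": "yes"},
    {"entry": "manual",    "new_staff": "no",  "error": "yes"},
    {"entry": "manual",    "new_staff": "yes", "error": "no"},
    {"entry": "automated", "new_staff": "yes", "error": "no"},
    {"entry": "automated", "new_staff": "no",  "error": "no"},
    {"entry": "automated", "new_staff": "yes", "error": "yes"},
    {"entry": "automated", "new_staff": "no",  "error": "no"},
]

gain_entry = information_gain(records, "entry", "error")
gain_staff = information_gain(records, "new_staff", "error")

print(f"Gain from entry method: {gain_entry:.3f} bits")
print(f"Gain from new staff:    {gain_staff:.3f} bits")
```

In this made-up data, the entry method carries some information about errors and the new-staff flag carries none, so a tree would split on entry method first.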

One problem with human judgment is that we struggle to weigh more than two or three variables at once. We try to eliminate variables from consideration by finding one or two that are decisive on their own, or by pairing up pros and cons in the hope that we can ignore those that seem to balance. But, often, none of these strategies is applicable or safe.

Using even very crude mathematical formulae can be advantageous. The formula does not make mistakes, succumb to confirmation bias, or give way to special pleading. If we want to make objective decisions, this is very useful.

Some traditional statistical methods do not give sensible results on the basis of small amounts of empirical data, or simply refuse to provide any estimates at all. In contrast, Bayesian methods involve starting with an initial view and modifying it as data are received. Consequently, there is always some view to use.
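A minimal sketch of that idea, assuming the quantity of interest is an error rate and the initial view is expressed as a Beta distribution (a common textbook choice, not something prescribed here). The prior strength and the sample counts are invented for illustration, and the function names are hypothetical.

```python
def update_beta(alpha, beta, errors, non_errors):
    """Combine a Beta(alpha, beta) view of an error rate with new observations."""
    return alpha + errors, beta + non_errors

def beta_mean(alpha, beta):
    """Expected error rate under a Beta(alpha, beta) view."""
    return alpha / (alpha + beta)

# Initial view: errors in about 1 in 50 items, held about as firmly as if
# 50 items had already been checked (assumption, for illustration only).
prior = (1.0, 49.0)
prior_mean = beta_mean(*prior)

# Invented small sample: 30 items checked, 2 found to be wrong.
posterior = update_beta(*prior, errors=2, non_errors=28)
posterior_mean = beta_mean(*posterior)

print(f"Estimate before any data: {prior_mean:.4f}")
print(f"Estimate after 30 items:  {posterior_mean:.4f}")
```

With no data at all, the estimate is simply the initial view. Each new batch of observations moves it, and the more data there are, the less the starting point matters.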

Another fascinating approach is the use of "fast and frugal" algorithms that analyze the database of past experience only when a prediction is required. One interesting example is PROBEX (PROBabilities from EXemplars), which relies on a similarity function and then computes probabilities as a similarity-weighted frequency. Although it does not discriminate between important and irrelevant variables, it still performs well, giving results close to human judgment but without the mistakes and biases.
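The core idea of a similarity-weighted frequency can be sketched as follows. This is a simplified illustration and not the published PROBEX algorithm, which has further details omitted here. The multiplicative similarity rule, the mismatch value of 0.3, the cue names, and the past cases are all assumptions made for the example.

```python
def similarity(probe, exemplar_cues, mismatch=0.3):
    """Multiply in a penalty (0 < mismatch < 1) for every cue that differs."""
    score = 1.0
    for cue, value in probe.items():
        if exemplar_cues.get(cue) != value:
            score *= mismatch
    return score

def estimate_probability(probe, exemplars, mismatch=0.3):
    """Frequency of the outcome among past cases, weighted by similarity to the probe."""
    weights = [similarity(probe, e["cues"], mismatch) for e in exemplars]
    weighted_hits = sum(w * e["outcome"] for w, e in zip(weights, exemplars))
    return weighted_hits / sum(weights)

# Invented past cases: outcome is 1 if a billing error was later found.
past_cases = [
    {"cues": {"manual_rating": True,  "new_product": True},  "outcome": 1},
    {"cues": {"manual_rating": True,  "new_product": False}, "outcome": 1},
    {"cues": {"manual_rating": False, "new_product": True},  "outcome": 0},
    {"cues": {"manual_rating": False, "new_product": False}, "outcome": 0},
]

# New case to assess: manually rated and a new product.
new_case = {"manual_rating": True, "new_product": True}
probability = estimate_probability(new_case, past_cases)

print(f"Estimated probability of a billing error: {probability:.2f}")
```

Every cue is treated alike, which mirrors the point above that the approach does not try to separate important variables from irrelevant ones. Nothing is fitted in advance: the past cases are consulted only when an estimate is needed.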

Conclusions

For too long, internal control and audit specialists have relied on "professional judgment" and failed to seek and use empirical evidence relevant to their judgments about risk and control. The data to remedy this are all around us, and today's powerful yet easy-to-use statistical tools give us the best chance ever to make sense of these data.


Further reading:

Fast and Frugal Use of Cue Direction in States of Limited Knowledge by Magnus Persson and Peter Juslin.


*I would like to acknowledge the influence of Colin Tuerena at British Telecommunications plc for highlighting the value of empirically testing beliefs about risk, and of Michael Mainelli of Z/Yen Limited, who introduced me to Support Vector Machines.


Opinions expressed in Expert Commentary articles are those of the author and are not necessarily held by the author’s employer or IRMI. This article does not purport to provide legal, accounting, or other professional advice or opinion. If such advice is needed, consult with your attorney, accountant, or other qualified adviser.