Risks and Consequences

Posted in real world problems on Friday, July 22 2016

I want to take a minute to talk about risk analysis and some of the underlying assumptions that usually go unexamined. First, a short intro to risk analysis in industry.

A brief tour of Risk Analysis

"Risk" is defined, as office jargon, as the product of probability and consequence, i.e. if an event has probability $p$ of occuring and a cost of $c$ to deal with then the risk is $pc$. This is fairly straight forward and should tickle your nerve endings with recognition if you've ever looked at loss functions more generally (more on that later). Typically risk is visualized in a matrix, for example the following is a typical risk matrix for project management.

[Figure: risk matrix]

One advantage of coarse-graining risk like this is that it allows qualitative information to enter the picture, and also lets you evaluate a risk along multiple axes of consequence. Suncor, for example, used to use a matrix with three consequence axes: financial consequence, health and safety consequence (i.e. the impact on the workers), and environmental consequence. The three axes were lined up so that you could use one matrix to evaluate the risk in the three different categories, taking the greatest overall risk as the one used in future decisions.
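To make the coarse-graining concrete, here is a minimal sketch of a multi-axis matrix lookup; the band boundaries and the 5x5 shape are invented for illustration and are not any company's actual matrix:

```python
def band(value, boundaries):
    """Index of the band that `value` falls into (0 = lowest)."""
    return sum(value > b for b in boundaries)

def worst_axis_risk(p, consequences):
    """Score each consequence axis on a 5x5 matrix and keep the worst.

    p            -- probability of the event, 0..1
    consequences -- dict mapping axis name to a 0..1 consequence score
    """
    p_bands = [0.01, 0.05, 0.2, 0.5]   # hypothetical probability bands
    c_bands = [0.01, 0.05, 0.2, 0.5]   # hypothetical consequence bands
    row = band(p, p_bands)
    scores = {axis: (row + 1) * (band(c, c_bands) + 1)
              for axis, c in consequences.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]

# Three consequence axes, one matrix: the greatest overall risk wins.
print(worst_axis_risk(0.1, {"financial": 0.3, "safety": 0.6, "environment": 0.05}))
```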

As something of an aside: one big advantage of using separate categories for financial risks (quantified in dollars), risks to personnel (quantified as expected injuries or fatalities), and risks to the environment is that it avoids doing the dreaded "cost of a human life" calculation. I mean, the calculation is still implicit in the decision, but it looks more humane at least.

Anyways, back to risk analysis. In the world of big industrial pressure equipment this kind of risk analysis can be used to determine whether repairs or replacement of equipment are necessary this turnaround or can be pushed to later. Typically some risk threshold is chosen and any piece of equipment with a risk of failure greater than that threshold is repaired, replaced, or whatever. This is pretty rational on the surface, but I think there are a lot of assumptions baked into it that go completely unnoticed by most people.
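For concreteness, the threshold rule itself is a one-liner; a sketch, with the threshold and costs invented for illustration:

```python
def needs_attention(p_failure, cost_of_failure, threshold):
    """Naive threshold rule: flag anything whose risk exceeds the cutoff."""
    return p_failure * cost_of_failure > threshold

# A vessel with a 5% failure chance and a $2M failure cost carries $100k of risk.
print(needs_attention(p_failure=0.05, cost_of_failure=2_000_000, threshold=50_000))  # True
```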

Naive Risk Assessment vs Equally Naive Decision Theory

First off, and probably most obvious if you have actually project-managed a repair, is that the follow-up work also has a cost to it (and its own risks; risk analysis really goes down a rabbit hole of risk matrices pretty quickly). The threshold method seems to assume there are only consequences to inaction while ignoring the consequences of action. We can visualize this in a standard decision matrix:

| | Vessel will fail in future | Vessel won't fail in future |
| --- | --- | --- |
| do nothing, $d_0$ | cost of failure, $c_f$ | $0$ |
| repair now, $d_1$ | cost of repair, $c_r$ | cost of repair, $c_r$ |

Classic risk analysis takes as its decision criterion that one should repair if the expected loss from doing nothing exceeds the set threshold: $$ E[ d_0 ] = p \cdot c_f > \text{threshold} $$

Standard decision theory would instead suggest picking the decision that minimizes the expected loss, so do the repair if: $$ E[ d_0 ] > E[ d_1 ] $$ $$ p \cdot c_f > c_r $$

Where I assume the cost of the repair does not depend on whether or not the vessel would have failed in the future (i.e. that the underlying cause of the failure has no bearing on the cost of the repair).
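As a sketch of the decision-theoretic version (the costs and probability here are invented for illustration):

```python
def expected_losses(p, c_f, c_r):
    """Expected loss of each decision for a vessel with failure probability p."""
    return {
        "do nothing, d0": p * c_f,   # pay the failure cost with probability p
        "repair now, d1": c_r,       # pay the repair cost either way
    }

losses = expected_losses(p=0.05, c_f=2_000_000, c_r=150_000)
print(losses, "->", min(losses, key=losses.get))
# "do nothing" wins here: p * c_f = 100_000 < c_r = 150_000
```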

Another advantage of doing it this way is that we can add decisions to our list, and be assured that there are very thick textbooks in university libraries justifying our decision methodology. For example, instead of naively repairing everything, I could look at the option to "investigate further". Suppose I have an inspector who is very good at her job and can identify with perfect accuracy whether or not a vessel will fail in the future, but, being so good, she costs a fair bit, $c_i$. I can add this decision (inspect, then repair only if she says the vessel will fail) to my table easily and evaluate:

| | Vessel will fail in future | Vessel won't fail in future |
| --- | --- | --- |
| do nothing, $d_0$ | $c_f$ | $0$ |
| repair now, $d_1$ | $c_r$ | $c_r$ |
| inspect, $d_2$ | $c_i + c_r$ | $c_i$ |

$$ E[ d_2 ] = p \cdot \left( c_i + c_r \right) + \left( 1 - p \right) c_i $$ $$ E[ d_2 ] = p c_r + c_i $$

We prefer this option when the expected loss of inspecting (plus the possible repair) is less than the expected loss of just repairing:

$$ E[ d_2 ] < E[ d_1 ] $$ $$ p c_r + c_i < c_r $$ $$ c_i < \left( 1 - p \right) c_r $$

When the probability of failure is high we end up with what appears at first glance to be a counter-intuitive result: inspecting further is a bad decision even when the cost of inspection is less than the cost of repair. I can think of some tube bundles that fall into this category.
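Numerically, a sketch of that high-probability regime (all numbers invented):

```python
def expected_losses(p, c_f, c_r, c_i):
    """Expected losses including the perfect-inspector option."""
    return {
        "do nothing, d0": p * c_f,
        "repair now, d1": c_r,
        "inspect, d2":    p * c_r + c_i,   # E[d_2] = p*c_r + c_i
    }

# Failure is very likely, and the inspection is cheaper than the repair,
# yet inspecting still loses: c_i = 20_000 > (1 - p) * c_r = 10_000.
print(expected_losses(p=0.9, c_f=1_000_000, c_r=100_000, c_i=20_000))
```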

Considering Uncertainty and Fallibility

Now to switch gears a bit. Suppose we have those tube bundles: we have the option to replace them every turnaround, or to inspect the tubes first, giving us the chance to save on replacement costs. However, we now recognize that inspectors are fallible and the inspection could miss something, leaving us exposed to a failure.

I'm going to introduce some probability notation and an assumption. Suppose a tube bundle fails if it has a flaw (and if it has a flaw it is guaranteed to fail), so our probability that the tube bundle will fail is $P(flaw)$. Now suppose that inspectors make binary calls, $good$ and $bad$. We can summarize the inspection possibilities in a table:

| | Bundle has flaw | Bundle has no flaws |
| --- | --- | --- |
| Inspection good | $P( good \mid flaw )$ | $P( good \mid \neg flaw )$ |
| Inspection bad | $P( bad \mid flaw )$ | $P( bad \mid \neg flaw )$ |

Where $P(good \mid \neg flaw)$ is the probability that the inspector will mark the bundle $good$ given that it does not have a flaw (the true negative rate in this case), and $P( bad \mid flaw )$ is the true positive rate.

This opens up the possibility that the inspector can go out, look at the bundle, decide it's good to go, and be totally wrong. We can work out this probability with Bayes' theorem, given our prior belief $P(flaw)$ that the bundle will fail, which was likely obtained from the vessel history and some models of different failure modes (e.g. API 580 and 581).

$$ P(flaw \mid good) = { { P( good \mid flaw ) \cdot P(flaw) } \over { P( good \mid flaw ) \cdot P(flaw) + P( good \mid \neg flaw ) \cdot P(\neg flaw) }} $$
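As code, a small sketch of this Bayes update:

```python
def p_flaw_given_good(p_flaw, p_good_given_flaw, p_good_given_no_flaw):
    """Posterior probability of a flaw after the inspector marks the bundle 'good'."""
    numerator = p_good_given_flaw * p_flaw
    denominator = numerator + p_good_given_no_flaw * (1.0 - p_flaw)
    return numerator / denominator

print(p_flaw_given_good(0.05, 0.01, 0.99))  # ~0.00053 with the numbers used below
```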

Let's put some numbers to this. Suppose we have a tube bundle that, based on its operating history and some prior experience, has a 5% chance of having a fatal flaw in it. Furthermore, let's suppose the inspector correctly identifies flaws 99% of the time ($P( bad \mid flaw ) = 0.99$) and correctly identifies bundles that are fit for service 99% of the time ($P( good \mid \neg flaw ) = 0.99$). Then the probability that a tube bundle has a fatal flaw given that the inspector rated it as "good" is:

$$ P(flaw \mid good) = { { 0.01 \cdot 0.05 } \over { 0.01 \cdot 0.05 + 0.99 \cdot 0.95 }} \approx 0.00053 \approx 0.05\% $$

So we have roughly a 0.05% chance of a tube bundle being flawed given that it has passed inspection. We can summarize all of this into an expected loss if we always believe our inspector (i.e. repair if and only if the inspector marks the bundle "bad"):

$$ E[d_3] = c_i + c_f \cdot P( good \mid flaw ) \cdot P(flaw) + c_r \cdot \left( P( bad \mid flaw ) \cdot P(flaw) + P( bad \mid \neg flaw ) \cdot P(\neg flaw) \right) $$

The first term is the inspection we always pay for, the second is the chance that a flawed bundle slips through and fails on us, and the third is the chance that the inspector calls for a repair (rightly or wrongly).
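In code, a sketch of $E[d_3]$, reusing the worked numbers above and normalizing so the repair costs 1:

```python
def expected_loss_trust_inspector(p_flaw, tpr, tnr, c_i, c_r, c_f):
    """E[d_3]: inspect, then repair if and only if the inspector says 'bad'.

    tpr -- P(bad | flaw), the true positive rate
    tnr -- P(good | no flaw), the true negative rate
    """
    p_good_and_flaw = (1.0 - tpr) * p_flaw                 # flaw slips through
    p_bad = tpr * p_flaw + (1.0 - tnr) * (1.0 - p_flaw)    # inspector calls for repair
    return c_i + c_f * p_good_and_flaw + c_r * p_bad

# 5% flaw rate, the 99%/99% inspector, free inspection, failures 50x a repair:
print(expected_loss_trust_inspector(0.05, 0.99, 0.99, c_i=0.0, c_r=1.0, c_f=50.0))
# ~0.084, comfortably below the 1.0 expected loss of always repairing
```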

If we start playing around with the skill of the inspectors and the consequences of failures we can find interesting zones where listening to the inspector is a bad idea, in the long run, because the chance of a false negative (a flaw slipping through inspection) is too high to bear.

To make this a little clearer I've plotted a few examples, assuming the inspector given above, where we can see that once the cost of failure gets too unbearable we are better off always replacing the tube bundle during a turnaround (in the plots I assumed inspections are free, just to simplify things). All costs are normalized relative to the cost of replacement.

[Figure: whether to inspect or not]
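The crossover the plots show can be found directly by sweeping the cost of failure; a sketch using the function above, under the same assumptions (free inspections, costs normalized to the replacement cost):

```python
p_flaw, tpr, tnr = 0.05, 0.99, 0.99
for c_f in (10, 100, 1000, 2000, 5000):
    trust = expected_loss_trust_inspector(p_flaw, tpr, tnr, c_i=0.0, c_r=1.0, c_f=c_f)
    print(c_f, round(trust, 3), "replace" if trust > 1.0 else "inspect")
# Trusting the inspector stops paying off once c_f * P(good and flaw) outweighs
# the repairs saved; with these numbers the break-even is around c_f ~ 1880.
```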

Final Thoughts

Another area that I haven't really talked about is evaluating whether we should try to repair a vessel at all versus simply replacing it. There are all sorts of tools for modeling repair/replacement decisions from a pure finance perspective, but I want to talk about another side to repairs. Repairs can fail. Any time you are welding on the pressure envelope you run the risk of introducing a new weak point that can fail. The more often a vessel is repaired, the more likely it is that at least one of those repairs is faulty. We can quantify this in a manner analogous to the risk of bad inspections above, and then take a more clear-headed view of doing a repair in the context of our decision table.
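As a sketch of that quantification: if each repair independently carries some small chance $q$ of introducing a faulty weld (the 2% here is invented for illustration), the chance of at least one bad repair compounds quickly with the repair count:

```python
def p_any_faulty_repair(q, n):
    """Chance that at least one of n independent repairs introduced a flaw."""
    return 1.0 - (1.0 - q) ** n

for n in (1, 5, 10, 20):
    print(n, round(p_any_faulty_repair(0.02, n), 3))
# 0.02, 0.096, 0.183, 0.332 -- this feeds back into P(flaw) for future decisions
```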

It should be pretty clear that risk analysis is a good first step, but stopping at threshold values and either winging it with rule-of-thumb judgements or following standardized decision trees leaves out a lot of potential gains. While incorporating probabilities and loss functions can get complicated overall, each step in each decision is very simple math, and the whole thing can be built up with specialized software or even just a spreadsheet. By doing so you can capture all sorts of additional information about your situation and make more rational decisions.

None of this is new, and hopefully it isn't new to the industry (I can't imagine it is), but in my experience there are a lot of people working in reliability and maintenance who go no further than simple risk assessments and a threshold value in deciding what to do. Furthermore, while people might have an inkling of when it is best to just replace a vessel versus repairing it or doing extensive inspections, most of the time they base that on overall cost instead of on expected loss. This means the folks in industry that I know are paying more money than they need to, in the long run, to quantify the degree to which their worn-out equipment is worn out.