Safety in numbers. Part 1

It’s a common misconception that the more you have of something, the better it is. Well, I say misconception, but in simple cases it isn’t one. For safety’s sake, it’s common to have more than one of something. In a classic everyday aircraft that might be two engines, two sets of flight controls, two electrical generators, two pilots, and so on.

It seems the most common-sense of common-sense conclusions: if one thing fails, or doesn’t do what it should, we have another to take its place. It’s not always the case that one sits idle until the other goes; like two aircraft engines, the normal situation is both working together in parallel. In other arrangements one system carries the full load while another sits there monitoring what’s happening, ready to take over if needed.

This week, as with many weeks, thinkers and politicians have been saying we need more people with a STEM education (Science, Technology, Engineering, and Maths). This often seems common sense and goes little questioned. However, it’s not always clear that people mean the same things when talking about STEM. Most particularly, it’s not always clear what they consider to be Maths.

To misquote the famous author H. G. Wells: statistical thinking may, one day, be as necessary as the ability to read and write. His full quote was a bit more impenetrable, but the overall meaning is captured in my shortened version.

To understand how a combination of things works together, or doesn’t, some statistical thinking is certainly needed. The maths associated with probabilities can scare people off, so ways to keep our reasoning simple do help.

The sums for dual aircraft systems are not so difficult, provided we know that the something we are talking about is reliable in the first place. If it’s not reliable then the story is a different one. For the sake of argument, and keeping to practical reality, let’s say that the thing we are talking about fails only once every 1000 hours.

What’s that in human terms? It’s a lot less than a year’s worth of daylight hours, that being roughly half of 24 hours x 7 days x 52 weeks = 8736 hours, so about 4368 hours (putting aside location and leap years). In a year, in good health, our bodies operate continuously for that time. For the engineered systems under discussion that may not be the case. We switch them on, and we switch them off, possibly many times in a year.

That’s why we need to consider the amount of time something is exposed to the possibility of failure. From here on we can use the word “probability” instead of possibility; chance and likelihood work too. When expressed numerically, probabilities range from 0 to 1: zero when something will never happen and one when something will always happen.

So, let’s think about any one hour of operation of an engineered system, and use the reliability number from our simple argument. Making an assumption, we can liken that to a probability of P = 1/1000, or 1 x 10⁻³, per hour. That gives us a round number that represents the likelihood of failure in any one hour of operation of one system.

Now, back to the start. We have two systems, maybe two engines: two systems that can work independently of each other. True, there are cases where they may not be independent, but let’s park those for the moment.

As soon as we have more than one thing we need to talk of combinations. Here the simple question is how many combinations exist for two working systems?

Let’s give them the names A and B. In our simplified world, each of A and B either works or doesn’t work when needed; that’s failed or not failed, said another way. There are four combinations that can exist. Displayed in a table this looks like:

A ok    | B ok
A fails | B ok
A ok    | B fails
A fails | B fails

Table 1
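As a quick check, the four combinations can be enumerated mechanically. Here is a minimal Python sketch; the ok/fails labels simply mirror the table:

```python
from itertools import product

# Each of A and B is either "ok" or "fails": 2 x 2 = 4 combinations.
states = ["ok", "fails"]
combinations = list(product(states, repeat=2))

for a, b in combinations:
    print(f"A {a:<5} B {b}")

print(len(combinations), "combinations")  # 4
```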

This is all binary. We are not considering any near failure, or other anomalous behaviour that can happen in the real world. We are not considering any operator intervention that switches on or switches off our system. We are looking at the probability of a failure happening in a period of operation of both systems together.

Now, let’s say that the systems A and B each have a known probability of failure.

Thus, the last line of the table becomes: P4 = PA and PB

That is, in any given hour of operation, the chance of both A and B failing together is the product of their probabilities, assuming the failures are random and independent.

Calculating the last line of the table becomes: P4 = PA x PB

In the first line of the table, we have the case of perfection. Simultaneous operation is not interrupted, even though we know both A and B have a likelihood of failure in any one hour of operation.

Thus, the first line becomes: P1 = (1 – PA) x (1 – PB)

Which nicely approximates to P1 ≈ 1, given that 1/1000 is tiny by comparison with 1.

The cases where either A or B fails are in the middle of the table.

P2 = PA x (1 – PB) together with P3 = (1 – PA) x PB

Thus, the probability of either A or B failing is P2 + P3 = PA + PB – 2 x PA x PB, which, using the same logic as above, nicely approximates to PA + PB because the product term is tiny.
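Putting the four rows of Table 1 together numerically, using the 1/1000-per-hour figure from earlier, a small Python sketch (variable names mirror the table) shows the sums behave as claimed:

```python
# Probabilities for the four rows of Table 1, assuming A and B fail
# independently, each with probability 1/1000 in any one hour.
PA = PB = 1e-3

P1 = (1 - PA) * (1 - PB)  # A ok,    B ok
P2 = PA * (1 - PB)        # A fails, B ok
P3 = (1 - PA) * PB        # A ok,    B fails
P4 = PA * PB              # A fails, B fails

# The four rows cover every possibility, so the probabilities sum to 1.
assert abs((P1 + P2 + P3 + P4) - 1) < 1e-12

print(P4)       # double failure: about 1e-06, far rarer than one system alone
print(P2 + P3)  # either one fails: about 0.002, close to PA + PB
```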

It gets even simpler if we consider the two systems to be identical, namely that the probabilities PA and PB are equal; call the common value P.

A double failure occurs with probability P²

A single failure occurs with probability approximately 2P

So, with two systems operating in parallel, there’s a decreased likelihood of a double failure but an increased likelihood of a single failure. This can be taken beyond an arrangement with two systems. With four systems there’s a massively decreased likelihood of a total failure (P⁴) but four times one system’s likelihood of a single failure (roughly 4P). Hence my remark at the beginning.
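The same arithmetic can be sketched for any number of systems. A hypothetical Python helper, assuming n identical and independent systems each with the per-hour failure probability used above:

```python
# Assumes n identical, independent systems, each failing with
# probability p in any one hour (p = 1/1000 in the argument above).
P = 1e-3

def total_failure(n: int, p: float = P) -> float:
    """Probability that all n systems fail in the same hour: p**n."""
    return p ** n

def at_least_one_failure(n: int, p: float = P) -> float:
    """Probability that at least one of n systems fails:
    1 - (1 - p)**n, which is roughly n*p when p is small."""
    return 1 - (1 - p) ** n

for n in (1, 2, 4):
    print(f"{n} system(s): total failure {total_failure(n):.0e}, "
          f"any single failure {at_least_one_failure(n):.2e}")
```

The loop makes the trade-off plain: total failure shrinks geometrically as systems are added, while the chance of having some failure to deal with grows almost linearly.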

[Please let me know if this is in error or there’s a better way of saying it]