Is Airworthiness Dead? 2/

Where I left the discussion there was a question mark. What does conformity mean when constant change is part of the way an aircraft system works?

It’s reasonable to say – that’s nothing new. Every time I boot up this computer, it will go through a series of states that can be different from any that it has been through before. Cumulative operating system updates are regularly installed. I depend on the configuration management practices of the Original Equipment Manufacturer (OEM). That’s the way it is with aviation too. The more safety critical the aircraft system, the more rigorous the configuration management processes.

Here comes the – yes, but. Classical complex systems are open to verification and validation. They can be decomposed and reconstructed and shown to be in conformance with a specification.

Now, we are going beyond that situation, to one where levels of complexity prohibit deconstruction. Often, we are stuck with viewing a system as a “black box”[1], because the internal workings of the system are opaque or “black.” This abstraction is not new. The treatment of engineered systems as black boxes dates from the 1960s. However, this has not been the approach used for safety critical systems. Conformity to an approved design remains at the core of our current safety processes.

It’s as well to take an example to illustrate where a change in thinking is needed. In many ways the automotive industry is already wrestling with these issues. Hands-free motoring means that a car takes over from a driver and acts as a driver does. A vehicle may be semi or fully autonomous. Vehicles use image processing technologies that take vast amounts of data from multiple sensors and mix it up in a “black box” to arrive at the control outputs needed to drive safely.

Neural networks or heuristic algorithms may be the tools used to make sense of a vast amount of constantly changing real-world data. The machine learns as it goes. As technology advances, particularly in machine learning ability, it becomes harder and harder to say that a vehicle system will always conform to an understandable set of rules. Although my example is automotive, the same challenges are faced by aviation.

There’s a tendency to see such issues as over the horizon. They are not. Whereas the research, design and development communities are up to speed, there are large parts of the aviation community that are not ready for a step beyond inspection and conformity checking in the time-honoured way.

Yes, Airworthiness is alive and kicking. As a subject, it now must head into unfamiliar territory. Assumptions held and reinforced over decades must be revisited. Checking conformity to an approved design may no longer be sufficient to assure safety.

There are more questions than answers but a lot of smart people seeking answers.

POST 1: Explainability is going to be one of the answers – I’m sure. Explained: How to tell if artificial intelligence is working the way we want it to | MIT News | Massachusetts Institute of Technology

POST 2: The world of the smart phone and the cockpit are getting ever closer How HUE Shaped the Groundbreaking Honeywell Anthem Cockpit


[1] In science, computing, and engineering, a black box is a device, system, or object which produces useful outputs without revealing its internal workings.

Is Airworthiness dead?

Now, there’s a provocative proposition. Is Airworthiness dead? How you answer may depend somewhat on what you take to be the definition of airworthiness.

I think the place to start is the internationally agreed definition in the ICAO Annexes[1] and associated manuals[2]. Here “Airworthy” is defined as: The status of an aircraft, engine, propeller or part when it conforms to its approved design and is in a condition for safe operation.

Right away we start with a two-part definition. There’s a need for conformity and safety. Some might say that they are one and the same. That is, that conformity with an approved design equals safety. That statement always makes me uneasy given that, however hard we work, we know approved designs are not perfect, and can’t be perfect.

The connection between airworthiness and safety seems obvious. An aircraft deemed unsafe is unlikely to be considered airworthy. However, the caveat there is centred around the degree of safety. Say, an aircraft may be considered airworthy enough to make a ferry flight but not to carry passengers on that flight. Safety, that freedom from danger, is a particular level of freedom.

At one end is that which is thought to be absolutely safe, and at the other end is a boundary beyond which an aircraft is unsafe. When evaluating what is designated as “unsafe”, a whole set of detailed criteria is called into action[3].

Dictionaries often give a simpler definition of airworthiness as “fit to fly.” This is a common definition that is comforting and explainable. Anyone might ask: is a vehicle fit to make a journey through air or across sea[4] or land[5]? That is “fit” in the sense of providing an acceptable means of travel. Acceptable in terms of risk to the vehicle, any person or cargo travelling, and third parties en route. In fact, “worthiness” itself is a question of suitability.

My provocative proposition isn’t aimed at the fundamental need for safety. The part of Airworthiness meaning in a condition for safe operation is universal and indisputable. The part that needs exploring is the part that equates safety with conformity.

A great deal of my engineering career has been spent accepting the importance of configuration management[6]: always ensuring that the intended configuration of systems, equipment or components is exactly what is needed for a given activity or situation. Significant resources can be expended ensuring that a given configuration meets a defined specification.

The assumption has always been that once a marker has been set down and proven, then repeating a process will produce a good (safe) outcome. Reproducibility becomes fundamental. When dealing with physical products this works well. It’s the foundation of approved designs.

But what happens when the function and characteristics of a product change as it is used? For example, an expert system learns from experience. On day one, a given set of inputs may produce predictable outputs. On day one hundred, when subject to the same stimulus, those outputs may have changed significantly. No longer do we have steadfast repeatability.
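To make the drift concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration: a toy model updates its weights from each new observation, and the same probe input gives a different answer on “day one hundred” than it did on “day one.”

```python
# A toy online-learning model. The same input, probed before and
# after in-service learning, no longer produces the same output.

def predict(weights, x):
    """Simple linear model: w0 + w1 * x."""
    return weights[0] + weights[1] * x

def update(weights, x, target, lr=0.05):
    """One step of gradient descent on squared error."""
    error = predict(weights, x) - target
    return (weights[0] - lr * error, weights[1] - lr * error * x)

weights = (0.0, 1.0)   # the "day one" configuration
probe = 2.0            # a fixed test stimulus

print("day 1 output:", predict(weights, probe))

# One hundred days of in-service data that slowly drifts.
for day in range(100):
    weights = update(weights, x=2.0, target=2.0 + 0.01 * day)

print("day 100 output:", predict(weights, probe))
# Same stimulus, different response: the day-one verification
# no longer describes the system in service.
```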

So, what does conformity mean in such situations? There’s the crux of the matter.


[1] ICAO Annex 8, Airworthiness of Aircraft. ISBN 978-92-9231-518-4

[2] ICAO Doc 9760, Airworthiness Manual. ISBN 978-92-9265-135-0

[3] https://www.ecfr.gov/current/title-14/chapter-I/subchapter-C/part-39

[4] Seaworthiness: the fact that a ship is in a good enough condition to travel safely on the sea.

[5] Roadworthy: (of a vehicle) in good enough condition to be driven without danger.

[6] https://www.apm.org.uk/resources/what-is-project-management/what-is-configuration-management/

Safety Research

I’ve always found Patrick Hudson’s[1] graphic, which maps safety improvements to factors like technology, systems, and culture, an engaging summary (Figure 1[2]). Unfortunately, it’s wrong, or at least that’s my experience. I mean not wholly wrong, but the reality of achieving safety performance improvement doesn’t look like this graph.

Yes, aviation safety improvement has been a story of continuous improvement, at least if the numbers are aggregated. Yes, a great number of the earlier improvements (1950s-70s) were made by what might be called hard technology improvements. Technical requirements mandated systems and equipment that had to meet higher performance specifications.

For the last two decades, the growth in support for safety management, and the use of risk assessment has made a considerable contribution to aviation safety. Now, safety culture is seen as part of a safety management system. It’s undeniably important[3].

My argument is that aviation’s complex mix of technology, systems, and culture is not a case of one superseding the other. This is particularly relevant in respect of safety research. Looking at Figure 1, it could be concluded that there’s not much to be gained by spending on technological solutions to problems because most of the issues rest with the human actors in the system. Again, without diminishing the contribution human error makes to accidents and incidents, the physical context within which errors occur is changing dramatically.

Let’s imagine the role of a sponsor of safety related research who has funds to distribute. For one, there are few such entities because most of the available funds go into making something happen in the first place. New products, aircraft, components, propulsion, or control systems always get the lion’s share of funds. Safety related research is way down the order.

The big aviation safety risks haven’t changed much in recent years, namely: controlled flight into terrain (CFIT), loss of control in-flight (LOC-I), mid-air collision (MAC), runway excursion (RE) and runway incursion (RI)[4]. What’s worth noting is that the potential for reducing each one of them is changing as the setting within which aviation operates is changing. Rapid technological innovation is shaping flight and ground operations. The balance between reliance on human activities and automation is changing. Integrated systems are getting more integrated.

As the contribution of human activities reduces, so an appeal to culture has less impact. Future errors may be machine errors rather than human errors.

It’s best to get back to designing in hard safety from day one. Safety related research should focus more on questions like: what does hard safety look like for high levels of automation, including the use of artificial intelligence? What does hard safety look like for autonomous flight? What does hard safety look like for dense airspace at low level?

Just a thought.


[1] https://nl.linkedin.com/in/patrick-hudson-7221aa6

[2] Achieving a Safety Culture in Aviation (1999).

[3] https://www.flightsafetyaustralia.com/2017/08/safety-in-mind-hudsons-culture-ladder/

[4] https://www.icao.int/Meetings/a41/Documents/10004_en.pdf

Ockham

It’s a small Surrey village just off the A3. The Black Swan[1] in Ockham is a nice place to eat on a summer day. Although Surrey is a populous county there are many picturesque spots in its countryside. It’s best to describe the village as semi-rural as it’s an easy commute to Guildford.

Keep It Simple, Stupid (KISS) is a dictum often used by politicians, managers, and decision makers. It appeals because it’s simple to remember as much as it implores simplicity.

Some sayings are plain folklore and get repeated because they strike a chord with everyday lived experience. Dig a bit and there’s little logic or foundation. KISS offers both a sense that it’s common sense and that there must be some underlying reasoning behind it. Surely, it must be more efficient to try to keep arrangements as simple as possible. That might be processes, procedures, training or even designs.

Although KISS is highly appealing, it isn’t, on closer inspection, how we live our lives. Layers and layers of complexity underlie everything we do. The issue is that most of the time we do not see the complexity that serves us. A case in point is my iPhone. Yes, its human interface has been designed with KISS in mind, but its functions are provided by levels of complex circuitry and software that go way beyond my understanding. So, we have an illusion of simplicity because complexity is hidden from our eyes. Quite frankly, I have no need to know how my iPhone works. It would only be curiosity that would lead me to find out.

Now, I’m going to sound crazy. Because within the complexity I have ignored, there’s a simplicity. Deep in the complex circuitry and software of my iPhone is a design that has converged on the minimum needed to perform its functions. If that were not so, then this handheld device would likely be the size of a house.

Ockham’s Razor[2] is a principle of simplicity. It asks us to believe that the simplest theory is more likely to be the true one. It’s like saying nature is lazy. It will not make its inner workings more complex than they need to be, even when those inner workings can appear complex.

I remember one of my teachers saying that mathematicians are inherently lazy. What he meant was that they are always seeking the simplest way of explaining something. If there are two ways of getting from A to B why take the long one?

The popular expression of Ockham’s Razor is: “Entities should not be multiplied beyond necessity.”

Ockham did not invent the principle of simplicity, but his name is ever associated with it. He pushed the boundaries of thinking. Not bad for a 14th-century English philosopher. 


[1] https://www.blackswanockham.com/

[2] https://iep.utm.edu/ockham/

Safety in numbers. Part 4

In the last three parts, we have covered just two basic types of failures that can be encountered in any flight: those that affect single systems and their subsystems, and those that impact a whole aircraft as a common effect.

The single failure cases were considered assuming that failures were independent. That is something fails but the effects are contained within one system.

There’s a whole range of other failures where dependencies exist between different systems as they fail. We did mention the relationship between a fuel system and a propulsion system. Their coexistence is obvious. What we need to do is to go beyond the obvious and look for relationships that can be characterised and studied.

At the top of my list is a condition where a cascade of failures ripples through aviation systems. This is when a trigger event starts a set of interconnected responses. Videos of falling dominoes pepper social media and there’s something satisfying about watching them fall one by one.

Cascade failures in aircraft systems can start with a relatively minor event. When one failure has the potential to precipitate another, it’s important to understand the nature of the dependency, which can be hardwired into systems, procedures, or training.

It’s as well to note that a cascade, or avalanche breakdown, may not be as straightforward as it is with a line of carefully arranged dominoes. The classical linear way of representing causal chains is useful. The limitation is that dominant, or hidden, interdependencies can exist with multiple potential paths and different sequences of activation, as the sketch below illustrates.
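As a toy illustration, here is a short Python sketch of a cascade through a small dependency graph. The system names, the dependencies, and the pessimistic rule that a system fails if any of its dependencies has failed are all invented for the example; no real aircraft architecture is implied.

```python
# Toy cascade: one trigger failure propagates through a dependency
# graph rather than a single linear chain. Entirely illustrative.

depends_on = {
    "generator_2": ["generator_1"],       # assumed dependency
    "bus_B":       ["generator_2"],
    "display":     ["bus_B"],
    "autopilot":   ["bus_B", "display"],  # reachable by two paths
}

def cascade(trigger):
    """Return every system lost, starting from a single trigger."""
    failed = {trigger}
    changed = True
    while changed:
        changed = False
        for system, deps in depends_on.items():
            # Pessimistic rule for the sketch: failure of ANY
            # dependency takes the system down.
            if system not in failed and any(d in failed for d in deps):
                failed.add(system)
                changed = True
    return failed

print(sorted(cascade("generator_1")))
# One minor trigger, several downstream losses, via multiple paths.
```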

The next category of failure is a variation on the common-mode theme. This has more to do with the physical positions of systems and equipment on an aircraft. For example, a localised fire, flood, or explosion can defeat built-in redundancies or hardened components.

Earlier we mentioned particular risks. Now, we need to add to the list: bird strike, rotor burst, tyre burst and battery fires. The physical segregation of sub-systems can help address this problem.

Yes, probabilistic methods can be used to calculate the likelihood of these failure conditions occurring.

The next category is more a feature of failure than a type of failure. Everything we have talked about, so far, may be evident at the moment of occurrence. There can then be opportunities to take mitigating actions to overcome the impact of failure.

What about those aircraft system failures that are dormant? That is, they remain passive and undetected until the moment when system activation is needed or there’s demand for a back-up. One example could be just that: an emergency back-up battery that has discharged. It’s then unavailable when it’s needed the most. Design strategies like pre-flight checks, built-in test and continuous monitoring can overcome some of these conditions.
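One way to see why check intervals matter is a small sketch. Assuming random failures at a constant rate (the rate below is a made-up number, not from any standard), the chance that a dormant back-up is already dead when demanded grows with the time since it was last verified.

```python
# Sketch: probability a dormant back-up has failed, unrevealed,
# before the moment of demand. The failure rate is illustrative.

import math

FAILURE_RATE = 1e-4  # assumed failures per hour, for the example

def p_failed_on_demand(hours_since_last_check):
    """Exponential model: P(failed) = 1 - exp(-rate * t)."""
    return 1 - math.exp(-FAILURE_RATE * hours_since_last_check)

for t in (10, 100, 1000):
    print(f"last checked {t:4d} h ago: P(unavailable) = "
          f"{p_failed_on_demand(t):.4f}")
# The longer an item sits unchecked, the more likely the back-up is
# already gone; hence pre-flight checks and built-in test.
```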

Safety in numbers. Part 3

The wind blows, the sun shines, a storm brews, and rain falls. Weather is the ultimate everyday talking point. Stand at a bus stop, start a conversation and it’ll likely be about the weather. Snow, sleet, ice or hail: the atmosphere can be hostile to our best-laid plans. It’s important to us because it affects us all. It has a common effect.

We started a discussion of common-mode failures in earlier paragraphs. We’ll follow it up here. Aircraft systems employ an array of strategies to address combinations and permutations of failure conditions. That said, we should not forget that these can be swamped by common-mode effects.

Environmental effects are at the top of the list of effects to consider. It’s a basic part of flying that the atmosphere changes with altitude. So, aircraft systems and equipment that work well on the ground may have vulnerabilities when exposed to large variations in temperatures, atmospheric pressure, and humidity.

Then there’s a series of effects that are inherent with rotating machinery and moving components. Vibration, shock impacts and heat all need to be addressed in design and testing.

It is possible to apply statistical methods to calculate levels of typical exposure to environmental effects, but it is more often the case that conservative limits are set as design targets.

Then there are particular risks. These are threats that maybe don’t happen every day but have the potential to be destructive and overcome design safety strategies. Electromagnetic interference and atmospheric disturbances, like lightning and electrostatic discharge, can be dramatic. The defences against these phenomena can be to protect systems and limit impacts. Additionally, the separation or segregation of parts of systems can take advantage of any built-in redundancies.

Some common-mode effects can occur due to operational failures. The classic case is that of running out of fuel or electrical power. This is where there’s a role for dedicated back-up systems. It could be a hydraulic accumulator, a back-up battery, or a drop-out ram air turbine, for example.

Some common-mode effects are reversible and tolerable in that they don’t destroy systems and equipment but do produce forms of performance degradation. We get into the habit of talking about failures as if they are absolute, almost digital, but it’s an analogue world. There’s a range of cases where adjustments to operations can mitigate effects on aircraft performance. In fact, an aircraft’s operational envelope can be adjusted to ensure that it remains in a zone where safe flight and landing are possible, however much systems are degraded.

Probabilities can play a role in such considerations. Getting reliable data on which to base sound conclusions is often the biggest challenge. Focusing on maintaining a controllable aircraft with a minimum of propulsion, in the face of multiple hazards takes a lot of clear thought.

Safety in numbers. Part 1

It’s a common misconception that the more you have of something the better it is. Well, I say misconception, but in simple cases it’s not a misconception. For safety’s sake, it’s common to have more than one of something. In a classic everyday aircraft that might be two engines, two flight controls, two electrical generators, two pilots, and so on.

It seems the most common-sense of common-sense conclusions. That if one thing fails or doesn’t do what it should we have another one to replace it. It’s not always the case that both things work together, all the time, and when one goes the other does the whole job. That’s because, like two aircraft engines, the normal situation is both working together in parallel. There are other situations where a system can be carrying the full load and another one is sitting there keeping an eye on what’s happening ready to take over, if needed.

This week, as with many weeks, thinkers and politicians have been saying we need more people with a STEM education (Science, Technology, Engineering, and Math). Often this seems common-sense and little questioned. However, it’s not always clear that people mean the same things when talking about STEM. Most particularly it’s not always clear what they consider to be Math.

To misquote the famous author H. G. Wells: statistical thinking may, one day, be as necessary as the ability to read and write. His full quote was a bit more impenetrable, but the overall meaning is captured in my shortened version.

To understand how a combination of things works together, or not, some statistical thinking is certainly needed. The maths associated with probabilities can scare people off, so ways to keep our reasoning simple do help.

The sums for dual aircraft systems are not so difficult. That is, provided we know that the something we are talking about is reliable in the first place. If it’s not reliable, then the story is a different one. For the sake of argument, and considering practical reality, let’s say that the thing we are talking about only fails once every 1000 hours.

What’s that in human terms? It’s a lot less than a year’s worth of daylight hours, that being roughly half of 24 hours x 7 days x 52 weeks (8736 hours), or 4368 hours (putting aside location and leap years). In a year, in good health, our bodies operate continuously for that time. For the engineered systems under discussion, that may not be the case. We switch them on, and we switch them off, possibly many times in a year.

That’s why we need to consider the amount of time something is exposed to the possibility of failure. We can now use the word “probability” instead of possibility. Chance and likelihood work too. When numerically expressed, probabilities range from 0 to 1. That is, zero when something will never happen and one when something will always happen.

So, let’s think about any one hour of operation of an engineered system, and use the reliability number from our simple argument. We can liken that, making an assumption, to a probability number of P = 1/1000, or 1 x 10⁻³, per hour. That gives us a round number that represents the likelihood of failure in any one hour of operation of one system.
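Since exposure time matters, a quick sketch (my own illustration, using the numbers above) shows how that per-hour figure accumulates over longer operating periods, and where the simple product P x hours stops being a good approximation.

```python
# Sketch: from a per-hour failure probability to the chance of at
# least one failure over a longer exposure. P = 1/1000 per hour.

P = 1e-3

def p_fail_within(hours):
    """Probability of at least one failure in the exposure time."""
    return 1 - (1 - P) ** hours

for t in (1, 10, 4368):  # one hour, a long day, a year of daylight
    print(f"{t:4d} h exposure: exact = {p_fail_within(t):.4f}, "
          f"simple P*t = {P * t:.4f}")
# For small P*t the two agree; over 4368 hours the simple product
# exceeds 1 and the exact formula is the one to trust.
```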

Now, back to the start. We have two systems. Maybe two engines. That is two systems that can work independently of each other. It’s true that there are some cases where they may not work independently of each other but let’s park those cases for the moment.

As soon as we have more than one thing we need to talk of combinations. Here the simple question is how many combinations exist for two working systems?

Let’s give them the names A and B. In our simplified world either A or B can work, or not work when needed to work. That’s failed or not failed, said another way. There are normally four combinations that can exist. Displayed in a table this looks like:

A ok      B ok
A fails   B ok
A ok      B fails
A fails   B fails

Table 1

This is all binary. We are not considering any near failure, or other anomalous behaviour that can happen in the real world. We are not considering any operator intervention that switches on or switches off our system. We are looking at the probability of a failure happening in a period of operation of both systems together.

Now, let’s say that the systems A and B each have a known probability of failure.

Thus, the last line of the table becomes: P4 = PA and PB

That is, in any given hour of operation, the chances of both A and B failing together are the product of their probabilities, assuming the failures to be random and independent.

Calculating the last line of the table becomes: P4 = PA x PB. With our numbers, that’s 1/1000 x 1/1000 = 1 x 10⁻⁶, or one in a million per hour.

In the first line of the table, we have the case of perfection. Simultaneous operation is not interrupted, even though we know both A and B have a likelihood of failure in any one hour of operation.

Thus, the first line becomes: P1 = (1 – PA) x (1 – PB)

Which nicely approximates to P1 = 1, given that 1/1000 is tiny by comparison.

The cases where either A or B fails are in the middle of the table.

P2 = PA x (1 – PB) together with P3 = (1 – PA) x PB

Thus, using the same logic as above, the probability of A or B failing is approximately PA + PB. (Strictly it is PA + PB – 2 x PA x PB, but the product term is tiny by comparison.)

It gets even better if we consider the two systems to be identical. Namely, that probabilities PA and PB are equal; call each of them P.

A double failure occurs at probability P²

A single failure occurs at probability 2P

So, with two systems operating in parallel, there’s a decreased likelihood of a double failure but an increased likelihood of a single failure. This can be taken beyond an arrangement with two systems. For an arrangement with four systems, there’s a massively decreased likelihood of a total failure (P⁴) but four times the likelihood of a single failure (roughly 4P). Hence my remark at the beginning.
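A short sketch (just the binomial arithmetic, using the P = 1/1000 from earlier) makes the trade-off visible for two and four parallel systems.

```python
# Sketch: chance of a single failure vs total failure for n identical,
# independent systems in parallel, with P = 1/1000 per hour as above.

from math import comb

P = 1e-3

def p_exactly_k_fail(n, k):
    """Binomial: probability that exactly k of n systems fail."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

for n in (2, 4):
    print(f"n = {n} systems:")
    print(f"  exactly one failure: {p_exactly_k_fail(n, 1):.2e} (~{n}P)")
    print(f"  total failure:       {P**n:.2e} (P^{n})")
# Doubling up slashes the chance of losing everything, but the chance
# of having to handle *a* failure rises with every system added.
```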

[Please let me know if this is in error or there’s a better way of saying it]