Resilience Engineering

Erik Hollnagel

Ph.D., Professor, Professor Emeritus

 

Origins

‘Resilience’ is a term that has been used for a long time and in several different ways. It was first used to describe a property of timber, and to explain why some types of wood were able to accommodate sudden and severe loads without breaking. Almost four decades later, a report to the Admiralty referred to a measure called the modulus of resilience as a means of assessing the ability of materials to withstand severe conditions.

Many years later, Holling (1973) referred to the resilience of an ecosystem as the measure of its ability to absorb changes and still exist. He further contrasted resilience with stability, defined as the ability of a system to return to its equilibrium state after a temporary disturbance, but also argued that resilience and stability were two important properties of ecological systems. This later led to a distinction between engineering resilience and ecological resilience. Engineering resilience considers ecological systems to exist close to a stable steady-state. Resilience is here the ability to return to the steady-state following a perturbation. Ecological resilience emphasizes conditions far from any stable steady-state, where instabilities can flip a system from one regime of behaviour into another. Resilience is here the system’s ability to absorb disturbances before it changes the variables and processes that control behaviour.

In the early 1970s, the term ‘resilience’ began to be used as a synonym for stress resistance in psychological studies of children. It soon became a frequently used term in psychology, and was many years later, in 2007, defined as: ‘The capacity to withstand traumatic situations and the ability to use a trauma as the start of something new’. At the beginning of the 21st century, it was picked up by the business community and used to describe the ability to dynamically reinvent business models and strategies as circumstances change.

The Resilience Engineering understanding of 'resilience'

As the above brief history shows, the thinking about resilience has typically referred to a dichotomy of sorts: on the one hand materials, systems, or situations where resilience was absent and where adverse outcomes therefore might happen, and on the other hand materials, systems, or situations where resilience was present and where adverse outcomes could be avoided. This was also the case in the early 2000s when resilience engineering was proposed as an alternative (or as a complement) to the conventional view of safety. This led to early discussions about resilience versus robustness, resilience versus brittleness, etc.

But resilience (or more accurately, the ability to perform in a resilient manner – although this is too long to write every time) is not about avoiding failures and breakdowns, i.e., it is not just the opposite of a lack of safety. When it was said, in 'Resilience Engineering: Concepts and Precepts', that ‘failure is the flip side of success’, the intention was not to propose a binary universe, but rather to point out that things that go wrong happen in (more or less) the same way as things that go right. (This has later been elaborated in 'The ETTO principle' and in 'Safety-I and Safety-II'.) This is by no means the so-called ‘new view’ – which by the way was not new at all even when it was touted as such – but rather the realisation that humans always try to do what they think is right in the situation. (Remember Mach’s dictum: “Erkenntnis und Irrtum fließen aus denselben psychischen Quellen; nur der Erfolg vermag beide zu scheiden.” That is: knowledge and error flow from the same mental sources; only success can tell the one from the other.)

The focus of resilience engineering is thus resilient performance, rather than resilience as a property (or quality) or resilience in an 'X versus Y' dichotomy. This can be seen in how the definition of resilience has changed over the years.

In the first book (Resilience Engineering: Concepts and Precepts, 2006) the following definition was given: "The essence of resilience is therefore the intrinsic ability of an organisation (system) to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress."

This definition reflects the historical context by its juxtaposition of two states - one of stable functioning and one where the system has broken down. The definition is also limited to situations of threat, risk or stress.

In the fourth book (Resilience Engineering in Practice, 2010) - depending on how one counts - the definition reads as follows: "The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions."

In this definition the emphasis on risks and threats has been reduced, and the reference is instead to 'expected and unexpected conditions'. The focus has also changed from 'maintaining or regaining a dynamically stable state' to the ability to 'sustain required operations'. The logical continuation of these developments is a definition like the following (not yet documented in a book):

A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.

The change in the definitions has been to broaden the scope of resilient performance. It is not just to be able to recover from threats and stresses, but rather to be able to perform as needed under a variety of conditions – AND TO RESPOND APPROPRIATELY TO BOTH DISTURBANCES AND OPPORTUNITIES.

The emphasis on opportunities is important for the change from protective safety to productive safety – and ultimately for the dissociation of resilience from safety, thereby leaving the sterile discussions and the stereotypes of the past behind. Resilience is about how systems perform, not just about how they remain safe. (And even here Safety-II would mean something quite different from Safety-I, but even Safety-II is just another step on the road ahead.) A system that is unable to make use of opportunities is not in a much better position than a system that cannot respond to threats and disturbances – at least not in the long term.

The above definition is probably not the last and final one. Although resilience engineering started as a contrast to conventional safety thinking (Safety-I), it should become something in its own right. Resilience engineering must free itself from the frame of reference that might have been of some value ten years ago (yet even that is doubtful), but which surely will impede any further development. Resilience engineering is about the characteristics of resilient performance per se, how we can recognise it, how we can assess (or measure) it, how we can improve it. The discussions should therefore focus on what resilience (or rather, resilient performance) IS, rather than on what it IS NOT.

The Management of 'resilience'

Resilience engineering looks at how the organisation functions as a whole. The four basic abilities (see Resilience Analysis Grid) are a natural starting point for understanding how an organisation functions: how it responds, how it monitors, how it learns, and how it anticipates. But the four abilities must be seen together rather than one by one. The top-down perspective that the integration of the four abilities provides also suggests a useful way to distinguish between four types of systems or organisations, which here are simply called systems of the first, the second, the third and the fourth kind.

Systems of the First Kind

It is a sine qua non for any system that it can react appropriately when something happens, not least if it is something unexpected. Failure to do so will sooner or later lead to the 'death' of the system. Systems that can react appropriately and therefore sustain their existence, are called systems of the first kind.

Reacting when something happens requires the abilities to monitor and to respond. Monitoring is needed to determine that the situation is such that some kind of reaction or intervention is required. And responding is necessary to implement the reaction. The two must necessarily go together. A system that passively reacts whenever something happens – whenever a situation passes a certain threshold – will by definition always be surprised and therefore always reactive. This may work as long as the frequency of events is so low that a response can be completed before a new one is required. Unfortunately, only a few systems in today’s world can enjoy that luxury. While systems of the first kind may survive, at least for a time, they are not really resilient.
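The limitation of a purely reactive system can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the text above: time is divided into discrete steps, an event is any reading above a fixed threshold, and a response keeps the system busy for a fixed number of steps. The function name and its parameters are hypothetical and chosen for illustration.

```python
def run_reactive_system(readings, threshold, response_duration):
    """Simulate a purely reactive system over discrete time steps.

    Returns the time steps at which a reading called for a response
    while the system was still busy with an earlier response.
    (Illustrative assumption: one response at a time, fixed duration.)
    """
    busy_until = 0  # first time step at which the system is free again
    missed = []
    for t, value in enumerate(readings):
        if value <= threshold:
            continue  # monitoring: nothing requires a reaction
        if t < busy_until:
            missed.append(t)  # still occupied by the previous response
        else:
            busy_until = t + response_duration  # responding: start a new response
    return missed


# Events far apart: every response finishes before the next one is needed.
sparse = run_reactive_system([0, 5, 0, 0, 5, 0], threshold=3, response_duration=2)

# Events close together: some arrive while the system is still responding.
dense = run_reactive_system([0, 5, 5, 5, 0, 5], threshold=3, response_duration=3)
```

With the sparse readings no event is missed, whereas with the dense readings the events at steps 2 and 3 arrive while the first response is still under way. The sketch only shows the timing argument; it says nothing about how a real organisation monitors or responds.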

Systems of the Second Kind

While the ability to respond is fundamental, it is also necessary to be able to modify the responses based on experience. Systems of the second kind are those that can manage something not only when it happens but also after it has happened. This means that the system can learn from what has happened, and can use this learning to adjust both how it monitors – what it looks for – and how it responds – what it does.

Learning is necessary in order to be sure that the system monitors the proper signs, signals, and symbols. Learning is also necessary to adjust the responses both in terms of what they are and in terms of when they are given (and for how long, etc.). Both types of learning are necessary if the context or environment in which the system works and in which it must operate keeps changing. The more stable the context or working environment is, the less the need to change monitoring and responding, and therefore the less the need to learn. But no environment is perfectly stable, even though systems sometimes seem to find excuses for not learning.

Systems of the Third Kind

One effect of monitoring is that a system may detect a developing situation in time and therefore be able to respond before it has become too serious (or before it has become too late). The dilemma of such management by exception is that an early intervention may seem appealing but be difficult to justify, while a late intervention is more difficult because the situation will have deteriorated further and become more complicated and possibly also more costly to control.

Systems of the third kind are those that can manage something before it happens, by analysing the developments in the world around them and preparing themselves as well as possible. A typical example is an organisation that tries to anticipate changes in customer needs, or in regulations (carbon emissions, for instance), or a growing number of refugees, and tries to be ready for them. Responding before something happens requires the ability to anticipate and/or predict. Anticipation in turn requires indicators that can be used to make the predictions, i.e., indicators which somehow are correlated with future events (so-called leading indicators).

Anticipation makes it necessary to take a risk. Acting before there is something to respond to introduces the risk that the prediction or anticipation may have been wrong and that the effort therefore was wasted. When you respond after something has happened, this uncertainty does not exist. On the other hand, waiting too long may require a larger response, and hence be cost-ineffective in its own right.

Systems of the Fourth Kind

Systems of the third kind are able to respond, monitor, learn, and anticipate and may therefore seem to meet all the criteria for being called resilient – and be able to manage their resilience. Yet it is possible to become better still by considering not only what happens between the system and its environment (business, operation, etc.), but also what happens in the system organisation itself. In systems of the fourth kind the anticipation includes the system itself – not only in the sense of monitoring itself or learning about itself, but considering how the world responds or changes when the system makes changes, how these responses may affect the changes, and so on. This is the recursive type of anticipation, and represents the pinnacle of resilience management.

References

Holling, C. S. (1973). Resilience and stability of ecological systems. Annual Review of Ecology and Systematics, 4, 1–23.