1st law of OT Cyber Security Risk

Sinclair IEC/ISA 62443

In my article The anatomy of plant design and the OT cyber-attack – Part 1, I challenged readers by introducing a new cyber security law that I called the “1st law of OT Cyber Security Risk”. A small LinkedIn poll on the same topic showed about 80% support for the proposed law, although some respondents expressed reservations.

In this article, I want to discuss the topic in more detail and explain why I believe OT cybersecurity should comply with the 1st law at all times. I also want to talk about a number of bottlenecks that make it difficult to meet the requirement of the 1st law. Before I start the discussion, let’s first show the 1st law again:

[Figure: the 1st law of OT Cyber Security Risk]

So, the 1st law essentially says that process safety risk sets the plant’s risk tolerance with regard to loss, independent of the cause of this loss. Therefore, OT cyber security risk should not exceed this risk tolerance! But the questions are: is the text of the 1st law accurate, is it possible to comply with it, and if so, how?

Process safety risk results from stochastic faults (errors and mistakes that are random occurrences) and systematic faults (also errors and mistakes, but not statistically predictable, so deterministic occurrences).

Cyber-attacks can force the same types of faults, with the same consequences. But they can also cause issues that neither HAZOP nor LOPA analysis addresses.

Figure 1 – Process accident causes

Random/stochastic failures (e.g. electronic equipment failures) are made predictable by observing their mean time between failures (MTBF). There is an abundance of statistical data available in process safety to estimate event frequencies for a specific device or role. Stochastic failures happen very often within the industry, so the predictions become very reliable.

This statistical information on the failures is used by the layers of protection analysis (LOPA) method, a semi-quantitative risk assessment technique, to determine what safety measures are required. The LOPA method estimates a mitigated event frequency (MEF) that is compared with a target mitigated event likelihood (TMEL), the maximum event frequency allowed for a specific impact level. This comparison verifies whether the right safeguards are in place to meet the plant’s safety risk criteria. These are the same criteria that the 1st OT cyber security law suggests adopting for the cyber security risk of the process installation.
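The MEF versus TMEL comparison can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual LOPA convention that the mitigated frequency is the initiating event frequency multiplied by the probability of failure on demand (PFD) of each independent protection layer. The function name and all numbers are hypothetical and are not taken from a real study.

```python
from math import prod

def mitigated_event_frequency(ief_per_year, layer_pfds):
    """MEF = initiating event frequency x product of the layer PFDs."""
    return ief_per_year * prod(layer_pfds)

# Hypothetical scenario values, for illustration only.
ief = 0.1                  # initiating event: once every 10 years
layer_pfds = [0.1, 0.001]  # operator response to alarm (factor 10), SIF (factor 1000)
tmel = 1e-4                # maximum tolerated frequency per year for this impact level

mef = mitigated_event_frequency(ief, layer_pfds)
meets_target = mef <= tmel
print(f"MEF = {mef:.1e}/yr, TMEL = {tmel:.1e}/yr, target met: {meets_target}")
```

If the MEF came out above the TMEL, the analysis would call for an additional or stronger safeguard.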

LOPA specifies a risk reduction factor for each type of process safety safeguard. For example, an operator response to a process alarm is considered to reduce the event frequency by a factor of 10. For programmable/electronic safeguards, the risk reduction factor is estimated using the MTBF and the test frequency of all the components that make up the safeguard or, more precisely in this case, the safety instrumented function (SIF). These programmable/electronic safeguards are potential targets of a cyber-attack, which increases their probability of failure by a factor related to the cyber resilience of the installation.
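To make the relation between MTBF, test interval, and risk reduction factor concrete, here is a rough sketch. It assumes the common simplified approximation for a single-channel (1oo1) function, PFDavg ≈ λ × TI / 2, with λ taken as 1/MTBF for dangerous undetected failures. The cyber adjustment at the end is purely an illustrative assumption, not a formula from LOPA: it treats the chance that an attack has disabled the SIF at the moment of demand as an independent probability.

```python
def pfd_avg_1oo1(mtbf_years, proof_test_interval_years):
    """Simplified average probability of failure on demand for a 1oo1 function."""
    failure_rate = 1.0 / mtbf_years  # dangerous undetected failures per year (assumed)
    return failure_rate * proof_test_interval_years / 2.0

# Hypothetical SIF: MTBF of 100 years, proof tested once a year.
pfd = pfd_avg_1oo1(mtbf_years=100.0, proof_test_interval_years=1.0)
rrf = 1.0 / pfd  # risk reduction factor credited to the SIF

# Illustrative assumption: probability that a cyber-attack has disabled
# the SIF when the process demand occurs.
p_cyber_disable = 0.01
pfd_with_cyber = 1.0 - (1.0 - pfd) * (1.0 - p_cyber_disable)
rrf_with_cyber = 1.0 / pfd_with_cyber

print(f"RRF without cyber effect: {rrf:.0f}")
print(f"RRF with cyber effect:    {rrf_with_cyber:.0f}")
```

Even a small probability of a successful attack can remove a large part of the risk reduction credited to a digital safeguard.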

In high-impact environments, such as nuclear plants or high-risk chemical plants, LOPA is not accurate enough to justify this type of decision, so quantitative risk methods are used to estimate process safety risk. But for refineries and the average chemical plant, the LOPA technique has been used for at least two decades and has proved its value in reducing the number of process safety incidents. It is a repeatable, structured, and semi-quantitative method to determine what measures are required to reduce the risk.

But today we have a third source of failure, the cyber-attack, a type of failure that is much harder to predict, if it can be predicted at all. A cyber-attack can force both stochastic and systematic failures, and it can cause failures that process safety has not analyzed at all. Failures from a cyber-attack are in principle deterministic and, if we break down the analysis into smaller parts, also predictable to some extent. But we have to be careful how we do this, as there is almost no statistical data on cyber-attacks available. We need another method, which I will discuss later.

So three sources of interference can cause failures and malfunctions of the process installation with potentially very serious consequences. These faults and malfunctions constitute what we call hazards (the potential for harm) for the production installation, process, and personnel. Examples of process hazards are loss of containment, thermal runaway reactions, chemical reactions, explosions, etc.

We can estimate the probability that this harm arises – the risk – and based upon the severity of this harm, we can specify criteria that set the risk tolerance. Risk tolerance is typically expressed as a maximum event frequency called the target mitigated event likelihood (TMEL).

The TMEL determines the upper limit for the risk, independent of its source. So the 1st OT cyber security law in essence states that accidents arising from a cyber-attack may not occur more frequently than the TMEL we have set for process safety. If the TMEL for a thermal runaway reaction with a potential for multiple fatalities is set to once every 10,000 years, a runaway reaction resulting from an equipment failure caused by a cyber-attack must not occur more frequently than that.

So as a result of the 1st law, OT cybersecurity risk criteria are aligned with the process safety risk criteria. This means that the cyber incident scenario that combines the worst-case impact with a high probability (loss event frequency) determines the required cybersecurity risk reduction.

Because the target of the cyber-attack is a process automation function or component, the cyber security risk reduction also determines the level of risk in the security zone and, as such, the target security level for the zone.

However, there are some issues I need to address:

  • Process safety risk reduction is not realized exclusively with programmable/electronic safeguards; physical safeguards are used as well. So risk reduction can be realized partially with hackable digital safeguards and partially with non-digital controls that are unaffected by a cyber-attack. For example, the TMEL settings for a nuclear plant can be as challenging as event rates of 10E-07 or 10E-08 per annum because of the high societal risk. A WHO report estimated that one nuclear accident impacted approximately 600,000 people, of whom over 4,000 died from the radiation. With such a high potential impact, risk tolerance is very low. Cyber security resilience cannot offer that level of risk reduction for programmable/electronic components. Neither can process safety rely exclusively on digital controls, so a combination of non-digital controls and possibly SIL 4 safety controllers would be needed to meet the risk tolerance limit. So the requirement to stay below the “process safety risk” in the original 1st law needs to be revised, for example to:
    [Figure: revised 1st law of OT Cyber Security Risk]
    If we formulate the 1st law in this way, we define an upper bound for the OT cyber security risk that depends on the risk reduction realized with the functional safety controls. This upper bound can be compared with the upper bound set for process safety because LOPA provides us with this information for the loss scenarios.
  • Another, more fundamental issue is: “Is there a quantitative method available that uses sufficient and complete statistics to estimate an event frequency that we can use to compare the event probability of the cyber-attack to the TMEL?” Formulated in this way, the answer is no. However, if I look not at the event frequency of the cyber-attack but at the event frequency of a successful cyber-attack, then different sources of relevant statistical information become available. How we can do this is the last topic in the article.
  • When we talk about cyber security risk, the formula Risk = Threat x Vulnerability x Impact is often used. This formula cannot be used in a quantitative risk assessment; we need to change it to a formula that is also used for determining quantitative terrorism risk.
Figure 2 – Risk formula

The key element here is the conditional probability P(S|A), which is the probability that the threat actor is successful if he/she attacks. This conditional probability partially represents the cyber resilience of the process automation system. This cyber resilience is what we need to quantify if we want to estimate an event frequency/likelihood that a specific cyber-attack executed by a specific threat actor causing a specific process or business loss scenario will succeed.

The probability that the threat actor will conduct a cyber-attack, P(A), is separated from the system’s cyber resilience in this model, which allows the use of a different source of statistical data. The P(A) value is based on a qualitative method that takes several factors into account, among which are historical and regional factors. It provides an offset for the overall likelihood, similar to the approach used in methods that determine terrorism threat levels. But for estimating the risk reduction provided by our security measures, we can focus on the P(S|A) component.
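The separation of the two probabilities can be illustrated with a small sketch. It assumes the commonly cited form of the terrorism risk formula, Risk = P(A) × P(S|A) × C, where C is the consequence. Whether this matches Figure 2 term by term is an assumption, and the function name and values below are hypothetical.

```python
def scenario_risk(p_attack, p_success_given_attack, consequence):
    """Risk = P(A) x P(S|A) x C for one attack scenario (assumed form)."""
    return p_attack * p_success_given_attack * consequence

# Hypothetical values for one scenario.
p_attack = 0.2           # P(A): set qualitatively from historical and regional factors
consequence = 1_000_000  # C: loss in an arbitrary monetary unit

# Security measures act only on P(S|A); P(A) stays the same.
risk_before = scenario_risk(p_attack, 0.5, consequence)
risk_after = scenario_risk(p_attack, 0.05, consequence)

print(f"risk before measures: {risk_before:,.0f}")
print(f"risk after measures:  {risk_after:,.0f}")
```

Because P(A) acts only as an offset, the effect of a security measure shows up entirely in the P(S|A) term.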

The probability that a threat actor succeeds is based on a number of factors: the capability of the threat actor, the opportunity, and the exposure to a specific threat action (tactic, technique, procedure: TTP). The method assigns a specific threat action (TTP) an initiating event frequency (IEF) and amends this IEF for factors related to the system design (static exposure), factors related to the security management (dynamic exposure), factors related to the capability and opportunity of the threat actor, and the risk reduction the security measures provide.
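As a rough sketch of how such an amendment could work, the snippet below scales a TTP's IEF by multiplicative factors. The multiplicative form, the factor names, and every number are assumptions made for illustration. The article does not give the actual rules of the method.

```python
from dataclasses import dataclass

@dataclass
class ThreatAction:
    name: str
    ief_per_year: float  # initiating event frequency assigned to the TTP

def amended_frequency(action, static_exposure, dynamic_exposure, actor_factor):
    """Scale the IEF by exposure and threat actor factors (each between 0 and 1)."""
    return action.ief_per_year * static_exposure * dynamic_exposure * actor_factor

# Hypothetical threat action and factors.
action = ThreatAction(name="exploit of a remote access service", ief_per_year=1.0)
frequency = amended_frequency(
    action,
    static_exposure=0.5,   # system design: how reachable the target is
    dynamic_exposure=0.5,  # security management, such as patching discipline
    actor_factor=0.1,      # capability and opportunity of the threat actor profile
)
print(f"amended frequency before security measures: {frequency:.3f}/yr")
```

The risk reduction of the security measures would then be applied to this amended frequency.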

Statistical data about the possible use of a specific TTP is available from the vulnerability databases; the way in which a specific vulnerability can be exploited is the TTP. Some TTPs are more common than others, so threat actions based on common TTPs should have a higher IEF than threat actions based on TTPs that are less often available to the threat actor. Of course, we also need to consider which threat actor profile can successfully execute the TTP, since this requires both capability and opportunity.

The risk reduction of the security measures is a function of their reliability and effectiveness. This can be modeled too, using different factors to account for preventative and detective controls, together with a set of rules. The result is an OT cyber security risk estimation model, very similar to LOPA, for estimating the event frequency with which a specific cyber-attack scenario will succeed. If we link the cyber-attack scenario that causes the potential functional deviations of the targeted system element with the process or business loss scenario, we have the event frequency of the cyber-attack matched with the impact on the business or process installation. So we have linked technical cyber security risk with loss scenarios to obtain loss-based risk, which allows us to justify cyber security decisions.
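A LOPA-like calculation of this kind might look as follows. This is a sketch under stated assumptions: each control is treated as an independent layer, and its probability of failing is taken as 1 minus reliability times effectiveness. The control names and all values are hypothetical, and the real method applies additional rules that are not described here.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class SecurityControl:
    name: str
    reliability: float    # probability the control is operating when needed
    effectiveness: float  # probability an operating control stops the threat action

    def failure_probability(self):
        return 1.0 - self.reliability * self.effectiveness

def successful_attack_frequency(attack_frequency, controls):
    """Frequency of success after all (assumed independent) controls."""
    return attack_frequency * prod(c.failure_probability() for c in controls)

# Hypothetical preventative and detective controls.
controls = [
    SecurityControl("network segmentation (preventative)", 0.99, 0.90),
    SecurityControl("intrusion detection with response (detective)", 0.95, 0.80),
]
frequency = successful_attack_frequency(0.025, controls)
tmel = 1e-4  # TMEL of the linked loss scenario, per year
meets_target = frequency <= tmel
print(f"successful attack frequency = {frequency:.2e}/yr, target met: {meets_target}")
```

In this made-up case the frequency stays above the TMEL, which would justify additional security measures for the loss scenario.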

Is this method new? For those not involved in building greenfield process installations it might be, but for those challenged to meet risk criteria set by governments it is not.

Governments set limits for individual risk, societal risk, and environmental risk that must be met to obtain a plant license. These limits differ per country, just as the rules around risk reduction (ALARP or ALARA) differ.

There are some exceptions, geographical areas where this doesn’t play a role, for example, the US where regulations are formulated differently and differ by state. But in Europe, Asia, the Middle East, and South America governments published F-N diagrams to set these limits and the industry needs to meet them in those regions.

Because the risk reduction in process safety depends on OT security, OT security also needs to ascertain that these levels are reached by providing sufficient protection for the process automation system. Not a popular requirement, but nevertheless a necessity for a safe production process.

The method has been practiced for many years but is not identified or described by IEC 62443-3-2, which addresses OT cyber security risk exclusively as a qualitative process. However, a qualitative process doesn’t provide the event frequencies we need to justify that the security measures are sufficient to protect the process automation system, including the safety instrumented system, and to meet the overall requirements set for process safety risk.

Therefore the 1st OT cyber security law explicitly states this requirement.
