Terminology – OT Security Resilience and Robustness

Sinclair

The OT security community is known for deliberating at length over its terminology: the writing devoted to the acronym CIA alone would fill more than a book. So let me throw a new pebble in the pond and open a discussion on the use of the words resilience and robustness in OT security.
When we discuss the defensive capabilities of OT security, the term “resilience” is frequently employed, while “robustness” is used only on occasion. Although the two terms are interrelated, there are fundamental distinctions between them that are important to take into account. A security defense that is robust possesses different attributes from one that is resilient.

A graphic best explains the difference between the two.

Figure 1 – Robustness and resilience

Robustness refers to the ability of a system to perform its intended functions even under adverse conditions like a cyber attack. A robust system is one that can maintain its performance and functionality even when facing disruptions or unexpected events. Robustness involves the ability to prevent or mitigate the effects of disruptions, to maintain both functionality and performance.

Typical security investments in robustness are:

  • Hardening – reduce the attack surface, typically for an end node, so reducing the threat actors’ opportunity
  • Firewalls – reduce the attack surface by controlling access into the process automation network and potentially throttling excessive communication loads.
  • Patching – reduce the threat actors’ opportunities (the attack surface is not reduced!)
  • Anti-virus – isolate malware-infected files when they are written to disk
  • USB protection – prevent data transfer into the system, ideally combined with a check on malware because USB file transfer might be necessary.
  • Application control – prevent the installation and/or execution of non-authorized executables.

Robustness focuses on preventative security measures. However, the ideal security defense, shown as #1 in the figure above, doesn’t exist. Sooner or later a security breach (#2) takes place, at least if a threat actor is interested in the consequences of the attack. When such a breach happens, we have to talk about the resilience of our defense.

Where for robustness we can focus exclusively on the OT security defense of the process automation system, for resilience we need a more holistic approach to OT security. This is because the process automation system has a tight relationship with what is happening in the physical process installation.

Typical results of a cyber-attack for the process installation can be:

  • Production downtime leading to financial loss
  • Safety hazards like explosions, chemical runaway reactions, leaks of toxic gas, loss of primary containment of tanks/vessels, and fires. Many of these can result in injury or death of plant personnel, affect the public area, or cause environmental contamination.
  • Equipment damage – equipment might overheat, and vessels, pipelines, or tubing might rupture, leading to costly repairs or replacement or causing the above safety hazards.
  • Regulatory violations like environmental contamination or a violation of individual or societal risk criteria.
  • Supply chain disruptions, both within the plant as well as for suppliers and customers of the plant. This can result in delays, shortages, and increased costs.
  • Loss of intellectual property such as trade secrets, formulas, and processes. This can result in a competitive disadvantage, loss of revenue, and damage to the company’s reputation.


When we discuss resilience we typically focus on what happens after the initial security breach and how we can mitigate the impact of such breaches.

Typical security investments in resilience are:

  • Detection – we need to know if our security is breached so we can organize a rapid response to mitigate the impact of the breach.
  • Incident response – detection is fine but we need to organize a response when the intrusion happens.
  • Disaster recovery – when our defense fails and our response wasn’t in time, we need to recover. Depending on the resilience of our defensive measures and the production process this takes time.
  • Network segmentation – preventing an external threat actor from gaining direct access to the most essential functions of a process automation system, which buys time to detect and block a security breach in progress. This of course requires that the network segmentation is correctly implemented, so that a threat actor faces multiple barriers before its attack can progress.
  • Data recovery – assuring that adequate back-ups are available to restore the system components of the process automation system.


However, there are other factors to take into account. As shown in the figure above, for the disruptive event described under #3, the process automation system is able to respond quickly, and normal operations can be restored in a relatively short period of time. This is likely to result in a limited financial loss that can be managed. The resilience under #4 is lower because the recovery time (and probably the financial loss) is larger.
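
One way to put a number on the difference between #3 and #4 is to measure the area between the normal performance level and the actual performance over time: the larger the area, the less resilient the response. The sketch below does this with a simple trapezoidal sum. The function name `performance_loss`, the normalized performance scale (1.0 = normal operation), and all sample values are my own assumptions for illustration; they are not read from the figure.

```python
def performance_loss(samples, nominal=1.0):
    """Area between nominal and actual performance (trapezoidal rule).

    samples: list of (time_in_hours, performance) tuples, sorted by time.
    Returns the accumulated loss in "performance-hours".
    """
    loss = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        gap_start = nominal - p0
        gap_end = nominal - p1
        loss += (gap_start + gap_end) / 2 * (t1 - t0)
    return loss


# Hypothetical curves: same initial drop, different recovery times.
event_3 = [(0, 1.0), (1, 0.4), (5, 1.0)]    # back to normal after 5 hours
event_4 = [(0, 1.0), (1, 0.4), (25, 1.0)]   # back to normal after 25 hours

print("loss for #3:", round(performance_loss(event_3), 2))
print("loss for #4:", round(performance_loss(event_4), 2))
```

With these made-up numbers the loss for #4 is several times the loss for #3, even though the initial drop in performance is identical. The only difference is the recovery time.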

When we look at the resilience in example #4, we need to consider:

  • Detection and response time – I combine these two because an investment in a detection mechanism without an appropriate response capability is a wasted investment.
  • Recovery time – Just as detection and response need to be considered as a pair, response and recovery are also a pair. Containment and forensic analysis are fine, but these don’t restart the process. A quick recovery is essential to limit supply chain losses. Recovery time depends on multiple factors:
    • First, of course, the detection and response time: this should start the containment tasks and potentially the eradication tasks (if the threat actor has an active or passive presence in the system).
    • The time required for restoration varies depending on the installation. In cases where the process automation system is the main function that needs to be restored, such as in a paper mill or power grid, the production process can be brought back online relatively quickly. However, for a chemical plant or refinery, restoring the process automation system alone is unlikely to be sufficient. A more complex reconstitution process is required, which involves assessing the damage caused by the disruptive event, verifying and correcting setpoints, controller modes, control parameters, batch sequences, the pipeline network, and safeguards, and synchronizing them with the actual process state. Additionally, parts of the process may need to be tested to ensure proper functionality and process safety. The reconstitution process for a chemical plant, refinery, or gas treatment/transmission plant is more involved and time-consuming than for other installations, requiring a thorough and meticulous approach to ensure the safety of the production process before a restart.
    • Apart from reconstitution time we also need to consider the system start-up time and the long-term recovery time, the time it takes for the process to reach its normal performance levels.
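
The factors above can be added up into a rough estimate. The following is a minimal sketch, assuming the phases run one after another without overlap (in practice they may overlap). The class `RecoveryEstimate`, its field names, and all hour values are hypothetical. They only illustrate why a chemical plant takes longer to return to normal than an installation where restoring the process automation system is the main task.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    """Rough recovery-time estimate in hours (all values hypothetical)."""
    detection_response_h: float   # detect the breach, contain, eradicate
    system_restore_h: float       # restore the process automation system
    reconstitution_h: float       # verify setpoints, modes, safeguards; test
    startup_h: float              # start up the system and the process
    long_term_recovery_h: float   # ramp back up to normal performance

    def time_to_restart(self) -> float:
        """Hours until the process is running again."""
        return (self.detection_response_h + self.system_restore_h
                + self.reconstitution_h + self.startup_h)

    def time_to_normal(self) -> float:
        """Hours until normal performance levels are reached."""
        return self.time_to_restart() + self.long_term_recovery_h


# Illustrative numbers only, not measurements from any real plant.
paper_mill = RecoveryEstimate(4, 8, 0, 2, 6)
chemical_plant = RecoveryEstimate(4, 8, 48, 12, 24)

for name, est in [("paper mill", paper_mill),
                  ("chemical plant", chemical_plant)]:
    print(f"{name}: restart after {est.time_to_restart()} h, "
          f"normal performance after {est.time_to_normal()} h")
```

Both installations have the same detection, response, and system restore times in this sketch. The difference comes entirely from reconstitution, start-up, and long-term recovery, which is exactly the part that lies outside the process automation system.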

So resilience requires us to consider a lot more than just the process automation system. We can build in protection measures that limit the threat actor’s options to cause damage and that facilitate the recovery process. For example, a pump with electric overload protection is more difficult to damage than a pump without such non-hackable protection. Resilience requires more than OT security.
So I hope I have explained that resilience and robustness are two different, though interrelated, concepts. I often read articles that talk about the resilience of a cyber defense while the main topic of the article is actually the robustness of that defense. At a high level, robustness is the left-hand side of our cyber-attack bowties, the preventive controls, whereas resilience includes the right-hand side, the mitigative controls. For a successful defense we need both resilience and robustness: an exclusive focus on resilience ignores the preventative measures, and an exclusive focus on robustness ignores the mitigative measures.
For a holistic approach, we need a cyber-physical risk assessment because we want to understand the relations between the potential cyber-attacks and the process consequences. Only then do we have an overview of the hazards, and only then can we work out a balanced and resilient cyber defense. Critical infrastructure needs to have a robust and resilient security defense.
