Through the Lens of a Case Study: What It Takes to Be a Cyber-Physical Risk Analyst

I regularly write about cyber-physical risk analysis, and as a result I’ve received numerous inquiries on LinkedIn about the path to becoming a cyber-physical risk analyst. To answer this question once, I’ve decided to write it up as an article that I can refer people to.

So, what exactly is the role of a cyber-physical risk analyst? Put simply, these professionals identify and manage the risks for manufacturing processes where computer-based automation and physical installations intersect. What skills and knowledge should they have? I could simply list them and wrap up the article, but to make things a bit more engaging, let’s walk through a simplified analysis to understand what expertise is needed and why. Imagine plant management has tasked us with assessing whether our plant’s risk tolerance criteria for safety and environmental impact are still met in an evolving threat landscape, where cyber threats targeting our production infrastructure are becoming more common due to increased global political tensions.

Where should we begin the analysis? To evaluate our cyber defense against risk tolerance criteria, a good starting point is the plant’s process hazard analysis (PHA) documentation, for example the LOPA (Layer of Protection Analysis) sheet. If that’s not available, the HAZOP (Hazard and Operability Study) sheet is the next go-to document. If this documentation is also unavailable, we would have to conduct a failure mode and effect analysis (FMEA). In the chemical, refining, or offshore industry, however, there is a good chance these documents are available. Process safety engineers use these PHA methods to identify the need for safeguards and to select a number of safeguards sufficient to meet the company’s risk criteria. This requires them to pinpoint safety hazards and define the safety measures necessary for the required risk reduction, an exercise that in most cases takes many months.

Normally, safety hazards are prioritized by their level of impact. This allows us to focus our cyber-physical risk analysis on perhaps the top two or three impact categories, which is usually selective enough while still covering all digital functions in the automation system that controls and safeguards the production process. Process safety engineers assign these impact level categories (sometimes called the target factor) to each process safety scenario they analyze, typically considering risk categories such as human safety, environmental damage, and financial implications, although additional categories may also be taken into account. The assigned impact level category establishes a TMEL (Target Mitigated Event Likelihood) value, the maximum permissible event frequency for a scenario with that level of impact. These impact levels are a combination of company-assigned (financial category) and regulator-assigned (human safety and environment categories) parameters.
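To make the relation between impact level and TMEL concrete, here is a minimal Python sketch. The frequencies in the table are invented for illustration only; real values come from the company’s and the regulator’s risk criteria, and the function names are hypothetical.

```python
# Hypothetical TMEL table: target factor (impact level) -> maximum
# permissible event frequency in events per year. Illustrative values only.
TMEL_PER_YEAR = {
    1: 1e-1,
    2: 1e-2,
    3: 1e-3,
    4: 1e-4,
    5: 1e-5,
}


def tmel_for(target_factor):
    """Return the assumed TMEL (events/year) for an impact level."""
    return TMEL_PER_YEAR[target_factor]


def meets_criteria(mitigated_frequency, target_factor):
    """True if the estimated event frequency does not exceed the TMEL."""
    return mitigated_frequency <= tmel_for(target_factor)


def in_scope(scenarios, top_n=2):
    """Keep only the scenarios in the top_n highest impact levels present."""
    levels = sorted({s["tf"] for s in scenarios}, reverse=True)[:top_n]
    return [s for s in scenarios if s["tf"] in levels]


# Example: three scenarios, of which only the two highest levels are kept.
scenarios = [
    {"id": "CF.2.3.2a", "tf": 5},
    {"id": "example-b", "tf": 4},
    {"id": "example-c", "tf": 2},
]
selected = in_scope(scenarios, top_n=2)
```

The point of the sketch is only that the impact level, not the cause of the event, selects the frequency threshold the analysis has to meet.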

Since risk tolerance criteria are directly tied to these impact levels, those that are defined by a regulator must be strictly observed irrespective of the incident’s causation. Notably, the omnipresence of cyber threats in today’s threat landscape poses a distinctive challenge for process installations, as conventional process safety criteria primarily considered accidental safety events rather than deliberate malicious events such as cyber attacks. From a regulatory perspective, the cause of an incident—whether an intentional cyber attack or a traditional accidental safety event—is not as relevant as the outcomes in terms of meeting established thresholds for acceptable risk. Regulators primarily focus on whether the safety and environmental standards are upheld, and the thresholds for acceptable risks, such as potential for fatalities, injuries, or environmental damage, are not exceeded, regardless of the cause. Let’s take a look at an excerpt from a LOPA sheet that describes such an accidental scenario.

Scenario: CF.2.3.2a
Functional process deviation: Total loss of boiler feed water supply to furnace during normal operation. FT/FV-1X2-03
Loss impact: Loss of steam drum level, overheating and failure of TLEs. Potential fatality, damage, and loss of production < US$10MM
TF: 5
Safety Instrumented Function: HV-1X2-51A/C, HV-1X2-61A/C on low-low steam drum level. (2 out of 3) LT-1X2-03A/B/C
BPCS control action: Disable / close fuel control valves on low-low steam drum level trip (BPCS). Sensors shared with SIS. LT-1X2-03A/B/C, PV-1X2-51/61

Remark: TLE – Thermocouple Level Elements, operate based on the principle of temperature difference. They sense the temperature change between steam and water, as there is a significant temperature difference between the two phases. By detecting this change, TLEs determine the water level within the steam drum, ensuring that the water level stays within safe operating limits to prevent problems such as boiler tube damage or, in the worst case, boiler explosions.
Table 1 – Example line extract from a LOPA sheet (a safety scenario view)

Figure 1 – Water tube boiler control example (a business level view).

The table above shows an excerpt from a LOPA sheet, covering only one of the hundreds of safety scenarios analyzed for a production process. The task of the cyber-physical risk analyst is to transform this safety scenario into a cyber attack scenario in which the attack results in the stated process deviation. In terms of skills and knowledge, so far the risk analyst primarily needs to understand how a plant is organized and who owns what information, so that we can enlist the support of process safety, instrumentation, and process engineers to gather the information needed for our analysis.

However, constructing a credible cyber attack scenario that causes a process deviation requires more effort, because to inflict damage we must do so in a way that bypasses the safeguards. For this, we need to understand the process and the process equipment, which requires us to know the basics of process engineering, for example “reading” a P&ID and understanding what a boiler is and does. A cyber-physical risk analyst will typically “think” in scenarios and determine the risk for each such scenario.

From the safety scenario in Table 1, I created a small, simplified piping and instrumentation diagram (P&ID) showing the valves and sensors discussed in the scenario and giving us an idea of how a boiler works. We need this if we want to construct a credible attack for our analysis, one that could actually succeed. Just hacking the system and closing the feedwater valve (which is the failure scenario in the process safety analysis) will most likely not cause the damage. Both the operator and the safety systems would quickly spot the loss of feedwater and bring the boiler to a safe condition, either manually or automatically, by (among other actions for preventing downstream and upstream impact) stopping the feedwater pump and the boiler heating.

In cyber-physical risk analysis, we need to work scenario-based and estimate each scenario’s likelihood of success for the attacker or, looking at the other side of the coin, the probability of defense failure for the defender, so that we can compare it with the plant’s risk tolerance criteria. Be aware that almost every attack scenario against a process installation requires some orchestration, such as closing a feedwater valve while hiding the consequences for the water level from the process operator and the automated safety functions, and this orchestration affects the risk estimate. Additionally, we need to consider the steps an attacker must take to reach the automation functions that could cause these process deviations. Essentially, we must aggregate the risks from multiple threat actions to estimate the overall risk. This calls for numerical analysis, hence the need for a form of quantitative risk analysis.
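As a simplified illustration of why orchestration matters numerically, the Python sketch below multiplies the conditional success probabilities of the steps in one attack scenario and converts the result into an event frequency. All numbers, the step names, and the idea of an assumed attempt frequency per year are hypothetical assumptions made for this example, not values or methods from a real assessment.

```python
from math import prod


def path_success_probability(step_probabilities):
    """Probability that every step of an orchestrated attack succeeds.

    Each value is assumed to be conditional on the previous steps
    having succeeded, so the values can simply be multiplied.
    """
    return prod(step_probabilities)


def event_frequency(attempts_per_year, step_probabilities):
    """Estimated events per year for one attack scenario."""
    return attempts_per_year * path_success_probability(step_probabilities)


# Hypothetical steps and probabilities of defense failure per step.
steps = {
    "gain access to the control network": 0.1,
    "close the feedwater valve": 0.5,
    "hide the low level from operator and safety function": 0.2,
}

# Assumed attempt frequency of 0.1 per year (one attempt per ten years).
frequency = event_frequency(0.1, steps.values())
```

Each additional step the attacker must complete lowers the overall likelihood, which is why a scenario that only closes the valve and one that also hides the consequences yield different risk estimates.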

In our scenario, the level sensor readings are shared between the process control function and the safety function. This design decision is not unusual because, from an accidental (process safety) perspective, the likelihood of losing three level transmitters simultaneously is considered low. Additionally, the space available for mounting level transmitters on the steam drum is limited. However, sharing them gives us as risk analysts the opportunity to construct an attack scenario in which, by manipulating the level sensors, we simultaneously prevent the safety system from shutting down the boiler and the operator from detecting a too-low water level. Of course, we can achieve the same by attacking the control function and the safety function separately, but that scenario would have a lower estimated likelihood and would therefore not be the relevant one for our check against the risk criteria.
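The effect of manipulating shared transmitters on a 2-out-of-3 low-low trip can be sketched in a few lines of Python. The trip setpoint and the level values are assumptions chosen for the example; the voting rule itself is the “2 out of 3” arrangement named in Table 1.

```python
LOW_LOW_SETPOINT = 20.0  # assumed trip setpoint in % drum level


def two_out_of_three_trip(readings, setpoint=LOW_LOW_SETPOINT):
    """Trip when at least two of the three transmitters read below setpoint."""
    if len(readings) != 3:
        raise ValueError("2oo3 voting needs exactly three readings")
    return sum(reading < setpoint for reading in readings) >= 2


def spoof(readings, spoofed_value, indices):
    """Return a copy of the readings with the given transmitters falsified."""
    return [
        spoofed_value if index in indices else reading
        for index, reading in enumerate(readings)
    ]


# The real drum level is below the low-low setpoint on all transmitters.
true_levels = [12.0, 11.5, 12.3]

trip_without_attack = two_out_of_three_trip(true_levels)            # True
trip_one_spoofed = two_out_of_three_trip(spoof(true_levels, 50.0, {0}))      # True
trip_two_spoofed = two_out_of_three_trip(spoof(true_levels, 50.0, {0, 1}))   # False
```

Falsifying a single transmitter is outvoted, but falsifying two suppresses the trip. Because the same readings feed the control function and the operator display, that one manipulation also hides the low level from the operator.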

So, from a skills and knowledge perspective, a risk analyst needs to understand the business processes. In our case as cyber-physical risk analysts, this means understanding the process installation (for example a polyethylene process), its units (distinct sections within the overall process installation, for example the boiler), and the process equipment (the machinery, devices, instruments, and tools used within each process unit, such as pumps or valves) that make it operate. This understanding does not have to be detailed, but it must be at least foundational, so that we can engage with the plant’s subject matter experts (SMEs) to verify the feasibility of an attack plan.

However, a cyber-physical risk analyst is typically only capable of performing well with a limited set of processes that they understand thoroughly. Switching between processes with completely different characteristics, such as from a petrochemical process to car manufacturing or even to a power grid, is not easy. Although much of the analysis process would be similar, adapting would require time and new knowledge.

Risk analysts who have been working in a specific field for some time quickly recognize that, viewed from a high level, the automation of different production processes doesn’t vary significantly. There may be different types of boilers or reactors (for instance, one plant may use a tubular reactor for continuous polyethylene production, while another employs an autoclave reactor, making it a polyethylene batch process), but the automation and safeguarding principles of these processes share many similarities. However, we must be aware that the number of relevant attack scenarios differs: in general, batch processes offer more attack opportunities than continuous processes.

Once we have devised our attack scenarios against the installation, our next step is to establish the relationship with the digital functions of the process automation system. This is very typical for a bottom-up approach to risk; a top-down approach would begin by analyzing the digital functions, hoping to achieve a granularity that matches the process scenarios. Experience has shown that this is much harder to accomplish. The ISA 62443-3-2 document outlines a top-down approach, which is unfortunate because a standard should not attempt to dictate how to address the issue. Standards shouldn’t be prescriptive, but more performance based in my opinion.

The attack on the installation will be executed by targeting the functions of the process automation system. So, we need to identify those automation functions that can directly (with a single attack action) initiate the process deviation. This requires detailed knowledge of the automation system and its numerous dependencies and internal trust relationships. While the valve positioner may seem like our prime target for stopping the feedwater supply, its suitability as a digital target depends on how it interfaces with the controller. It could be network-connected or directly linked to a controller’s I/O interface cards. Such a link to the I/O card can carry an analog signal, a digital signal, or a hybrid signal such as the one used by HART field equipment, where the digital signal is superimposed on the analog signal. The positioner can be a simple device with no advanced functions (just controlling the position of a valve’s actuator and, consequently, the opening or closing of the valve itself) or a smart device with multiple analytic functions (such as valve stiction analysis or stroke time measurement) that could be exploited. Thus, understanding the exposure of the target is crucial knowledge for the risk analyst.

Moreover, there are numerous trust relationships and dependencies within an automation system, which can differ between products and product releases, and which give threat actors opportunities similar to those presented by the valve positioner. For instance, once access to an operator station is gained, a threat actor can open or close a valve by issuing commands that propagate from the operator station through the controller to the field device, fully based on trust between these functions. Therefore, attacking components such as the operator station, a DCS server, a controller, or an engineering station could yield results similar to directly targeting the valve positioner. Alternatively, if a maintenance function such as an instrument asset management system is implemented and the field equipment is managed by it, attacking this function is also an option. Frequently we can increase the number of targets by considering the exchange of commands over the network: message injection, message alteration, message sinkholes, or replay attacks can also be options for the threat actor. The various dependencies and relationships in a process automation system are very important for correctly evaluating the risk. Today’s automation systems are tightly coupled, so data changes made at the corporate level can cause changes at the installation level. This leads to numerous digital attack scenarios targeting the process automation functions, all of which must be addressed. The risk analyst must have a good knowledge of process automation to understand all of the threat actor’s opportunities.
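One way to picture these trust relationships is as a directed graph in which an edge from A to B means that B accepts commands from A. The Python sketch below, using an invented and heavily simplified set of relationships, lists every function from which the valve positioner can be reached. The graph content is hypothetical; a real system has far more functions and vendor-specific dependencies.

```python
from collections import deque

# Hypothetical trust relationships: TRUSTS[a] lists the functions that
# accept commands from function a.
TRUSTS = {
    "engineering station": ["controller"],
    "operator station": ["DCS server", "controller"],
    "DCS server": ["controller"],
    "instrument asset management": ["valve positioner"],
    "controller": ["valve positioner"],
    "valve positioner": [],
    "historian": [],
}


def can_reach(graph, start, target):
    """Breadth-first search: can commands from start propagate to target?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def entry_points(graph, target):
    """All functions, other than the target, that can reach the target."""
    return sorted(
        node for node in graph
        if node != target and can_reach(graph, node, target)
    )


targets = entry_points(TRUSTS, "valve positioner")
```

In this toy graph every function except the historian is a possible starting point for closing the valve, which illustrates why the list of relevant targets is usually much longer than the field device alone.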

Figure 2 – Infrastructure fragment (a physical view)

The risk analyst requires tools that analyze all these variations, typically using a cyber threat repository with hundreds of attack scenarios for each automation system component, based on the vendor, software version, threat actor profile, and exposure. Such highly specialized tooling allows the risk analyst and threat analyst roles to be kept distinct, which enables better management of the scope and detail required for a cyber-physical risk assessment and gives consistent risk estimates across different projects executed by different analysts.

To prepare for this activity, the risk analyst must convert the “physical” automation system infrastructure (built from physical assets, Figure 2) into a “logical” automation system infrastructure (built from functions, Figure 3). This makes it possible to clearly identify the scope of the risk analysis (which components and protocols are in scope), so that we can identify the dependencies, trust relations, and potential differences in functional behavior (deviation from design intent) resulting from a cyber attack. Such a logical representation is called a scope diagram. It documents all the functions in the analysis and the various channels (protocols) used, which we must “test” with our counterfactual risk analysis engine for sufficient resilience of our defenses.

Figure 3 – A scope diagram, a logical view.

The counterfactual risk analysis engine provides us with a probability of defense failure, given an attack, for each attack scenario and threat actor profile, for each component in the scope diagram. A scope diagram differs from an infrastructure document because it has a function focus: multiple functions can reside in a single physical computing component, and we can also combine different physical components into a single functional element for the purpose of risk analysis.
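A scope diagram can be thought of as a small data structure of functions and channels. The following Python sketch is one possible representation, not the format of any particular tool; the class names, asset names, and the “vendor control protocol” label are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Function:
    name: str
    hosted_on: str  # the physical asset the function runs on


@dataclass(frozen=True)
class Channel:
    source: str
    destination: str
    protocol: str


@dataclass
class ScopeDiagram:
    functions: list = field(default_factory=list)
    channels: list = field(default_factory=list)

    def functions_on(self, asset):
        """Functions residing on one physical asset."""
        return [f.name for f in self.functions if f.hosted_on == asset]

    def protocols_in_scope(self):
        """Distinct protocols used by the channels in the analysis."""
        return sorted({c.protocol for c in self.channels})


diagram = ScopeDiagram(
    functions=[
        Function("DCS server", "server-1"),
        Function("historian", "server-1"),  # two functions on one asset
        Function("control function", "controller-1"),
        Function("valve positioner", "FV-1X2-03"),
    ],
    channels=[
        Channel("DCS server", "control function", "vendor control protocol"),
        Channel("control function", "valve positioner", "HART"),
    ],
)
```

Here one physical server hosts two functions, which is the kind of difference between the physical and the logical view described above.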

The cyber-physical risk analyst should understand the various attack strategies and techniques in order to accurately analyze and explain the results of a risk estimation. As a next step, the analyst aggregates the results for individual system functions and channels into an overall risk value along a constructed attack path, representing an orchestrated attack that must breach multiple functions along the path by exploiting different weaknesses.

Constructing such attack paths is the responsibility of the risk analyst, who must consider the attack objectives threat actors need to achieve to realize their goal and assess the overall likelihood of success (or defense failure). Such a risk evaluation process employs a risk aggregation technique known as Rings of Protection Analysis (ROPA), which describes how to combine risks between components. ROPA assesses the dependencies between components and the various attack paths to decide whether to multiply, add, or take the highest conditional probability for the overall risk calculation.
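The three operations mentioned above can be illustrated with a short Python sketch. This is not the ROPA method itself, whose rules are not spelled out in this article; when each operation applies is my own simplifying assumption, and all probabilities are invented.

```python
from math import prod


def multiply(probabilities):
    """Sequential steps on one path: all must succeed (assumed conditional)."""
    return prod(probabilities)


def add(probabilities):
    """Alternatives assumed mutually exclusive: sum, capped at 1.0."""
    return min(1.0, sum(probabilities))


def highest(probabilities):
    """Alternatives assumed fully dependent: the most likely one dominates."""
    return max(probabilities)


# Hypothetical comparison of the two boiler attack paths discussed earlier.
# Path 1: gain access, then manipulate the shared level transmitters.
shared_sensor_path = multiply([0.1, 0.3])
# Path 2: gain access, then attack control and safety functions separately.
separate_attack_path = multiply([0.1, 0.3, 0.05])

overall = highest([shared_sensor_path, separate_attack_path])
```

With these made-up numbers the shared-sensor path dominates, consistent with the earlier observation that attacking the control and safety functions separately gives a lower likelihood and is therefore not the relevant path for the check against the risk criteria.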

Therefore, cyber-physical risk experts need to understand cyber attack scenarios, be capable of constructing attack paths, grasp risk aggregation techniques, and be able to use various approaches to risk assessment while understanding their benefits and limitations. They also need the skills to report on risk. This is a challenging task because there are many different objectives for conducting a risk analysis, requiring them to choose between multiple risk assessment techniques (top-down, bottom-up, hybrid) and different estimation methods (qualitative, quantitative, hybrid). Additionally, they must understand the data required for these assessments and how to collect and analyze it.

The task of a cyber-physical risk analyst also includes facilitating the assessment process, which requires experience as a consultant capable of organizing teams and leading workshops with subject matter experts to complete the task effectively. So this role is not simple; it demands several years of experience in the OT security field, ideally within an environment where all these disciplines can be learned. Suitable environments to learn the profession might include a large asset owner with a specialized staff team focusing on cyber-physical risk, a large automation vendor that combines the implementation of process automation systems with specialized security services like cyber-physical risk assessments, or an EPC contractor. Given the specialized nature of this role, there are not many companies that fit the bill, which makes it a role that cannot be easily learned.

However, it’s important to remember that risk analysis always requires an understanding of the business process, as the impact and frequency of potential losses are crucial factors. A cyber attack itself does not directly harm individuals; instead, the risk level is determined by deviations in the business process combined with the design of the installation. General descriptions such as ‘loss of view’ or ‘loss of control’ are not very meaningful unless they are directly linked to specific process deviations and safety scenarios.

I hope this article has shed some light on the types of knowledge and skills required to become a cyber-physical risk analyst. Unfortunately, there is no single course or bootcamp that can fully bring you to the appropriate knowledge level. While some master classes are available, much of your development as a risk analyst will be shaped by your career path.
