Growing need to implement effective post-incident recovery strategies in evolving OT, ICS environments

Post-incident recovery strategies are vital in operational technology (OT) and industrial control system (ICS) settings, where they limit the disruption caused by cybersecurity incidents and reduce their impact. These environments, common in critical sectors such as energy, healthcare, and transportation, need robust plans to keep operations running and to prevent cyber events from having cascading effects.

One key approach is to establish dedicated response procedures and trained teams that span both cybersecurity and operations, so that incidents can be detected, contained, and managed quickly. It is also crucial to establish detailed backup and recovery processes for key systems and data, since tested backups can substantially reduce downtime and data loss after a breach. Building redundancy and failover mechanisms into OT/ICS architectures strengthens resilience and minimizes interruptions: redundant components provide alternative operating paths if the primary system is compromised.

Segmenting and isolating networks can restrict the lateral movement of cyber threats, contain incidents, and limit further harm. Continuous monitoring and the integration of threat intelligence also play crucial roles in post-incident recovery: detecting anomalies and emerging threats in real time enables swift, proactive responses to cyber events. Finally, updating recovery plans based on past incidents helps keep them effective and keeps OT/ICS settings resilient.

Industrial Cyber reached out to cybersecurity experts to explore the post-incident recovery strategies that organizations commonly use in OT/ICS environments after cyberattacks. They also address how the threat landscape for OT/ICS environments has evolved in recent years and how that evolution has influenced post-incident recovery efforts.

“Mature organizations will apply lessons learned to improvements in cyber defenses and incident response plans/playbooks,” Paul Shaver, global practice leader at Mandiant’s Industrial Control Systems/Operational Technology Security Consulting practice – Google Cloud, told Industrial Cyber. “Less mature organizations will focus on disaster recovery, recovering/rebuilding and hardening systems, and building an IR capability.”

Shaver noted an increase in ransomware events spilling over from IT into OT, as well as more exploited zero-days in firewalls and remote access tools, leading to a slightly higher rate of compromise in OT environments. “Ransomware-focused threat actors [are] intentionally targeting backup and recovery systems. Nation-state threat actors clean up their tools and malware, wipe systems, and use living-off-the-land techniques to deter or avoid detection and forensics.”

Kai Thomsen, director of global incident response services at Dragos, highlighted that the best recovery strategy is offline backups, and the most important security control to implement post-incident is ICS-aware network security monitoring. 

“The threat landscape against OT is evolving in volume as well as level of sophistication,” Thomsen told Industrial Cyber. “State actors are focusing on developing access to critical infrastructure environments to leverage cyber capabilities in times of conflict, while criminal actors are increasingly targeting OT owners/operators, especially in the manufacturing vertical, with ransomware, as it is much more likely that a ransom will be paid if operations are disrupted.”

Many OT organizations still rely on manual backup processes, with backups stored on external disks or on backup servers that are always online and therefore vulnerable to cyberattacks, Oleg Vusiker, CTO at Salvador Technologies, told Industrial Cyber. “Moreover, the management of backups is often manual, relying on cumbersome spreadsheets and inventory tracking. This manual approach not only increases the risk of errors but also hampers the ability to achieve reasonable Recovery Point Objectives (RPOs).”

He added that cloud or network-based full-image backups may not always be feasible because they consume a large share of OT network bandwidth. “This bandwidth is primarily reserved for critical OT communications among components like PLCs and DCS. The threat landscape for OT/ICS environments has evolved significantly in recent years due to IT/OT convergence. Cyber attackers increasingly target backups, which are crucial for recovery. Having a reliable backup significantly eases the recovery process after a cyber-attack.”

The executives examine how the NIST Cybersecurity Framework (CSF) informs and shapes post-incident recovery strategies in OT/ICS environments. Additionally, they analyze emerging technologies and best practices that are enhancing post-incident recovery in OT/ICS environments.

Shaver pointed to a risk-based approach to identifying and prioritizing critical assets, along with a recovery plan that focuses on the areas most urgently in need of restoration. “Communication and coordination between IT, OT, and business units. Development of OT/ICS-specific incident response and disaster recovery plans. Align with OT/ICS standards such as NIST SP 800-82 and IEC 62443.”

He added that many brownfield environments run existing technology that may not support emerging technologies. Promising capabilities include digital twins, immutable backups, micro-segmentation evolving toward zero-trust capabilities, and the use of AI/ML in the investigation process.

Shaver also pointed out that best practices are well documented in NIST standards and IEC 62443. They include building defensible architectures; building and testing collaborative incident response capabilities, including ICS/OT-specific IR plans that cover backup and recovery; implementing logging, monitoring, and network visibility specific to OT environments; establishing secure remote access capabilities; building asset and vulnerability management programs; and maintaining critical spare equipment.

“NIST SP 800-82r3 provides high-level guidance on how to implement a practical recovery strategy for an OT environment. However, virtually every OT environment is unique and thus often requires a lot of tailoring for incident response and recovery plans to be effective,” according to Thomsen. “A practical recovery strategy for an OT environment has a great deal of complexity that is very difficult to prescribe solutions for in detail. Therefore, it takes a combination of the framework and experienced professionals to make it practical.”

Vusiker said that the NIST CSF influences post-incident recovery strategies in OT/ICS environments by emphasizing the importance of reliable backups for data restoration, following a sequence of isolation, detection, cleaning, and restoration. “It guides organizations in developing efficient recovery plans that prioritize the swift restoration of critical systems from backups, minimizing downtime and operational disruptions.”

He pointed out that emerging technologies and best practices for improving post-incident recovery in OT/ICS environments include air-gapped backups for enhanced data security, proactive business continuity planning (BCP) to mitigate risks in advance, and regular recovery exercises to ensure readiness and effectiveness.

The executives explore how organizations prioritize their post-incident recovery efforts to minimize disruptions to critical infrastructure. They also examine the role that regulations and compliance standards play in shaping post-incident recovery strategies for OT/ICS environments.

Shaver listed critical asset identification, business impact analysis (BIA), recovery plan development, backup and redundancy, and tabletop exercises. For prioritization during recovery, he recommended rapid triage to quickly assess the incident’s scope, impact, and containment needs and to prioritize restoring the most critical assets and services, along with a phased approach that breaks the recovery process into smaller, manageable phases with clear goals.

He also stressed functionality over perfection, focusing first on restoring core functionality to minimize downtime. Lastly, he pointed to the tension between forensics and restoration: balancing evidence preservation against system restoration requires careful decision-making under pressure. “Collaborate with law enforcement or forensics experts to determine the optimal course of action.”

On the role that regulations and compliance standards play, Shaver highlighted two areas: mandatory reporting, investigation, and penalties; and demonstrating due diligence. “Many regulations mandate that incidents affecting critical infrastructure must be reported to regulatory bodies and potentially government agencies within specific timeframes, and non-compliance can lead to fines or sanctions.”

He added that organizations often need to demonstrate that reasonable cybersecurity measures and recovery procedures are in place, even if an incident occurred. “Adherence to compliance standards, even during recovery, can act as a defense during legal scrutiny or regulatory inquiries arising from an incident.”

Thomsen said, “From what we have seen in OT assessments and incident response, many organizations still do not properly prioritize and do not have a full overview of their crown jewels or a disaster recovery or business continuity plan that walks administrators and operators through recovery procedures.” 

He added that industries with strong safety and disaster recovery regulations, such as offshore operations, oil and gas, and electric power, tend to have good recovery plans.

“Organizations prioritize detection and prevention cybersecurity tools to minimize disruptions to critical infrastructure, often allocating fewer resources to post-incident response activities,” according to Vusiker. “Consequently, the average recovery time from cyber events is very long. Regulations and compliance standards, like ISA/IEC 62443, play a crucial role in shaping recovery strategies for OT/ICS environments, emphasizing the importance of reliable backups for data restoration and ensuring rapid response and adherence to industry best practices.”

The executives discuss how organizations evaluate the efficacy of their post-incident recovery plans and make necessary adjustments. They examine common challenges and obstacles that organizations encounter during the recovery process from cyberattacks in OT/ICS environments and explore strategies to overcome these hurdles.

“Continual improvement process by performing tabletop exercises and applying lessons learned (based on NIST CSF) to the overall IR capabilities,” Shaver said. “Legacy systems, reliance on third parties, extended downtime, and specialized skills shortages can all be addressed through proactive planning processes, continual improvement, and testing.”

Thomsen pointed out that many organizations have no or very limited visibility into their OT networks and thus lack the most important control to obtain and maintain situational awareness. “We suggest organizations use tabletop exercises to test their IR plans and implement the SANS Five Critical Controls to build a more effective defense post-incident. That includes good visibility to detect any future compromise and to more quickly investigate.” 

Organizations assess post-incident recovery plans through regular evaluations and practical exercises, incorporating lessons learned into their business continuity plans (BCP), Vusiker said. “Challenges in OT/ICS recovery stem from the complexity of industrial systems, often consisting of legacy equipment. Additionally, reliance on manual processes and outdated technologies for backup and recovery can slow down response times and elevate the risk of errors.”

Addressing the long-term implications of successful post-incident recovery efforts for an organization’s overall cybersecurity resilience in OT/ICS environments, Shaver pointed to leveraging lessons learned to capitalize on the investments made in improving cyber defense and incident response capabilities, and to gaining the support of leadership and boards for long-term investments in cyber resilience.

“Reducing mean time to recovery is the most important metric for OT incident response and recovery efforts, followed by identifying the root cause of the compromise and establishing proper security controls to eliminate or mitigate this root cause,” Thomsen mentioned. “Establishing network visibility into their OT environments that understand industrial communication and threat behavior is key for improving resilience. With situational awareness, operators are able to significantly reduce their mean time to recovery.”

“Successful post-incident recovery efforts in OT/ICS environments have a significant impact on an organization’s cybersecurity resilience,” Vusiker said. “By swiftly addressing cyber incidents, organizations can refine their incident response procedures and increase confidence in their ability to protect critical infrastructure. This involves implementing lessons learned from past incidents and recovery exercises, enhancing communication and collaboration among teams, and investing in technologies that facilitate quicker recovery times,” he concluded.