Offline Backups Are Not Enough: Building a Recovery System for PLCs, HMIs, and Controller Configurations

If your OT backup strategy ends at “we have copies,” you do not have a recovery plan.

You have a hope archive.

In ICS environments, recovery is not just about having a file stored offline. The real question is whether your team can restore the right controller logic, HMI project, firmware version, network settings, licenses, dependencies, and configuration state under pressure.

That is where many plans fail.

A resilient OT recovery program needs more than backups. It needs:

1. Version-controlled PLC and HMI projects
Know what changed, when it changed, who approved it, and which version is production-valid.

2. Offline and protected recovery copies
Backups must be isolated from ransomware, accidental overwrites, and unauthorized modification.

3. Firmware and dependency mapping
A controller file may be useless if the required firmware, engineering software, drivers, or vendor tools are missing.

4. Tested restoration workflows
If restoration has never been rehearsed, the first real incident becomes the test.

5. Role-aware procedures
Operators, engineers, IT, vendors, and incident responders need clear responsibilities before an outage begins.

6. Network and device configuration recovery
Switches, firewalls, remote access appliances, historian connectors, and controller settings are part of the recovery chain.
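To make points 1–3 concrete, here is a minimal Python sketch of a recovery manifest that records, for each controller backup, everything a real restore would need. Device names, file names, versions, and tools are illustrative assumptions, not a specific vendor's format:

```python
# Hypothetical recovery manifest: each backup entry records not just the
# project file but the firmware, tooling, and approval trail needed to
# actually restore it. All values below are illustrative examples.
MANIFEST = {
    "PLC-101": {
        "project": "plc101_v12.acd",
        "firmware": "20.011",             # firmware the project requires
        "tools": ["Studio5000 v32"],      # engineering software needed to open it
        "approved_by": "j.doe",           # who signed off the production-valid version
    },
}

def restore_readiness(entry, available_firmware, installed_tools):
    """Return the list of gaps that would block a real restore."""
    gaps = []
    if entry["firmware"] not in available_firmware:
        gaps.append(f"missing firmware {entry['firmware']}")
    for tool in entry["tools"]:
        if tool not in installed_tools:
            gaps.append(f"missing tool {tool}")
    return gaps
```

Run against your actual recovery jump kit, a check like this turns "we have copies" into "we can prove a restore would not stall on a missing dependency."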

The goal is not to prove that backups exist.

The goal is to prove that production can be safely restored.

In OT, recovery readiness is measured in validated restore capability, not storage capacity.

AI-Accelerated Ransomware in OT: When Attackers Stop Encrypting and Start Disrupting Operations

The next OT ransomware threat is not just smarter malware.

It is an attacker using AI to understand your plant faster than your own incident team can respond.

For years, ransomware in industrial environments was mostly treated as an IT problem that spilled into OT: encrypted workstations, locked servers, delayed production, and recovery pressure.

That model is changing.

With LLMs, attackers no longer need deep domain expertise to interpret maintenance manuals, vendor documentation, alarm logic, operating procedures, or engineering notes. AI can help them move from “we got access” to “we understand how this process works” much faster.

That changes the risk equation.

The future concern is not only data theft or encryption. It is process-aware disruption:

• Manipulating sequencing or setpoints
• Targeting safety-adjacent systems
• Timing attacks around maintenance windows
• Disrupting batch quality instead of stopping production
• Using stolen documentation to pressure operators with credible threats

In OT, context is power. AI gives attackers a shortcut to context.

This means OT leaders should prepare for ransomware operators that are less dependent on specialist knowledge and more capable of operational impact.

Key questions to ask now:

• What plant documentation is exposed, overshared, or poorly controlled?
• Can our incident team interpret OT process impact as quickly as an AI-assisted attacker can?
• Do our playbooks cover disruption scenarios beyond encryption?
• Are engineering workstations, vendor access, and backup procedures tested under realistic attack conditions?
• Can we isolate safely without creating more operational risk?

Ransomware defense in OT can no longer be only about restoring files.

It must be about preserving control, safety, and operational continuity when the attacker understands the process.

CISA’s AI-in-OT guidance, translated into a practical checklist for security leaders

Most teams read CISA guidance like a PDF to file away.
Treat it like an architecture spec: if you can’t point to the control in your OT network, you don’t have “AI security” — you have AI exposure.

Here’s a lightweight checklist to turn AI-in-OT principles into implementable controls:

1) Asset + data inventory
– Where are AI models running (edge gateway, historian tier, cloud)?
– What OT data feeds them (tags, logs, images), and where does it leave the plant?

2) Data handling controls
– Classify OT data; define allowed uses (training vs inference).
– Minimize retention; encrypt in transit/at rest; restrict exports.

3) Model and pipeline access
– Separate service accounts; least privilege; MFA for consoles.
– Signed artifacts; controlled model promotion (dev/test/prod).

4) Network segmentation
– Place AI components in a dedicated zone.
– Limit flows to required protocols/ports; one-way where feasible.

5) Monitoring + detection
– Log model access, prompts/inputs, outputs, and admin actions.
– Alert on abnormal data pulls, sudden model changes, new egress paths.

6) Supplier and integration risk
– Require SBOM/model provenance; patch SLAs; remote access controls.
– Validate connectors to PLC/HMI/historian; document trust boundaries.

7) Safety and fail-safe behavior
– Define what the AI can and cannot actuate.
– Ensure manual override; graceful degradation to known-safe mode.

8) Incident response for AI in OT
– Run playbooks for: data exfil, model tampering, prompt injection, drift.
– Pre-stage rollback models; isolate the AI zone without halting operations.
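The "new egress paths" alert in point 5 can be sketched in a few lines: keep an approved baseline of flows out of the AI zone and flag anything outside it. The zone names and (source, destination, port) tuple shape are assumptions for illustration, not any product's schema:

```python
# Approved egress baseline for the AI zone (illustrative names and ports).
APPROVED_EGRESS = {
    ("ai-zone", "historian-dmz", 443),
    ("ai-zone", "model-registry", 443),
}

def new_egress_alerts(observed_flows):
    """Return observed flows that fall outside the approved baseline."""
    return sorted(f for f in observed_flows if f not in APPROVED_EGRESS)
```

The point is not the code; it is that "alert on new egress paths" only works if an explicit, reviewed baseline exists to diff against.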

If you had to prove AI-in-OT security in 30 minutes, which of these would you struggle to evidence?

Why LLMs still don’t belong in OT/ICS pen tests (and what to automate instead)

Hot take: the biggest risk isn’t that LLMs will miss vulnerabilities. It’s that they’ll make you overconfident and move faster than your safety controls can tolerate.

OT/ICS testing is not a web app sprint.
You are working in safety- and uptime-critical environments where:
– A wrong assumption can trigger downtime
– “Probably safe” actions can create real-world impact
– Context lives in diagrams, vendor quirks, and plant procedures, not in prompts

Where LLMs are risky in OT/ICS:
– AI-led exploitation: hallucinated commands, wrong protocol details, unsafe payloads
– Autonomous decision-making: chaining actions without understanding process state
– “Confident” triage: misranking findings when risk is process-dependent

What to automate instead (high leverage, low blast radius):
– Pre-engagement: scope drafting, rules of engagement, outage windows, asset lists
– Documentation: turning notes into clean test evidence, timelines, and reports
– Data wrangling: log parsing, packet metadata summaries, config diffing
– Test readiness: checklists, safety gates, runbooks, peer-review prompts
– Comms: stakeholder updates, change-control language, finding summaries

Principle: use AI to accelerate preparation and clarity, not to drive actions on live control networks.

If you are building or buying “AI for OT security,” ask one question:
What stops the model from doing something unsafe when it is almost right?

A Practical Reading of CISA Guidance for Using AI in OT: Controls You Can Implement This Quarter

Most teams treat CISA guidance like a PDF to acknowledge — the advantage goes to the ones who turn it into vendor contract clauses, model/data boundaries, and OT-specific monitoring on day one.

CISA’s AI guidance is only useful when it becomes concrete policies, procurement requirements, and technical guardrails that reduce attack surface.

A practical checklist you can implement this quarter for AI in OT:

1) Data boundaries
– Classify OT data and explicitly define what can/can’t leave the site
– Prohibit training on your telemetry by default; allow only with written approval
– Require encryption in transit and at rest; define retention and deletion SLAs

2) Access and identity
– Separate AI tooling accounts from operator engineering accounts
– Enforce MFA, least privilege, and time-bound access for vendors
– Log every model prompt, action, and data access path (and where possible, block high-risk actions)

3) OT monitoring and detection
– Add AI-related telemetry to your OT SOC use cases: new outbound flows, new service accounts, unusual historian queries
– Monitor for model-driven changes to setpoints, logic, recipes, or alarm thresholds

4) Procurement and contracts
– Contractually require SBOMs, vulnerability disclosure timelines, and patch SLAs
– Define model update controls: change notice, rollback plan, and validation in a test environment
– Require documented data lineage and a clear boundary between customer data and vendor training data

5) Supply chain and architecture
– Prefer on-prem or tightly scoped edge deployments for sensitive environments
– Segment AI components like any other critical OT asset; restrict egress by default

If you’re adopting AI in OT this year, which of these is hardest in your environment: data boundaries, monitoring, or vendor contract language?

Why LLMs Still Don’t Belong in OT/ICS Pen Tests (Yet): Reliability, Safety, and Liability Gaps

The hottest AI demos break in the one place you can’t afford “close enough”. If your pen test plan can’t be defended in a safety review or an audit, it’s not an OT pen test. It’s a lab experiment.

OT/ICS testing is different because the outcome isn’t just “data loss”. It can be downtime, damaged equipment, environmental impact, or safety incidents.

Where LLMs still fall short for OT/ICS pen tests:

1) Reliability
LLMs can hallucinate protocol behavior, device capabilities, CVE applicability, or remediation steps. In enterprise IT, that’s wasted time. In OT, it can drive unsafe actions.

2) Determinism and traceability
Assessments need repeatable steps, evidence, and clear provenance. “The model suggested…” is not a defensible control narrative.

3) Safety-first constraints
OT testing requires strict change control, defined stop conditions, and an understanding of process state. LLMs don’t inherently reason about physical consequence or operational context.

4) Liability and accountability
When guidance is wrong, who owns the risk: the tester, the vendor, the model provider? In regulated or safety-critical environments, that ambiguity is unacceptable.

AI still has a role, just not as the decision-maker.
Use LLMs to accelerate low-consequence work: summarizing vendor docs, drafting test plans for human review, parsing logs, mapping findings to standards, generating reporting language.

But keep final calls human-led: what to probe, how far to go, when to stop, and what is safe to recommend.

If you’re building AI for OT security, the bar isn’t “helpful”. It’s defensible, deterministic, and safe under audit.

From IT AD to historian ransomware: the dual-homing pivot path most teams don’t model end-to-end

If your historian can talk both ways, assume an attacker will use it as a router.

Here’s the pivot path I see repeatedly when incidents cross from IT into OT:

1) AD compromise (IT)
– Phished creds or token theft lands an attacker on a workstation/server.
– They enumerate AD, find service accounts, remote management paths, and “who talks to the historian.”

2) Lateral movement to the historian (the choke point)
– The historian is trusted, always-on, and connected to everything that matters.
– Dual-homed networking or shared credentials turn it into the bridge.

3) Ransomware on the historian = encrypted visibility
– Even before PLCs are touched, operations lose trending, alarms, reports, and context.
– Recovery is slow because historians often sit outside normal backup discipline.

4) Pivot into OT
– From the historian host, attackers reuse credentials, remote tools, or open routes to reach engineering workstations, HMIs, jump hosts, and OT management services.

Three places to stop this early:
A) Kill the credential chain
– Separate identity boundaries for OT, no AD trust shortcuts, rotate and scope service accounts, remove shared local admin.

B) Break the network bridge
– True segmentation between IT and OT, tightly controlled conduits, deny-by-default, and avoid dual-homed “convenience” paths.

C) Make the historian resilient
– One-way data transfer patterns where possible (data diode / brokered replication), immutable backups, and tested restore procedures.
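Point B is auditable today: from an interface inventory, flag any host with a foot in both IT and OT address space. The subnets and inventory shape below are assumptions for illustration; substitute your own zones:

```python
import ipaddress

# Example zone definitions -- replace with your real IT and OT ranges.
IT_NET = ipaddress.ip_network("10.10.0.0/16")
OT_NET = ipaddress.ip_network("192.168.50.0/24")

def dual_homed(inventory):
    """inventory: {hostname: [ip, ...]} -> hosts bridging both zones."""
    flagged = []
    for host, ips in inventory.items():
        addrs = [ipaddress.ip_address(ip) for ip in ips]
        if any(a in IT_NET for a in addrs) and any(a in OT_NET for a in addrs):
            flagged.append(host)
    return flagged
```

Every host this flags is a candidate router in the attacker's trust model, whatever it says on your architecture diagram.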

Most teams model IT ransomware and OT safety separately. The historian is where those stories merge.

Where does your historian live in the trust model: a sensor, or a router?

Modern web-based HMIs: Do they really add attack surface—or just make the existing one visible?

Hot take: the “web HMI = more insecure” claim is usually an architecture problem, not a technology problem.

Browser-based HMIs don’t magically create new risk. They often expose the risk you already had in thick clients: weak identity, flat networks, slow patching, and unclear ownership.

If you’re evaluating a web HMI, don’t debate web vs. native. Ask what you are actually deploying.

Key design choices that determine real-world risk:

1) Identity and access
– Central IdP, MFA for remote access
– Role-based access, least privilege
– Separate operator vs engineer privileges

2) Session handling
– Short-lived tokens, rotation, timeouts
– No shared accounts, no “always logged in” kiosks without compensating controls

3) Network exposure
– No direct internet path to OT
– DMZ, reverse proxy, allow-listing
– Remote access via VPN/ZTNA with device posture

4) Update and vulnerability cadence
– Who patches what, and how fast
– SBOM, dependency scanning, signed builds
– Documented rollback and maintenance windows

5) Observability
– Central logs, auth events, configuration change trails
– Alerting that someone actually reads

Modernization is not the risky part. Unclear boundaries are.

If you want a quick gut check: show me your auth model, network zones, and update process and I’ll show you your risk.

What’s the hardest part for your team today: identity, network segmentation, or patching?

In OT, risk isn’t a buzzword: operationalize (threat × vulnerability × asset) into a weekly prioritization loop

Most OT programs fail because they rank vulnerabilities, not risk.

Flip it: start with your assets and credible threats, then decide which vulnerabilities actually matter enough to fix this week.

When every finding is “critical,” nothing gets done. The backlog becomes political, and engineering, IT, and operations debate severity instead of impact.

A simple, repeatable model breaks the stalemate:
Risk = Threat × Vulnerability × Asset

Turn that into a weekly loop:
1) Asset: Pick the top systems that keep product moving and people safe (not everything).
2) Threat: Agree on the few credible scenarios that could realistically hit those assets (not theoretical CVSS fear).
3) Vulnerability: Only then map weaknesses that enable those scenarios.
4) Score: Use a consistent 1–5 scale for each factor. Multiply. Rank.
5) Commit: Fix the top 5–10 items this week. Everything else waits.
6) Review: Capture what changed in the environment, threats, or compensating controls and rescore next week.
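The whole loop fits in a few lines of Python. Scores and scenario names below are made-up examples on the 1–5 scales from step 4:

```python
# Illustrative asset-threat-vulnerability paths, each scored 1-5 per factor.
paths = [
    {"path": "Historian ransomware via IT creds",      "threat": 4, "vuln": 4, "asset": 5},
    {"path": "HMI default password on packaging line", "threat": 3, "vuln": 5, "asset": 2},
    {"path": "Legacy CVE on air-gapped test rig",      "threat": 1, "vuln": 5, "asset": 1},
]

def weekly_commit(paths, top_n=5):
    """Score (threat x vuln x asset), rank, and commit to the top N this week."""
    ranked = sorted(paths, key=lambda p: p["threat"] * p["vuln"] * p["asset"],
                    reverse=True)
    return [p["path"] for p in ranked[:top_n]]
```

Notice what the scoring does to the "scary" CVE on the test rig: a vulnerability score of 5 still lands at the bottom once threat and asset are factored in. That is the model doing its job.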

Outcome: shared language across OT engineering, IT security, and operations, and a prioritized plan tied to real-world impact.

If your OT backlog feels permanent, stop asking “Which vulnerabilities are worst?”
Start asking “Which asset-threat paths are most likely and most damaging this week?”

ISA/IEC 62443 for SIS: stop treating Safety Systems as “off-limits” and start applying security levels like an engineering spec

Contrarian take: the safest SIS is the one you can still patch, monitor, and validate.

Too many SIS environments get a security pass because they’re “safety-critical.”
That logic is backwards.

If a cyber event can change logic, blind diagnostics, or disrupt comms, your safety case is now conditional on security you didn’t specify.

ISA/IEC 62443 gives a practical way out: define Security Levels at SIS boundaries and turn risk talk into engineering requirements.

What that looks like in practice:
– Define SIS zones/conduits explicitly (SIS controller, engineering workstation, diagnostics, vendor remote access)
– Assign target SL based on credible threat capability, not comfort level
– Translate SL into design requirements: segmentation, authentication, hardening, logging, backup/restore, update strategy
– Make it testable: FAT/SAT cybersecurity checks, periodic validation, evidence for MOC and audits
– Assign ownership: who maintains accounts, patches, monitoring, and exception handling
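The SL-to-requirements translation can start as a simple lookup that makes the spec testable. The requirement text below is a simplified sketch, not the normative wording of ISA/IEC 62443:

```python
# Simplified, illustrative mapping from target Security Level to concrete
# design requirements -- not the standard's normative requirement text.
SL_REQUIREMENTS = {
    1: ["unique accounts", "basic event logging"],
    2: ["MFA for remote access", "signed firmware updates"],
    3: ["hardware-backed authentication", "continuous integrity monitoring"],
}

def requirements_for(target_sl):
    """Cumulative requirements: a zone at SL-n inherits everything below it."""
    reqs = []
    for level in range(1, target_sl + 1):
        reqs.extend(SL_REQUIREMENTS.get(level, []))
    return reqs
```

Once each SIS zone has a target SL and a checkable requirement list like this, FAT/SAT cybersecurity checks stop being opinion and become pass/fail evidence.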

Security levels aren’t bureaucracy. They’re how you prove the safety function still holds under cyber conditions.

If your SIS is “off-limits” to security engineering, it’s also off-limits to assurance.

How are you defining SIS security boundaries and target SLs today?