Designing Patch SLAs That Security and Operations Both Accept

The patch SLA negotiation between security and operations teams is one of the most predictable organizational conflicts in enterprise IT. Security wants 24 hours for Critical CVEs because the threat intel says attackers begin exploitation within that window. Operations wants at least 72 hours because they need to schedule a change window, run a test deployment in staging, get manager approval, notify affected business units, and coordinate with the DBA team if the patch touches anything near the database tier. Both positions are reasonable. Both positions are frequently irreconcilable using the same SLA tier for every system.

The solution isn't to split the difference at 48 hours and call it done. That produces an SLA that security considers too slow for genuinely critical threats and that operations considers unreliably achievable for complex infrastructure. The better design is a multi-axis SLA model that independently accounts for vulnerability severity, asset criticality, and deployment complexity — producing an SLA that's tight where it needs to be and realistic where it doesn't.

Why Single-Tier SLAs Fail

A flat "Critical CVEs remediated in 72 hours" SLA creates two failure modes. The first is the obvious one: a CVE that's being actively mass-exploited with a working Metasploit module against your public-facing authentication service should not be on the same 72-hour clock as a Critical CVE in an internal batch processing tool with no network exposure. The threat landscape doesn't operate on a flat severity model, and your SLA shouldn't either.

The second failure mode is harder to see: flat SLAs calibrated to the worst-case deployment scenario (complex, production, highly-constrained) are too slow for the easy cases that make up most of the patch queue. If your Critical SLA is 72 hours because the database team needs that much lead time, and most of your Critical CVEs are on Linux web servers that can be patched in under 30 minutes, you've systematically delayed your easiest remediations to accommodate your hardest ones. That's backwards.

The Two-Axis Model

The most practical improvement over flat SLAs is a two-axis model that combines vulnerability severity tier with asset criticality tier. Severity determines the urgency band (how fast must this be addressed at all?). Asset criticality determines the deployment path (how much process overhead is required?).

Severity tiers, based on enriched risk scores rather than raw CVSS: Critical-Active (KEV-listed or actively exploited), Critical-High (CVSS 9.0+ with public exploit), High (CVSS 7.0-8.9 or CVSS 9.0+ with no exploit), Medium (CVSS 4.0-6.9), Low (below 4.0). The key distinction between Critical-Active and Critical-High is that KEV-listed CVEs demand immediate response regardless of CVSS score, while CVSS-only Criticals can follow standard process.
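As a sketch, the severity-tier definitions above can be encoded as a small classification function. The parameter names are illustrative; in practice these flags would come from your enrichment pipeline (KEV feed, exploit database, threat intel).

```python
def severity_tier(cvss: float, kev_listed: bool = False,
                  actively_exploited: bool = False,
                  public_exploit: bool = False) -> str:
    """Map enriched vulnerability data to a severity tier, per the
    definitions above (enriched risk, not raw CVSS alone)."""
    if kev_listed or actively_exploited:
        return "Critical-Active"   # KEV listing trumps CVSS score
    if cvss >= 9.0 and public_exploit:
        return "Critical-High"
    if cvss >= 7.0:                # includes CVSS 9.0+ with no public exploit
        return "High"
    if cvss >= 4.0:
        return "Medium"
    return "Low"
```

Note that the checks are ordered: a CVSS 5.4 vulnerability on the KEV list still classifies as Critical-Active, which is exactly the "KEV overrides CVSS" distinction described above.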

Asset criticality tiers, defined by the operations team based on actual deployment constraints: Tier 1 (internet-facing production, can be patched with automated deployment, no manual approval required), Tier 2 (internal production, requires change ticket and 4-hour notice to application owner), Tier 3 (production with specialized constraints: databases, payment systems, regulatory scope requiring multi-party approval). The tiers aren't about importance — they're about how long the deployment process realistically takes.

SLA Matrix: An Example

Combining these axes produces a matrix where every cell has a defined SLA that reflects both the urgency of the threat and the realistic deployment timeline. A practical example:

Critical-Active / Tier 1: 4 hours. Automated deployment, no change window required. PatchGuard deploys immediately with a post-patch health check.
Critical-Active / Tier 2: 12 hours. Change ticket auto-created, application owner notified, deployment in the next available window within 12 hours.
Critical-Active / Tier 3: 24 hours. Emergency change process triggered; multi-party approval compressed under the emergency escalation path.

Critical-High / Tier 1: 24 hours. Automated deployment during a configured maintenance window.
Critical-High / Tier 2: 48 hours. Standard change process with 24-hour advance notification.
Critical-High / Tier 3: 72 hours. Standard change process with the full review cycle.

High / Tier 1: 7 days. Scheduled batch patching.
High / Tier 2: 14 days. Monthly patch cycle with the standard change process.
High / Tier 3: 21 days. Monthly patch cycle with extended review for specialized systems.
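The matrix can be captured as a simple lookup table, which makes SLA assignment a mechanical step rather than a judgment call. A minimal Python sketch, transcribing the example cells above (the Medium and Low rows would extend the same dict):

```python
from datetime import datetime, timedelta

# SLA hours per (severity tier, asset criticality tier) cell,
# transcribed from the example matrix above.
SLA_HOURS = {
    ("Critical-Active", 1): 4,   ("Critical-Active", 2): 12,  ("Critical-Active", 3): 24,
    ("Critical-High",   1): 24,  ("Critical-High",   2): 48,  ("Critical-High",   3): 72,
    ("High",            1): 168, ("High",            2): 336, ("High",            3): 504,
}

def patch_deadline(detected_at: datetime, severity: str, asset_tier: int) -> datetime:
    """Deadline for a patch, given when the CVE/asset pairing was detected."""
    return detected_at + timedelta(hours=SLA_HOURS[(severity, asset_tier)])
```

A missing-key error here is a feature, not a bug: an asset with no criticality tier assignment should fail loudly rather than silently inherit a default SLA.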

The full five-by-three matrix produces fifteen distinct SLAs (the Medium and Low rows follow the same pattern at longer intervals), each reflecting the intersection of real threat urgency and real operational constraints. Security teams can accept it because the most dangerous threats get the fastest response. Operations teams can accept it because complex systems still get reasonable timelines for non-emergency patches.

Handling SLA Conflicts in Practice

The matrix resolves most individual patch decisions automatically: the SLA for a given patch is determined by the CVE's severity tier and the affected asset's criticality tier, with no judgment required. The remaining conflicts cluster in two scenarios: a Critical-Active CVE on a Tier 3 system, where even the compressed 24-hour emergency change process is operationally difficult, and multiple Critical patches needing to reach the same system simultaneously (a sequencing question, not a timeline question).

For the first scenario — emergency changes on constrained systems — having a pre-approved emergency change template in your change management system is essential. ServiceNow and JIRA Service Management both support emergency change categories with abbreviated approval workflows. The emergency template should require: CISO or delegate approval (one person, not a committee), a defined rollback plan, and a post-change review within 24 hours. When PatchGuard detects a KEV-listed CVE on a Tier 3 asset, it auto-creates the emergency change ticket with this template, pre-populated with the CVE details and affected asset list, so the approval process can proceed without manual ticket creation overhead.
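A generic sketch of what the pre-approved emergency template might carry. The field names below are purely illustrative, not an actual ServiceNow or Jira Service Management schema; the point is that every required element (single approver, rollback plan, post-change review) is pre-populated so approval can start immediately.

```python
def emergency_change_payload(cve_id: str, assets: list[str], rollback_plan: str) -> dict:
    """Build an emergency-change ticket payload from a KEV detection.
    Field names are hypothetical, not a real ITSM tool's API schema."""
    return {
        "change_type": "emergency",
        "summary": f"Emergency patch: {cve_id} (actively exploited)",
        "affected_assets": assets,
        # One named approver, not a committee, per the emergency template.
        "approval": {"required": ["CISO-or-delegate"], "committee": False},
        "rollback_plan": rollback_plan,
        "post_change_review_due_hours": 24,
    }
```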

For the second scenario — multiple critical patches competing for the same change window — the sequencing decision should be driven by risk score, not by patch release date or ticket creation order. Patch the highest-enriched-risk CVE first. If two CVEs have similar risk scores, patch the one with the simpler rollback path first (you want clean rollback options if something goes wrong early in the window).
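The sequencing rule reduces to a two-key sort: enriched risk score descending, then rollback complexity ascending as the tiebreaker. A minimal sketch, assuming each patch record carries a numeric `risk_score` and a `rollback_complexity` rating (both names are illustrative):

```python
def sequence_patches(patches: list[dict]) -> list[dict]:
    """Order competing patches for a single change window:
    highest enriched risk first; on equal risk, simpler rollback first."""
    return sorted(patches, key=lambda p: (-p["risk_score"], p["rollback_complexity"]))
```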

Getting Organizational Buy-In

The SLA matrix is only effective if both security and operations teams have ownership of their respective axes. Security owns severity tier definitions — what makes something Critical-Active versus Critical-High is a threat intelligence judgment that operations shouldn't override. Operations owns asset criticality tier assignments — what makes a system Tier 1 versus Tier 3 is a deployment complexity judgment that security shouldn't override.

Present the matrix as a joint product of both teams' requirements, not as something security dictated to operations or vice versa. The framing that works: "Security determined how fast threats need to be addressed based on real-world exploitation data. Operations determined how fast deployments can happen based on real operational constraints. This matrix is where those two inputs meet." Neither team is being asked to ignore their constraints — they're being asked to document them systematically so that the intersection is clear.

Annual review of the matrix is important. Asset criticality tiers should be revisited as infrastructure changes: a system that was Tier 3 last year because it lacked automated deployment infrastructure may now be Tier 1 after a CI/CD pipeline was built. Severity tier definitions should be revisited as threat intelligence practices evolve. The matrix is a living operational document, not a permanent policy.

Measuring Compliance

Track SLA compliance per cell in the matrix, not just overall compliance. An organization with 95% overall SLA compliance could be achieving that by consistently meeting the easy-case SLAs (High / Tier 1) while routinely missing the hard-case ones (Critical-Active / Tier 3). Disaggregated reporting shows where the process is failing and makes the improvement work specific rather than generic.
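Per-cell compliance is straightforward to compute once each completed patch is recorded with its matrix cell. A minimal sketch, assuming records of (severity tier, asset tier, actual hours to patch, SLA hours):

```python
from collections import defaultdict

def compliance_by_cell(records) -> dict:
    """records: iterable of (severity, asset_tier, hours_to_patch, sla_hours).
    Returns {(severity, asset_tier): fraction of patches within SLA}."""
    met = defaultdict(int)
    total = defaultdict(int)
    for sev, tier, hours, sla in records:
        total[(sev, tier)] += 1
        if hours <= sla:
            met[(sev, tier)] += 1
    return {cell: met[cell] / total[cell] for cell in total}
```

Disaggregating this way surfaces exactly the failure pattern described above: a healthy aggregate number can hide a cell like Critical-Active / Tier 3 sitting well below target.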

PatchGuard's SLA dashboard tracks time-to-patch by severity tier and asset criticality tier, showing both the distribution of actual patch times and the percentage within SLA for each matrix cell. Teams that review this weekly find it much easier to identify specific bottlenecks — a particular approver who's slow on emergency changes, a patch automation gap for specific OS types, a change management process that's adding 18 hours of latency to Tier 2 deployments — than teams who track only aggregate SLA compliance numbers.