Managing Incidents
Provides end-to-end incident management guidance covering detection, response, communication, and learning, with an emphasis on SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
When to Use
Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics
Core Principles
Incident Management Philosophy
Declare Early and Often: Do not wait for certainty. Declaring early enables coordination and prevents a delayed response, and a declared incident can always be downgraded later.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable the feature, fail over). Debug and fix the root cause after stability is restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
Severity Classification
Standard severity levels with response times:
SEV0 (P0) - Critical Outage
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error
Severity Classification Decision Tree
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO → Is major functionality degraded?
   ├─ YES → Is there a workaround?
   │  ├─ YES → SEV1
   │  └─ NO → SEV0
   └─ NO → Are customers impacted?
      ├─ YES → SEV2
      └─ NO → SEV3
Incident Roles
Incident Commander (IC)
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed
Communications Lead
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs)
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC
Scribe
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction
Role Assignment by Severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe
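A rough sketch of this mapping as data, so tooling can auto-invite the right responders when an incident is declared (the role identifiers are illustrative, not a standard):

```python
# Roles to page per severity; identifiers are hypothetical.
ROLES_BY_SEVERITY = {
    "SEV0": ["incident_commander", "communications_lead", "sme", "scribe"],
    "SEV1": ["incident_commander", "sme"],
    "SEV2": ["responder"],
    "SEV3": ["responder"],
}

def required_roles(severity: str) -> list[str]:
    return ROLES_BY_SEVERITY.get(severity, ["responder"])
```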
On-Call Management
Rotation Patterns
Primary + Secondary:
- Primary: First responder
- Secondary: Paged if the primary does not acknowledge within 5 minutes (escalation logic sketched below)
- Rotation length: 1 week (long enough to build context, short enough to avoid burnout)
Follow-the-Sun (24/7):
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams
Tiered Escalation:
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)
Best Practices
- Rotation length: 1 week
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded
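For the primary + secondary pattern above, the acknowledgement timeout can be sketched as follows (a simplification; platforms like PagerDuty implement this loop for you, and the notify/acknowledged callables are placeholders):

```python
import time

ACK_TIMEOUT_SECONDS = 5 * 60  # escalate if the primary does not acknowledge in 5 minutes

def page_with_escalation(primary: str, secondary: str, notify, acknowledged) -> str:
    """notify(who) sends a page; acknowledged(who) polls whether it was acked."""
    notify(primary)
    deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if acknowledged(primary):
            return primary
        time.sleep(10)  # poll every 10 seconds
    notify(secondary)  # primary missed the window; page the backup
    return secondary
```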
Incident Response Workflow
Standard incident lifecycle:
Detection → Triage → Declaration → Investigation
    → Mitigation → Resolution → Monitoring → Closure
    → Post-Mortem (within 48 hours)
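One way to keep tooling honest about this lifecycle is an explicit state machine. The sketch below (stage names from the flow above) models only the forward path; real incidents can regress, e.g. from Monitoring back to Investigation:

```python
from enum import Enum

class Stage(Enum):
    DETECTION = 1
    TRIAGE = 2
    DECLARATION = 3
    INVESTIGATION = 4
    MITIGATION = 5
    RESOLUTION = 6
    MONITORING = 7
    CLOSURE = 8
    POST_MORTEM = 9

# Map each stage to its single allowed successor (happy path only).
NEXT_STAGE = dict(zip(list(Stage), list(Stage)[1:]))

def advance(current: Stage, target: Stage) -> Stage:
    if NEXT_STAGE.get(current) is not target:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```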
Key Decision Points
When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed
When to Close:
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues
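Those closure criteria can be captured in a small predicate; a sketch, assuming your monitoring can report baseline status and open customer reports:

```python
from datetime import datetime, timedelta, timezone

def ready_to_close(resolved_at: datetime, metrics_at_baseline: bool,
                   open_customer_reports: int) -> bool:
    """True when the issue has been stable 30+ minutes with clean metrics and no reports."""
    stable_for = datetime.now(timezone.utc) - resolved_at
    return (stable_for >= timedelta(minutes=30)
            and metrics_at_baseline
            and open_customer_reports == 0)
```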
Communication Protocols
Internal Communication
Incident Slack Channel:
- Format: #incident-YYYY-MM-DD-topic-description
- Pin: severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones
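A minimal sketch of posting these cadence updates to the incident channel via a Slack incoming webhook (the webhook URL is a placeholder you generate in Slack):

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_status_update(severity: str, ic: str, status: str) -> None:
    """Post a structured update; Slack incoming webhooks accept a {"text": ...} payload."""
    text = f"[{severity}] IC: {ic} | Status: {status}"
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```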
External Communication
Status Page:
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
- Data Breach: GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a breach
- Financial Services: Prompt notification to regulators; specific requirements vary by jurisdiction
- Healthcare: HIPAA breach notification rules (affected individuals notified without unreasonable delay, no later than 60 days)
Runbooks and Playbooks
Runbook Structure
Every runbook should include:
- Trigger: Alert conditions that activate this runbook
- Severity: Expected severity level
- Prerequisites: System state requirements
- Steps: Numbered, executable commands (copy-pasteable)
- Verification: How to confirm fix worked
- Rollback: How to undo if steps fail
- Owner: Team/person responsible
- Last Updated: Date of last revision
Best Practices
- Executable: Commands copy-pasteable, not just descriptions
- Tested: Run during disaster recovery drills
- Versioned: Track changes in Git
- Linked: Reference from alert definitions
- Automated: Convert manual steps to scripts over time
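Building on the last point, here is a sketch of a manual runbook graduated into a script: each step pairs a command with a verification, and any failure triggers rollback (all commands are illustrative, not real tooling):

```python
import subprocess

# (description, command, verification) tuples -- commands are hypothetical examples.
STEPS = [
    ("Disable feature flag", ["./flags", "disable", "new-checkout"],
     ["./flags", "status", "new-checkout"]),
    ("Restart API pods", ["kubectl", "rollout", "restart", "deploy/api"],
     ["kubectl", "rollout", "status", "deploy/api"]),
]

ROLLBACK = [["./flags", "enable", "new-checkout"]]

def run_runbook() -> None:
    for description, command, verify in STEPS:
        print(f"==> {description}")
        subprocess.run(command, check=True)   # execute the step
        subprocess.run(verify, check=True)    # confirm it worked before continuing

def rollback() -> None:
    for command in ROLLBACK:
        subprocess.run(command, check=True)

if __name__ == "__main__":
    try:
        run_runbook()
    except subprocess.CalledProcessError:
        rollback()  # undo if any step or verification fails
        raise
```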
Blameless Post-Mortems
Blameless Culture Tenets
Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.
Post-Mortem Process
1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
- Timeline walkthrough
- 5 Whys Analysis to identify systemic root causes
- What Went Well / What Went Wrong
- Define action items with owners and due dates
4. Post-Mortem Document:
- Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- Distribution: Engineering, product, support, leadership
- Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
Metrics and Continuous Improvement
Key Incident Metrics
MTTA (Mean Time To Acknowledge):
- Target: < 5 minutes for SEV1
- Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
- Target: < 1 hour for SEV1
- Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
- Target: > 30 days for critical services
- Improvement: Root cause fixes
Incident Frequency:
- Track: SEV0, SEV1, SEV2 counts per month
- Target: Downward trend
Action Item Completion Rate:
- Target: > 90%
- Improvement: Sprint integration, ownership clarity
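A sketch of computing MTTA and MTTR from exported incident records (the field names are assumptions about your data, not a standard schema):

```python
from statistics import mean

def mtta_minutes(incidents: list[dict]) -> float:
    """Mean time from detection to acknowledgement, in minutes."""
    return mean(
        (i["acknowledged_at"] - i["detected_at"]).total_seconds() / 60
        for i in incidents
    )

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean time from detection to recovery, in minutes."""
    return mean(
        (i["resolved_at"] - i["detected_at"]).total_seconds() / 60
        for i in incidents
    )

# Each record holds datetime values, e.g.:
# {"detected_at": ..., "acknowledged_at": ..., "resolved_at": ...}
```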
Continuous Improvement Loop
Incident → Post-Mortem → Action Items → Prevention
   ↑                                        ↓
   └──────────── Fewer Incidents ───────────┘
Tool Selection
Incident Management Platforms
PagerDuty:
- Best for: Established enterprises, complex escalation policies
- Cost: $19-41/user/month
- When: Team size 10+, budget $500+/month
Opsgenie:
- Best for: Atlassian ecosystem users, flexible routing
- Cost: $9-29/user/month
- When: Using Atlassian products, budget $200-500/month
incident.io:
- Best for: Modern teams, AI-powered response, Slack-native
- When: Team size 5-50, Slack-centric culture
Status Page Solutions
Statuspage.io: Most trusted, easy setup ($29-399/month)
Instatus: Budget-friendly, modern design ($19-99/month)
Anti-Patterns to Avoid
- Delayed Declaration: Waiting for certainty before declaring incident
- Skipping Post-Mortems: "Small" incidents still provide learning
- Blame Culture: Punishing individuals prevents systemic learning
- Ignoring Action Items: Post-mortems without follow-through waste time
- No Clear IC: Multiple people leading creates confusion
- Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
- Hands-On IC: IC should delegate debugging, not do it themselves
Implementation Checklist
Phase 1: Foundation (Week 1)
- Define severity levels (SEV0-SEV3)
- Choose incident management platform
- Set up basic on-call rotation
- Create incident Slack channel template
Phase 2: Processes (Weeks 2-3)
- Create first 5 runbooks for common incidents
- Set up status page
- Train team on incident response
- Conduct tabletop exercise
Phase 3: Culture (Weeks 4+)
- Conduct first blameless post-mortem
- Establish post-mortem cadence
- Implement MTTA/MTTR dashboards
- Track action items in sprint planning
Phase 4: Optimization (Months 3-6)
- Automate incident declaration
- Implement runbook automation
- Monthly disaster recovery drills
- Quarterly incident trend reviews
Related Skills
- Platform Engineering - Platform reliability and self-service reducing incidents
- Implementing GitOps - GitOps enables fast rollback during incidents
- Testing Strategies - Comprehensive testing prevents incidents