Managing Incidents
Provides end-to-end incident management guidance covering detection, response, communication, and learning, with an emphasis on SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
When to Use
Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics
Core Principles
Incident Management Philosophy
Declare Early and Often: Do not wait for certainty. Declaring early enables coordination and prevents a delayed response, and a declared incident can always be downgraded later.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable the feature, fail over). Debug and fix the root cause after stability is restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
Severity Classification
Standard severity levels with response times:
SEV0 (P0) - Critical Outage
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error
Severity Classification Decision Tree
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO → Is major functionality degraded?
   ├─ YES → Is there a workaround?
   │  ├─ YES → SEV1
   │  └─ NO → SEV0
   └─ NO → Are customers impacted?
      ├─ YES → SEV2
      └─ NO → SEV3
Incident Roles
Incident Commander (IC)
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed
Communications Lead
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs)
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC
Scribe
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction
Role Assignment by Severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe
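A rough sketch of this mapping as data, so tooling can auto-invite the right responders when an incident is declared (the role identifiers are illustrative, not a standard):

```python
# Roles to page per severity; identifiers are hypothetical.
ROLES_BY_SEVERITY = {
    "SEV0": ["incident_commander", "communications_lead", "sme", "scribe"],
    "SEV1": ["incident_commander", "sme"],
    "SEV2": ["responder"],
    "SEV3": ["responder"],
}

def required_roles(severity: str) -> list[str]:
    return ROLES_BY_SEVERITY.get(severity, ["responder"])
```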
On-Call Management
Rotation Patterns
Primary + Secondary:
- Primary: First responder
- Secondary: Paged if the primary does not acknowledge within 5 minutes (escalation logic sketched below)
- Rotation length: 1 week (long enough to build context, short enough to avoid burnout)
Follow-the-Sun (24/7):
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams
Tiered Escalation:
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)
Best Practices
- Rotation length: 1 week
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded
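For the primary + secondary pattern above, the acknowledgement timeout can be sketched as follows (a simplification; platforms like PagerDuty implement this loop for you, and the notify/acknowledged callables are placeholders):

```python
import time

ACK_TIMEOUT_SECONDS = 5 * 60  # escalate if the primary does not acknowledge in 5 minutes

def page_with_escalation(primary: str, secondary: str, notify, acknowledged) -> str:
    """notify(who) sends a page; acknowledged(who) polls whether it was acked."""
    notify(primary)
    deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if acknowledged(primary):
            return primary
        time.sleep(10)  # poll every 10 seconds
    notify(secondary)  # primary missed the window; page the backup
    return secondary
```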
Incident Response Workflow
Standard incident lifecycle:
Detection → Triage → Declaration → Investigation
    → Mitigation → Resolution → Monitoring → Closure
    → Post-Mortem (within 48 hours)
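One way to keep tooling honest about this lifecycle is an explicit state machine. The sketch below (stage names from the flow above) models only the forward path; real incidents can regress, e.g. from Monitoring back to Investigation:

```python
from enum import Enum

class Stage(Enum):
    DETECTION = 1
    TRIAGE = 2
    DECLARATION = 3
    INVESTIGATION = 4
    MITIGATION = 5
    RESOLUTION = 6
    MONITORING = 7
    CLOSURE = 8
    POST_MORTEM = 9

# Map each stage to its single allowed successor (happy path only).
NEXT_STAGE = dict(zip(list(Stage), list(Stage)[1:]))

def advance(current: Stage, target: Stage) -> Stage:
    if NEXT_STAGE.get(current) is not target:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```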
Key Decision Points
When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed
When to Close:
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues
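Those closure criteria can be captured in a small predicate; a sketch, assuming your monitoring can report baseline status and open customer reports:

```python
from datetime import datetime, timedelta, timezone

def ready_to_close(resolved_at: datetime, metrics_at_baseline: bool,
                   open_customer_reports: int) -> bool:
    """True when the issue has been stable 30+ minutes with clean metrics and no reports."""
    stable_for = datetime.now(timezone.utc) - resolved_at
    return (stable_for >= timedelta(minutes=30)
            and metrics_at_baseline
            and open_customer_reports == 0)
```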
Communication Protocols
Internal Communication
Incident Slack Channel:
- Format: #incident-YYYY-MM-DD-topic-description
- Pin: severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones
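A minimal sketch of posting these cadence updates to the incident channel via a Slack incoming webhook (the webhook URL is a placeholder you generate in Slack):

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_status_update(severity: str, ic: str, status: str) -> None:
    """Post a structured update; Slack incoming webhooks accept a {"text": ...} payload."""
    text = f"[{severity}] IC: {ic} | Status: {status}"
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```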
External Communication
Status Page:
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
- Data Breach: GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a breach
- Financial Services: Prompt notification to regulators; specific requirements vary by jurisdiction
- Healthcare: HIPAA breach notification rules (affected individuals notified without unreasonable delay, no later than 60 days)
Runbooks and Playbooks
Runbook Structure
Every runbook should include:
- Trigger: Alert conditions that activate this runbook
- Severity: Expected severity level
- Prerequisites: System state requirements
- Steps: Numbered, executable commands (copy-pasteable)
- Verification: How to confirm fix worked
- Rollback: How to undo if steps fail
- Owner: Team/person responsible
- Last Updated: Date of last revision
Best Practices
- Executable: Commands copy-pasteable, not just descriptions
- Tested: Run during disaster recovery drills
- Versioned: Track changes in Git
- Linked: Reference from alert definitions
- Automated: Convert manual steps to scripts over time
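Building on the last point, here is a sketch of a manual runbook graduated into a script: each step pairs a command with a verification, and any failure triggers rollback (all commands are illustrative, not real tooling):

```python
import subprocess

# (description, command, verification) tuples -- commands are hypothetical examples.
STEPS = [
    ("Disable feature flag", ["./flags", "disable", "new-checkout"],
     ["./flags", "status", "new-checkout"]),
    ("Restart API pods", ["kubectl", "rollout", "restart", "deploy/api"],
     ["kubectl", "rollout", "status", "deploy/api"]),
]

ROLLBACK = [["./flags", "enable", "new-checkout"]]

def run_runbook() -> None:
    for description, command, verify in STEPS:
        print(f"==> {description}")
        subprocess.run(command, check=True)   # execute the step
        subprocess.run(verify, check=True)    # confirm it worked before continuing

def rollback() -> None:
    for command in ROLLBACK:
        subprocess.run(command, check=True)

if __name__ == "__main__":
    try:
        run_runbook()
    except subprocess.CalledProcessError:
        rollback()  # undo if any step or verification fails
        raise
```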
Blameless Post-Mortems
Blameless Culture Tenets
Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.
Post-Mortem Process
1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
- Timeline walkthrough
- 5 Whys Analysis to identify systemic root causes
- What Went Well / What Went Wrong
- Define action items with owners and due dates
4. Post-Mortem Document:
- Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- Distribution: Engineering, product, support, leadership
- Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
Metrics and Continuous Improvement
Key Incident Metrics
MTTA (Mean Time To Acknowledge):
- Target: < 5 minutes for SEV1
- Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
- Target: < 1 hour for SEV1
- Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
- Target: > 30 days for critical services
- Improvement: Root cause fixes
Incident Frequency:
- Track: SEV0, SEV1, SEV2 counts per month
- Target: Downward trend
Action Item Completion Rate:
- Target: > 90%
- Improvement: Sprint integration, ownership clarity
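A sketch of computing MTTA and MTTR from exported incident records (the field names are assumptions about your data, not a standard schema):

```python
from statistics import mean

def mtta_minutes(incidents: list[dict]) -> float:
    """Mean time from detection to acknowledgement, in minutes."""
    return mean(
        (i["acknowledged_at"] - i["detected_at"]).total_seconds() / 60
        for i in incidents
    )

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean time from detection to recovery, in minutes."""
    return mean(
        (i["resolved_at"] - i["detected_at"]).total_seconds() / 60
        for i in incidents
    )

# Each record holds datetime values, e.g.:
# {"detected_at": ..., "acknowledged_at": ..., "resolved_at": ...}
```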
Continuous Improvement Loop
Incident → Post-Mortem → Action Items → Prevention
   ↑                                        ↓
   └──────────── Fewer Incidents ───────────┘
Tool Selection
Incident Management Platforms
PagerDuty:
- Best for: Established enterprises, complex escalation policies
- Cost: $19-41/user/month
- When: Team size 10+, budget $500+/month
Opsgenie:
- Best for: Atlassian ecosystem users, flexible routing
- Cost: $9-29/user/month
- When: Using Atlassian products, budget $200-500/month
incident.io:
- Best for: Modern teams, AI-powered response, Slack-native
- When: Team size 5-50, Slack-centric culture
Status Page Solutions
Statuspage.io: Most trusted, easy setup ($29-399/month)
Instatus: Budget-friendly, modern design ($19-99/month)
Anti-Patterns to Avoid
- Delayed Declaration: Waiting for certainty before declaring incident
- Skipping Post-Mortems: "Small" incidents still provide learning
- Blame Culture: Punishing individuals prevents systemic learning
- Ignoring Action Items: Post-mortems without follow-through waste time
- No Clear IC: Multiple people leading creates confusion
- Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
- Hands-On IC: IC should delegate debugging, not do it themselves
Implementation Checklist
Phase 1: Foundation (Week 1)
- Define severity levels (SEV0-SEV3)
- Choose incident management platform
- Set up basic on-call rotation
- Create incident Slack channel template
Phase 2: Processes (Weeks 2-3)
- Create first 5 runbooks for common incidents
- Set up status page
- Train team on incident response
- Conduct tabletop exercise
Phase 3: Culture (Weeks 4+)
- Conduct first blameless post-mortem
- Establish post-mortem cadence
- Implement MTTA/MTTR dashboards
- Track action items in sprint planning
Phase 4: Optimization (Months 3-6)
- Automate incident declaration
- Implement runbook automation
- Monthly disaster recovery drills
- Quarterly incident trend reviews
Related Skills
- Platform Engineering - Platform reliability and self-service reducing incidents
- Implementing GitOps - GitOps enables fast rollback during incidents
- Testing Strategies - Comprehensive testing prevents incidents