Incident Management
Master plan for effective incident response, on-call practices, and post-incident learning. This skill covers incident lifecycle management, communication patterns, runbook development, and building a resilient on-call culture.
Status: 🟢 Master Plan Available
Key Topics
- Incident Lifecycle: Detection, triage, response, resolution, post-mortem, follow-up actions
- Severity Classification: Severity levels, escalation criteria, response SLAs, stakeholder communication
- On-Call Practices: Rotation schedules, handoffs, alert fatigue prevention, burnout avoidance
- Incident Communication: Status updates, stakeholder notifications, internal/external communication
- Runbook Development: Standardized procedures, troubleshooting guides, automated remediation
- Post-Mortem Culture: Blameless retrospectives, action items, knowledge sharing, learning from failure
- Incident Tools: Alert routing, incident tracking, war rooms, communication platforms
- Chaos Engineering: Proactive failure injection, resilience testing, game days
Primary Tools & Technologies
- Incident Management: PagerDuty, Opsgenie, VictorOps (Splunk On-Call), Incident.io
- Communication: Slack, Microsoft Teams, Zoom, dedicated incident channels
- War Rooms: Zoom, Google Meet, dedicated Slack channels with workflows
- Runbooks: Confluence, Notion, GitHub Wiki, PagerDuty Runbooks
- Post-Mortems: Jira, Linear, GitHub Issues, dedicated post-mortem templates
- Monitoring: Datadog, New Relic, Grafana, CloudWatch, Prometheus
- Log Analysis: Splunk, ELK Stack, Loki, CloudWatch Logs
- Chaos Engineering: Chaos Monkey, Gremlin, LitmusChaos, ChaosBlade
Integration Points
- Observability: Monitoring and alerting trigger incidents
- GitOps Workflows: Automated rollbacks via GitOps
- Building CI/CD Pipelines: Pipeline failures trigger incidents
- Kubernetes Operations: K8s issues and automated remediation
- Security Operations: Security incidents and response
- Platform Engineering: Platform-assisted incident response
- Testing Strategies: Post-incident test creation
- Infrastructure as Code: Infrastructure changes during incidents
Use Cases
- Setting up on-call rotation and escalation policies
- Creating runbooks for common incidents
- Designing incident communication workflows
- Implementing automated incident detection
- Running blameless post-mortems
- Building incident response playbooks
- Reducing alert fatigue and false positives
- Conducting chaos engineering experiments
- Training new engineers on incident response
Decision Framework
Severity classification (encoded as data in the sketch after this list):
- SEV-1 (Critical): Total outage, data loss, security breach, immediate response
- SEV-2 (High): Degraded service, some users affected, response within 1 hour
- SEV-3 (Medium): Limited impact, workarounds available, response within 4 hours
- SEV-4 (Low): Minor issues, scheduled fixes, response within 1 business day
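Treating the severity table as data rather than tribal knowledge keeps the SLAs consistent across paging tools and dashboards. A minimal Python sketch of the levels above; the concrete timedeltas and the `response_deadline` helper are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Severity:
    level: str
    description: str
    response_sla: timedelta  # time allowed until a responder acknowledges

# Illustrative table mirroring the levels above; tune the SLAs to your policy.
SEVERITIES = {
    "SEV-1": Severity("SEV-1", "Total outage, data loss, security breach", timedelta(minutes=5)),
    "SEV-2": Severity("SEV-2", "Degraded service, some users affected", timedelta(hours=1)),
    "SEV-3": Severity("SEV-3", "Limited impact, workarounds available", timedelta(hours=4)),
    "SEV-4": Severity("SEV-4", "Minor issues, scheduled fixes", timedelta(days=1)),  # calendar day; business-day logic omitted
}

def response_deadline(level: str, detected_at: datetime) -> datetime:
    """Latest acceptable acknowledgment time for an incident at this level."""
    return detected_at + SEVERITIES[level].response_sla
```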
On-call rotation design (see the schedule sketch after this list):
- Follow-the-sun: 24/7 coverage, regional teams, no night shifts
- Primary/Secondary: Primary on-call, secondary escalation, shared load
- Weekend rotation: Weekends staffed as a separate rotation, compressed schedules
- Compensation: On-call pay, time off, recognition, burnout prevention
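A primary/secondary schedule is worth prototyping before committing it to a scheduling tool. In the sketch below, the secondary trails the primary by one slot so the engineer who just handed off stays available for escalation; the names and weekly cadence are placeholders:

```python
from datetime import date, timedelta

def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) for a primary/secondary rotation."""
    n = len(engineers)
    for week in range(weeks):
        yield (
            start + timedelta(weeks=week),
            engineers[week % n],        # primary on-call
            engineers[(week - 1) % n],  # secondary: the previous primary
        )

# Placeholder roster; real schedules also need overrides, holidays, and handoff times.
for week_start, primary, secondary in weekly_rotation(
    ["amara", "ben", "chioma", "dev"], date(2025, 1, 6), weeks=6
):
    print(f"{week_start}: primary={primary} secondary={secondary}")
```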
Alert design principles (aggregation is sketched after this list):
- Actionable: Every alert should require a human action; purely informational signals belong on dashboards, not pages
- Contextual: Include relevant debugging information
- Prioritized: Not all alerts are equal; route accordingly
- Aggregated: Group related alerts, reduce noise
- Tested: Validate alert accuracy, reduce false positives
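Aggregation is often the cheapest fatigue reducer: page once per failure signature instead of once per affected instance. A minimal sketch, assuming alerts arrive as dicts whose `service` and `alertname` labels identify the signature (placeholder keys; match them to whatever labels your alerting stack attaches):

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], keys=("service", "alertname")) -> dict:
    """Bucket raw alerts so each fingerprint produces a single page."""
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(alert.get(k, "unknown") for k in keys)
        groups[fingerprint].append(alert)
    return groups

alerts = [
    {"service": "api", "alertname": "HighLatency", "pod": "api-1"},
    {"service": "api", "alertname": "HighLatency", "pod": "api-2"},
    {"service": "db", "alertname": "DiskFull", "pod": "db-0"},
]
for fingerprint, members in group_alerts(alerts).items():
    print(fingerprint, f"-> one page covering {len(members)} alert(s)")
```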
Runbook structure (see the schema sketch after this list):
- Symptoms: What the user/system is experiencing
- Impact: Scope and severity of the issue
- Diagnosis: How to confirm the issue
- Resolution: Step-by-step fix procedures
- Escalation: When and who to escalate to
- Prevention: Long-term fixes and action items
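Keeping runbooks as structured data instead of free-form wiki pages makes the structure above enforceable in review. A hypothetical schema plus a lint that flags empty required sections:

```python
from dataclasses import dataclass, fields

@dataclass
class Runbook:
    title: str
    symptoms: str          # what the user/system is experiencing
    impact: str            # scope and severity of the issue
    diagnosis: str         # how to confirm the issue
    resolution: list[str]  # step-by-step fix procedures
    escalation: str        # when and who to escalate to
    prevention: str = ""   # long-term fixes; may be filled in after the fact

def lint(runbook: Runbook) -> list[str]:
    """Names of required sections that are still empty."""
    required = (f.name for f in fields(runbook) if f.name != "prevention")
    return [name for name in required if not getattr(runbook, name)]
```

A CI job over the runbook repository can run the lint on every change and block merges that drop a required section.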
Post-Mortem Best Practices
Blameless culture:
- Focus on system failures, not individual mistakes
- Assume good intentions and competence
- Identify systemic issues and process gaps
- Create psychological safety for honest discussion
Post-mortem template (a skeleton generator follows the list):
- Summary: Brief incident description
- Timeline: Detailed event sequence with timestamps
- Root Cause: Technical and process failures
- Impact: Users affected, revenue impact, SLA breaches
- Resolution: What fixed the issue
- Action Items: Concrete follow-up tasks with owners
- Lessons Learned: What went well, what needs improvement
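Many teams have a bot open the post-mortem document as soon as an incident closes, pre-filled from the incident record so the timeline is not reconstructed from memory. A minimal sketch of such a generator; the template mirrors the sections above, and all field names are assumptions:

```python
from datetime import datetime

POST_MORTEM_TEMPLATE = """\
# Post-Mortem: {title} ({incident_id})

## Summary
<!-- Brief incident description -->

## Timeline (all times UTC)
{timeline}

## Root Cause
<!-- Technical and process failures -->

## Impact
<!-- Users affected, revenue impact, SLA breaches -->

## Resolution
<!-- What fixed the issue -->

## Action Items
<!-- Concrete tasks; every item needs an owner -->

## Lessons Learned
<!-- What went well, what needs improvement -->
"""

def draft_post_mortem(incident_id: str, title: str, events: list[tuple[datetime, str]]) -> str:
    """Pre-fill the skeleton from recorded events; humans write the analysis."""
    timeline = "\n".join(f"- {ts:%Y-%m-%d %H:%M}: {note}" for ts, note in events)
    return POST_MORTEM_TEMPLATE.format(
        incident_id=incident_id, title=title, timeline=timeline
    )
```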
Follow-up actions (an overdue-item check follows the list):
- Track action items to completion
- Share learnings across teams
- Update runbooks and documentation
- Implement preventive measures
- Schedule follow-up review
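Action items are where most post-mortem value leaks away, so tracking them mechanically helps. A sketch that flags open items past their due date for a weekly review; the record fields are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    incident_id: str
    description: str
    owner: str       # every item needs an owner, per the template above
    due: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open action items past their due date."""
    return [i for i in items if not i.done and i.due < today]

items = [
    ActionItem("INC-101", "Add replica lag alert", "amara", date(2025, 2, 1)),
    ActionItem("INC-101", "Update failover runbook", "ben", date(2025, 3, 1), done=True),
]
for item in overdue(items, today=date(2025, 2, 15)):
    print(f"OVERDUE {item.incident_id}: {item.description} (owner: {item.owner})")
```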
On-Call Health Metrics
- Alert volume: Alerts per shift, trending over time
- Alert accuracy: True positive rate, false alarm percentage
- Response time: Time to acknowledge, time to resolve
- Incident frequency: Incidents per week/month, by severity
- Burnout indicators: Rotation balance, weekend incidents, after-hours load
- Runbook coverage: Percentage of incidents with runbooks
- Post-mortem completion: Percentage of incidents with post-mortems (several of these metrics are computed in the sketch below)
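Several of these metrics fall out of the incident records directly. A minimal sketch, assuming each closed incident is a dict with `opened`/`acknowledged`/`resolved` datetimes and `true_positive`/`had_runbook` booleans (an illustrative schema, not a standard one):

```python
from statistics import median

def oncall_health(incidents: list[dict]) -> dict:
    """Summarize alert accuracy, response times, and runbook coverage."""
    def minutes(later, earlier):
        return (later - earlier).total_seconds() / 60

    n = len(incidents)  # assumes at least one closed incident
    return {
        "alert_accuracy": sum(i["true_positive"] for i in incidents) / n,
        "median_ack_minutes": median(minutes(i["acknowledged"], i["opened"]) for i in incidents),
        "median_resolve_minutes": median(minutes(i["resolved"], i["opened"]) for i in incidents),
        "runbook_coverage": sum(i["had_runbook"] for i in incidents) / n,
    }
```

Trending these per shift rather than per month makes a single rough rotation week visible instead of averaging it away.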