On-call burnout is the most underreported retention risk in MSP teams. Engineers who routinely get paged three times a night eventually leave. The good news: with the right schedule design, on-call can be genuinely manageable — and sometimes even fair.
The problem with most MSP on-call schedules
Most MSP on-call schedules look like this: one engineer's personal phone number is hardcoded into every monitoring tool. When they're on vacation, nothing changes. When they leave the company, the alerts go nowhere.
This isn't a schedule; it's a single point of failure. Here's how to build something better.
Principle 1: Separate primary and secondary coverage
The primary on-call engineer handles the initial response. The secondary is paged only if the primary doesn't acknowledge within a defined window (typically 10–15 minutes). This single change eliminates most of the “duplicate page” problem where multiple engineers scramble for the same alert.
Concretely:
- Primary: paged immediately on critical alerts
- Secondary: paged if primary doesn't acknowledge within 12 minutes
- Manager / escalation: paged if secondary doesn't acknowledge within 20 minutes
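The escalation chain above can be sketched as a small lookup. This is a minimal illustration, not any particular tool's API: the names `EscalationStep`, `POLICY`, and `roles_paged` are hypothetical, and it assumes both windows (12 and 20 minutes) are measured from the moment the alert fired rather than from the previous page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    role: str
    page_after_minutes: int  # minutes unacknowledged since the alert fired


# Thresholds taken from the list above.
# Assumption: both windows start when the alert fires, not at the prior page.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 12),
    EscalationStep("manager", 20),
]


def roles_paged(minutes_unacknowledged: float, policy=POLICY) -> list[str]:
    """Return every role that should have been paged by this point."""
    return [
        step.role
        for step in policy
        if minutes_unacknowledged >= step.page_after_minutes
    ]


print(roles_paged(5))   # ['primary']
print(roles_paged(12))  # ['primary', 'secondary']
print(roles_paged(25))  # ['primary', 'secondary', 'manager']
```

Keeping the policy as data rather than hardcoded branches means the windows can change without touching the routing logic.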
Principle 2: Rotation length matters more than you think
Common rotation lengths and their trade-offs:
- Daily rotation: High context-switching, but no one is on-call for more than a day. Works for small teams.
- Weekly rotation: Most common. Good balance of continuity and fairness. The standard recommendation for teams of 4+.
- Bi-weekly (two-week shifts): Reduces handoff overhead but increases individual burden. Not recommended.
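To compare rotation lengths on a real roster, it can help to compute who is on call on a given date. The sketch below assumes a simple round-robin with no overrides; the function name `on_call_for`, the team names, and the start date are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], start: date,
                shift_days: int = 7) -> str:
    """Return who is on call on `day` in a round-robin rotation.

    shift_days=1 models a daily rotation, 7 weekly, 14 two-week shifts.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return engineers[shift_index % len(engineers)]


team = ["ana", "ben", "chi", "dev"]   # hypothetical four-person team
start = date(2024, 1, 1)              # a Monday

print(on_call_for(date(2024, 1, 1), team, start))   # ana
print(on_call_for(date(2024, 1, 10), team, start))  # ben
print(on_call_for(date(2024, 1, 29), team, start))  # ana (rotation wraps)
```

With four engineers on weekly shifts, each person is on call one week in four; switching `shift_days` to 14 keeps the same share but doubles the length of each stretch.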
Principle 3: Build override culture from day one
Life happens. Engineers get sick, have family emergencies, and take vacations. Your schedule system must make it trivially easy to swap a shift — without requiring a manager to manually update every monitoring tool.
Good on-call tooling allows any engineer to:
- Request a shift swap with another team member
- Set a temporary override (e.g., “I'm covering for Sarah on Thursday”)
- Extend or shorten their shift window without manager involvement
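One way to make overrides cheap is to layer them on top of the base schedule instead of editing it. The following is a minimal sketch under that assumption; `Override` and `effective_on_call` are hypothetical names, and the rule that the most recently added override wins is a design choice made for the example, not something a specific tool guarantees.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def effective_on_call(at: datetime, scheduled: str,
                      overrides: list[Override]) -> str:
    """Return the override holder covering `at`, else the scheduled engineer.

    Assumption: if overrides overlap, the one added last takes precedence.
    """
    for override in reversed(overrides):
        if override.start <= at < override.end:
            return override.engineer
    return scheduled


# "I'm covering for Sarah on Thursday" (2024-01-04 is a Thursday).
overrides = [
    Override("priya", datetime(2024, 1, 4), datetime(2024, 1, 5)),
]

print(effective_on_call(datetime(2024, 1, 4, 3, 0), "sarah", overrides))  # priya
print(effective_on_call(datetime(2024, 1, 5, 3, 0), "sarah", overrides))  # sarah
```

Because the base rotation is never modified, an override expires on its own and the schedule falls back to the original assignment.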
The handoff ritual
At the end of every on-call shift, require a written handoff (a Slack message is fine) that covers:
- Any unresolved incidents or active investigations
- Clients that had issues in the past 24 hours
- Anything the incoming engineer should watch for
Principle 4: Pay people fairly for on-call
This is the most important principle and the one most ignored. On-call is real work. If your engineers are expected to respond to pages at 3am without additional compensation, you're borrowing against their goodwill.
Common structures that work:
- Flat weekly stipend for being on the on-call rotation (e.g., $200/week)
- Hourly rate for time actually spent responding (e.g., $50/hr after hours)
- Comp time: a day off for every weekend on-call
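To see what these figures add up to, here is a small worked calculation. It assumes the stipend and the hourly rate are combined, which the list above does not require (they can also be used separately); the function name and defaults are illustrative and reuse the example amounts of $200/week and $50/hr.

```python
def weekly_on_call_pay(hours_responding: float,
                       stipend: float = 200.0,
                       hourly_rate: float = 50.0) -> float:
    """Flat stipend plus hourly pay for time spent responding after hours.

    Assumption: both structures apply together; set either to 0 to model
    a stipend-only or hourly-only scheme.
    """
    if hours_responding < 0:
        raise ValueError("hours_responding cannot be negative")
    return stipend + hours_responding * hourly_rate


print(weekly_on_call_pay(0))     # 200.0 (quiet week, stipend only)
print(weekly_on_call_pay(3.5))   # 375.0 (200 + 3.5 * 50)
print(weekly_on_call_pay(3.5, stipend=0.0))  # 175.0 (hourly only)
```

A quiet week still pays the stipend, which compensates for the restriction of being reachable, while the hourly component scales with how disruptive the week actually was.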
What good looks like
A well-run MSP on-call program has:
- Fewer than 5 pages per night shift on average (anything more points to an alert configuration problem)
- Average acknowledgement time under 10 minutes
- Zero instances of “no one got paged” due to schedule gaps
- Engineers who willingly take on-call shifts because it's manageable and compensated
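The three measurable targets above can be checked against shift records. This is a sketch under assumed data shapes: `NightShift`, its fields, and `program_health` are hypothetical, and how the numbers are exported from a paging tool is left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class NightShift:
    pages: int
    ack_minutes: list[float]  # time to acknowledge each page
    unrouted_alerts: int      # alerts that reached no one (schedule gap)


def program_health(shifts: list[NightShift]) -> dict[str, bool]:
    """Check the thresholds from the list above against recorded shifts."""
    if not shifts:
        raise ValueError("need at least one shift to evaluate")
    all_acks = [m for shift in shifts for m in shift.ack_minutes]
    return {
        "pages_per_night_ok": mean(s.pages for s in shifts) < 5,
        # With no pages at all there is no acknowledgement time to fail on.
        "ack_time_ok": mean(all_acks) < 10 if all_acks else True,
        "no_schedule_gaps": sum(s.unrouted_alerts for s in shifts) == 0,
    }


healthy = [
    NightShift(pages=2, ack_minutes=[4.0, 6.0], unrouted_alerts=0),
    NightShift(pages=3, ack_minutes=[5.0, 9.0, 11.0], unrouted_alerts=0),
]
print(program_health(healthy))
# {'pages_per_night_ok': True, 'ack_time_ok': True, 'no_schedule_gaps': True}
```

The fourth item, whether engineers take shifts willingly, is not something a script can measure; it has to come from asking the team.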
Getting here takes tooling, process, and culture working together. The tooling is the easy part — AlertFlow handles the scheduling, routing, and escalation automatically. The process and culture require you.