Team Management · 10 min read

Building an On-Call Schedule Your Team Won't Hate

Mike Torres
Engineering Lead · November 1, 2024

On-call burnout is one of the most underreported retention risks in MSP teams. Engineers who routinely get paged three times a night eventually leave. The good news: with the right schedule design, on-call can be genuinely manageable, and sometimes even fair.

The problem with most MSP on-call schedules

Most MSP on-call schedules look like this: one engineer's personal phone number is hardcoded into every monitoring tool. When they're on vacation, nothing changes. When they leave the company, the alerts go nowhere.

This isn't a schedule — it's learned helplessness. Here's how to build something better.

Principle 1: Separate primary and secondary coverage

The primary on-call engineer handles the initial response. The secondary is paged only if the primary doesn't acknowledge within a defined window (typically 10–15 minutes). This single change eliminates most of the “duplicate page” problem, where multiple engineers scramble for the same alert.

Concretely:

  • Primary: paged immediately on critical alerts
  • Secondary: paged if primary doesn't acknowledge within 12 minutes
  • Manager / escalation: paged if secondary doesn't acknowledge within 20 minutes
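
The chain above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the `Tier` class and `tiers_paged` function are hypothetical names, and it assumes each acknowledgement window starts when that tier is paged, so the manager is reached 12 + 20 = 32 minutes after the alert. If your windows are measured from the original alert instead, adjust the numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    ack_window_min: Optional[int]  # minutes this tier has to acknowledge; None = last tier


# Windows taken from the list above. Assumption: each window starts when
# that tier is paged, not when the alert first fired.
POLICY = [
    Tier("primary", 12),
    Tier("secondary", 20),
    Tier("manager", None),
]


def tiers_paged(minutes_since_alert: float, policy=POLICY) -> list[str]:
    """Return every tier that has been paged so far for an unacknowledged alert."""
    paged = []
    page_time = 0.0  # minutes after the alert at which the current tier is paged
    for tier in policy:
        if minutes_since_alert < page_time:
            break
        paged.append(tier.name)
        if tier.ack_window_min is None:
            break
        page_time += tier.ack_window_min
    return paged


print(tiers_paged(5))   # ['primary']
print(tiers_paged(15))  # ['primary', 'secondary']
print(tiers_paged(40))  # ['primary', 'secondary', 'manager']
```

Keeping the policy as data rather than scattering timeouts across monitoring tools means there is one place to change when the windows turn out to be too tight or too generous.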

Principle 2: Rotation length matters more than you think

Common rotation lengths and their trade-offs:

  • Daily rotation: High context-switching, but no one is on-call for more than a day. Works for small teams.
  • Weekly rotation: Most common. Good balance of continuity and fairness. The standard recommendation for teams of 4+.
  • Bi-weekly (two-week shifts): Reduces handoff overhead but increases individual burden. Not recommended.

Rule of thumb: If your team has fewer than 4 engineers, a weekly rotation puts each engineer on-call every second or third week. That's too frequent. Hire before you set up the rotation, or use a tier-based system where nights and weekends are handled by a subset of the rotation.
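
The arithmetic behind the rule of thumb is simple enough to check before you commit to a rotation. The sketch below assumes a plain round-robin with everyone taking equal shifts; the function names and engineer names are made up for illustration.

```python
from datetime import date, timedelta


def weeks_between_shifts(team_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of the same engineer's next one."""
    return team_size * shift_weeks


def build_rotation(engineers: list[str], start: date, shifts: int, shift_days: int = 7):
    """Round-robin schedule as (shift_start, engineer) pairs."""
    return [
        (start + timedelta(days=i * shift_days), engineers[i % len(engineers)])
        for i in range(shifts)
    ]


for size in (2, 3, 4, 6):
    print(f"{size} engineers, weekly shifts: on-call every {weeks_between_shifts(size)} weeks")

# Hypothetical three-person team: the first engineer is back on-call in week 4.
for shift_start, engineer in build_rotation(["ana", "ben", "chen"], date(2024, 11, 4), 4):
    print(shift_start.isoformat(), engineer)
```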

Principle 3: Build override culture from day one

Life happens. Engineers get sick, have family emergencies, and take vacations. Your schedule system must make it trivially easy to swap a shift — without requiring a manager to manually update every monitoring tool.

Good on-call tooling allows any engineer to:

  • Request a shift swap with another team member
  • Set a temporary override (e.g., “I'm covering for Sarah on Thursday”)
  • Extend or shorten their shift window without manager involvement
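
One way to make swaps cheap is to treat overrides as a layer on top of the base rotation, and have every monitoring tool ask "who is on call right now?" instead of storing a phone number. The sketch below is a hypothetical illustration of that lookup (the `Shift` class and `who_is_on_call` function are assumptions, not a real tool's API), using the "covering for Sarah on Thursday" case.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def who_is_on_call(moment, base_schedule, overrides):
    """Overrides win over the base schedule; the most recently added override wins."""
    for shift in reversed(overrides):
        if shift.covers(moment):
            return shift.engineer
    for shift in base_schedule:
        if shift.covers(moment):
            return shift.engineer
    return None  # schedule gap: nobody would be paged


# Sarah has the week of November 4; Mike covers Thursday, November 7.
base = [Shift("sarah", datetime(2024, 11, 4), datetime(2024, 11, 11))]
overrides = [Shift("mike", datetime(2024, 11, 7), datetime(2024, 11, 8))]

print(who_is_on_call(datetime(2024, 11, 6, 22, 0), base, overrides))  # sarah
print(who_is_on_call(datetime(2024, 11, 7, 3, 0), base, overrides))   # mike
```

Because the base rotation is never edited, removing the override restores the original schedule, and a `None` result is an explicit signal that the schedule has a gap.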

The handoff ritual

At the end of every on-call shift, require a written handoff (a Slack message is fine) that covers:

  • Any unresolved incidents or active investigations
  • Clients that had issues in the past 24 hours
  • Anything the incoming engineer should watch for
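
If you want every handoff to look the same, a tiny template helps. The function below is a hypothetical sketch that renders the three items above as plain text you could paste into Slack; the names and the example client are invented.

```python
def format_handoff(outgoing, incoming, open_incidents, recent_client_issues, watch_items):
    """Render a plain-text handoff note with the three sections listed above."""

    def section(title, items):
        # An explicit "none" shows the section was considered, not forgotten.
        return [f"{title}:"] + ([f"  - {item}" for item in items] or ["  - none"])

    lines = [f"On-call handoff: {outgoing} -> {incoming}"]
    lines += section("Unresolved incidents / active investigations", open_incidents)
    lines += section("Clients with issues in the past 24 hours", recent_client_issues)
    lines += section("Watch for", watch_items)
    return "\n".join(lines)


print(format_handoff(
    "sarah",
    "mike",
    open_incidents=["Backup job failing at Acme"],
    recent_client_issues=[],
    watch_items=["Cert expiry on Friday"],
))
```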

Principle 4: Pay people fairly for on-call

This is the most important principle and the one most ignored. On-call is real work. If your engineers are expected to respond to pages at 3am without additional compensation, you're borrowing against their goodwill.

Common structures that work:

  • Flat weekly stipend for being in the on-call rotation (e.g., $200/week)
  • Hourly rate for time actually spent responding (e.g., $50/hr after hours)
  • Comp time: a day off for every weekend on-call
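
To see what these numbers mean for a budget, here is a small worked calculation using the example figures above. Combining the stipend with the hourly rate is an assumption made for illustration; the structures are listed as separate options, and setting either value to zero models one on its own.

```python
def weekly_on_call_pay(stipend: float = 200.0,
                       hourly_rate: float = 50.0,
                       response_hours: float = 0.0) -> float:
    """On-call pay for one week: flat stipend plus paid response time."""
    return stipend + hourly_rate * response_hours


# A quiet week, and a week with 3.5 hours of after-hours response.
print(weekly_on_call_pay())                    # 200.0
print(weekly_on_call_pay(response_hours=3.5))  # 375.0
```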

What good looks like

A well-run MSP on-call program has:

  • Fewer than 5 pages per night shift on average (anything more points to an alert configuration problem)
  • Average acknowledgement time under 10 minutes
  • Zero instances of “no one got paged” due to schedule gaps
  • Engineers who willingly take on-call shifts because it's manageable and compensated
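
The first two targets are easy to track if you can export page data. The sketch below assumes a hypothetical export with one acknowledgement time per page (`None` when nobody acknowledged) and checks it against the thresholds above. Schedule gaps need a separate check against the schedule itself, because a page that was never sent leaves no record.

```python
from statistics import mean
from typing import Optional


def health_report(ack_minutes_per_page: list[Optional[float]], nights_covered: int) -> dict:
    """Summarize paging load and responsiveness over a number of night shifts."""
    if nights_covered <= 0:
        raise ValueError("nights_covered must be positive")
    acked = [m for m in ack_minutes_per_page if m is not None]
    pages_per_night = len(ack_minutes_per_page) / nights_covered
    avg_ack = mean(acked) if acked else None
    return {
        "pages_per_night": pages_per_night,
        "avg_ack_minutes": avg_ack,
        "too_noisy": pages_per_night >= 5,                       # target: fewer than 5
        "ack_too_slow": avg_ack is not None and avg_ack >= 10,   # target: under 10 minutes
        "never_acknowledged": len(ack_minutes_per_page) - len(acked),
    }


# Four pages over two nights; one was never acknowledged.
report = health_report([4.0, 8.0, None, 6.0], nights_covered=2)
print(report["pages_per_night"], report["avg_ack_minutes"], report["never_acknowledged"])
```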

Getting here takes tooling, process, and culture working together. The tooling is the easy part — AlertFlow handles the scheduling, routing, and escalation automatically. The process and culture require you.
