SRE book notes: Being On-Call
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep their services reliable and available.
These are the notes from Chapter 11: Being On-Call from the book Site Reliability Engineering, How Google Runs Production Systems.
This is a post of a series. The previous post can be seen here:
The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems.
The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.
We strongly believe that the "E" in "SRE" is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset, it’s important to reduce the stress related to being on-call.
The “latter mindset” mentioned here refers to the way of thinking on being rational, focused, and deliberate cognitive functions.
Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences—including fear—that can impair cognitive functions and cause suboptimal decision making.
It’s always good to see content like this supported by scientific research.
Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.
The most important on-call resources are:
Clear escalation paths
Well-defined incident-management procedures
A blameless postmortem culture
Mistakes happen, and software should make sure that we make as few mistakes as possible. Recognizing automation opportunities is one of the best ways to prevent human errors.
Noisy alerts that systematically generate more than one alert per incident should be tweaked to approach a 1:1 alert/incident ratio. Doing so allows the on-call engineer to focus on the incident instead of triaging duplicate alerts.
Being out of touch with production for long periods of time can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.