These are my notes on Part 3, "Practices", of the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
We can characterize the health of a service—in much the same way that Abraham Maslow categorized human needs—from the most basic requirements needed for a system to function as a service at all to the higher levels of function—permitting self-actualization and taking active control of the direction of the service rather than reactively fighting fires.
Without monitoring, you have no way to tell whether the service is even working; you want to be aware of problems before your users notice them.
On-call support is a tool we use to achieve our larger mission and remain in touch with how distributed computing systems actually work (and fail!).
Once we understand what tends to go wrong, our next step is attempting to prevent it, because an ounce of prevention is worth a pound of cure. Test suites offer some assurance that our software isn’t making certain classes of errors before it’s released to production.
The details of the solution you choose to implement are necessarily specific to your service and your organization. Responding effectively to incidents, however, is something applicable to all teams.
Building a blameless postmortem culture is the first step in understanding what went wrong (and what went right!).
By coincidence, while reading the SRE book, I also started reading (or rather, listening to) another book: Nonviolent Communication.
TL;DR: according to the Wikipedia page, nonviolent communication “is a method designed to increase empathy and improve the quality of life of those who utilize the method and the people around them”.
It couldn’t have come at a better time!
I believe using this method when writing postmortems could go a long way toward removing blame from individuals and keeping the focus on the systemic issues that need to be addressed.
An interesting read on capacity planning: https://www.usenix.org/system/files/login/articles/login_feb15_07_hixson.pdf