SRE book notes: Monitoring Distributed Systems
This chapter offers guidelines for what issues should interrupt a human via a page.
These are the notes from Chapter 6: Monitoring Distributed Systems from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
You should never trigger an alert simply because "something seems a bit weird."
Effective alerting systems have good signal and very low noise.
To keep noise low and signal high, the elements of your monitoring system that direct to a pager need to be very simple and robust. Rules that generate alerts for humans should be simple to understand and represent a clear failure.
Your monitoring system should address two questions: what’s broken, and why?
The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause.
"What" versus "why" is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
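As a rough illustration of the rule above, the following sketch checks one snapshot of the four golden signals and reports which ones would justify a page. The `GoldenSignals` type, the threshold constants, and their values are all hypothetical and chosen only for the example. Treating very low traffic as a problem is also an assumption of this sketch, not something the chapter prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GoldenSignals:
    """One snapshot of the four golden signals (hypothetical shape)."""
    latency_p99_ms: float    # latency: 99th percentile request time
    requests_per_sec: float  # traffic: demand placed on the service
    error_rate: float        # errors: fraction of failed requests, 0..1
    utilization: float       # saturation: how "full" the service is, 0..1


# Hypothetical thresholds; real values depend on the service.
LATENCY_P99_LIMIT_MS = 500.0
MIN_EXPECTED_RPS = 1.0
ERROR_RATE_LIMIT = 0.01
UTILIZATION_TARGET = 0.80  # page when *nearly* saturated, not at 100%


def signals_to_page(s: GoldenSignals) -> List[str]:
    """Return the names of the signals that are currently problematic."""
    problems = []
    if s.latency_p99_ms > LATENCY_P99_LIMIT_MS:
        problems.append("latency")
    if s.requests_per_sec < MIN_EXPECTED_RPS:
        problems.append("traffic")
    if s.error_rate > ERROR_RATE_LIMIT:
        problems.append("errors")
    if s.utilization >= UTILIZATION_TARGET:
        problems.append("saturation")
    return problems
```

Each rule is a single comparison, which keeps the paging logic simple to understand and ties every page to a clear failure.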
Because an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
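One way to follow this advice is to report latency separately for successful and failed requests, so that errors neither skew the overall figure nor disappear from it. The sketch below is a minimal example under assumed inputs: the `latency_by_outcome` helper and the `(status_code, latency_ms)` tuple format are hypothetical, and "failed" is taken to mean any 5xx status.

```python
from statistics import median


def latency_by_outcome(requests):
    """Split latencies by outcome and return the median of each group.

    `requests` is an iterable of (status_code, latency_ms) tuples.
    A group with no samples reports None.
    """
    ok = [lat for status, lat in requests if status < 500]
    failed = [lat for status, lat in requests if status >= 500]
    return {
        "success_median_ms": median(ok) if ok else None,
        "error_median_ms": median(failed) if failed else None,
    }


# Hypothetical sample: one fast error and one very slow error.
sample = [(200, 100.0), (200, 120.0), (200, 140.0), (500, 5.0), (500, 3000.0)]
result = latency_by_outcome(sample)
```

Keeping the error series visible makes a slow error stand out instead of being filtered away along with the fast ones.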
Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
Saturation can be defined as how "full" your service is.
Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
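A prediction like the one quoted above can be approximated by extrapolating recent growth. The following is a minimal sketch that assumes linear growth between two disk usage samples. The `hours_until_full` function and its parameters are hypothetical, and a real system would likely smooth over many more samples.

```python
def hours_until_full(used_then_gb, used_now_gb, hours_between, capacity_gb):
    """Estimate hours until the disk is full, assuming linear growth.

    Returns None when usage is flat or shrinking, since no fill time
    can be projected in that case.
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    growth_per_hour = (used_now_gb - used_then_gb) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used_now_gb) / growth_per_hour


# Usage grew from 700 GB to 800 GB over 2 hours on a 1000 GB disk:
# 50 GB per hour of growth with 200 GB remaining.
eta = hours_until_full(700.0, 800.0, 2.0, 1000.0)
```

With these sample numbers the estimate comes out to 4 hours, which matches the shape of the warning in the text.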
It’s important that decisions about monitoring be made with long-term goals in mind. Every page that happens today distracts a human from improving the system for tomorrow.