SRE book notes: Introduction to Site Reliability Engineering
Site Reliability Engineering, How Google Runs Production Systems
Encouraged by my manager at GitLab, Rachel Nienaber, I’m taking notes on the book Site Reliability Engineering, How Google Runs Production Systems. I decided to share the quotes I find most interesting here and, eventually, some comments with my own thoughts and perspectives as well.
This is the first post of a series, so stay tuned. You’re welcome to join in via the comments; I’d love to know your thoughts.
Without further ado, here are the notes from the first chapters:
when systems are “reliable enough,” we instead invest our efforts in adding features or building new products.
even though a small organization has many pressing concerns and the software choices you make may differ from those Google made, it’s still worth putting lightweight reliability support in place early on, because it’s less costly to expand a structure later on than it is to introduce one that is not present.
the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.
In my own experience, companies seldom worry about maintenance, be it the quality of the systems or the cost of keeping everything running.
Do they need a cultural shift? Someone to defy the status quo? Better-prepared professionals? More knowledge? Courage? A bit of all of the above?
please bear the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it.
Hope is not a strategy!
I love this one!
As Murphy’s law states: “If anything can go wrong, it will.”
SRE is what happens when you ask a software engineer to design an operations team.
In practice, SREs are also engineers: they do not just maintain the systems and keep them running, they also build them.
More below.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc.
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.
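To make the 50% cap concrete for myself, here is a minimal sketch of how a team could check where its time goes. The category names, the hours, and the helper functions are all my own hypothetical choices; the book does not prescribe any particular tool or calculation.

```python
OPS_CAP = 0.50  # the 50% cap on "ops" work quoted above


def ops_fraction(hours_by_category, ops_categories):
    """Return the share of total hours spent on ops work (0.0 to 1.0)."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(
        hours
        for category, hours in hours_by_category.items()
        if category in ops_categories
    )
    return ops / total


def over_cap(hours_by_category, ops_categories, cap=OPS_CAP):
    """True when ops work takes more than the allowed share of time."""
    return ops_fraction(hours_by_category, ops_categories) > cap


# Hypothetical week for one team; the numbers are made up for illustration.
week = {"tickets": 12, "on-call": 10, "manual tasks": 4, "development": 14}
ops = {"tickets", "on-call", "manual tasks"}

print(f"ops share: {ops_fraction(week, ops):.0%}, over cap: {over_cap(week, ops)}")
```

In this made-up week, 26 of 40 hours go to ops work, so the team is above the cap and, following the book's reasoning, should push some of that load back into engineering.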
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
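To get a feel for how small that last 0.001% is, I find it helps to turn availability targets into minutes of downtime per year. The sketch below is my own arithmetic, not something from the book, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available -> {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the whole yearly budget is a little over five minutes, which is the gap the quote says users cannot notice among all the other sources of unavailability.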
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
The book is full of food for thought. It’s not just about Google. The ideas and practices presented are valuable to all software engineers out there.
I’m enjoying every single chapter. Keep an eye out for new posts; more will follow regularly as I progress through the book.