SRE book notes: Embedding an SRE to Recover from Operational Overload
Remember that your job is to make the service work
These are my notes on Chapter 30, "Embedding an SRE to Recover from Operational Overload", from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
Your job while embedded with the team is to articulate why processes and habits contribute to, or detract from, the service's scalability.
Remember that your job is to make the service work, not to shield the development team from alerts.
After scoping the dynamics and pain points of the team, lay the groundwork for improvement through best practices like postmortems and by identifying sources of toil and how to best address them.
Sort the team fires into toil and not-toil. When you're finished, present the list to the team and clearly explain why each fire is either work that should be automated or acceptable overhead for running the service.
Your first goal for the team should be writing a service level objective (SLO), if one doesn't already exist. The SLO is important because it provides a quantitative measure of the impact of outages, in addition to how important a process change could be.
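To make the "quantitative measure" idea concrete, here is a minimal sketch of how an availability SLO turns an outage into a number. It is not taken from the book: the function names, the request-based SLO, and the sample figures are all assumptions chosen for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window while still meeting the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_consumed(failed_requests: int, slo_target: float, total_requests: int) -> float:
    """Fraction of the window's error budget used up by the failed requests."""
    return failed_requests / error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # Assumed figures: a 99.9% SLO over 1,000,000 requests, and an outage
    # that failed 250 of them.
    share = budget_consumed(failed_requests=250, slo_target=0.999, total_requests=1_000_000)
    print(f"Outage consumed {share:.1%} of the error budget")
```

With a number like this, the team can compare outages against each other and estimate how much a proposed process change would have saved, which is the role the chapter gives the SLO.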
Once your embedded assignment concludes, you should remain available for design and code reviews. Keep an eye on the team for the next few months to confirm that they're slowly improving their capacity planning, emergency response, and rollout processes.
As you might have guessed already, the recommendations quoted above apply not just to SREs but to any experienced engineer who needs to mentor or guide one or more teams at their company.
If you haven't been in this situation before, pay attention to them, as they might come in handy when the time comes.