These are my notes on Chapter 27, "Reliable Product Launches at Scale," from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
Experience has demonstrated that engineers are likely to sidestep processes that they consider too burdensome or as adding insufficient value—especially when a team is already in crunch mode, and the launch process is seen as just another item blocking their launch. For this reason, LCE must optimize the launch experience continuously to strike the right balance between cost and benefit.
LCE = Launch Coordination Engineering
In a large organization, engineers may not be aware of available infrastructure for common tasks (such as rate limiting). Lacking proper guidance, they’re likely to re-implement existing solutions. Converging on a set of common infrastructure libraries avoids this scenario, and provides obvious benefits to the company: it cuts down on duplicate effort, makes knowledge more easily transferable between services, and results in a higher level of engineering and service quality due to the concentrated attention given to infrastructure.
Reimplementing existing solutions is also common in companies with a heterogeneous set of tools and programming languages. This problem is usually solved by sticking to a small set of technologies.
Sticking to a single programming language or tool can be an option, since it gives all teams a common ground, but the decision needs to be weighed carefully. As the company grows, different problems will come with different requirements, and one tool may not fit them all.
Intentionally pick a small set of tools whose trade-offs are well understood. Hype is not a strategy.
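To make the rate limiting example from the quote above more concrete, here is a minimal sketch of the kind of helper a shared infrastructure library might offer so that teams do not each write their own. It is not from the book: the `TokenBucket` class, its parameters, and the injectable `clock` are all hypothetical names chosen for illustration, and the token-bucket algorithm is just one common way to do rate limiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter (hypothetical shared helper, not from the book)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock         # injectable so tests can control time
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits in the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the burst capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A service would create one bucket per client or per endpoint and call `allow()` before doing the work, rejecting or queueing the request when it returns `False`. This sketch is single-threaded; a real shared library would likely also need locking and some form of distributed state.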
When products are successful far beyond any early estimates, and their usage increases by more than two orders of magnitude, keeping pace with their load necessitates many design changes. Such scalability changes, combined with ongoing feature additions, often make the product more complex, fragile, and difficult to operate. At some point, the original product architecture becomes unmanageable and the product needs to be completely rearchitected. Rearchitecting the product and then migrating all users from the old to the new architecture requires a large investment of time and resources from developers and SREs alike, slowing down the rate of new feature development during that period.
A Collection of Best Practices for Production Services