SRE book notes: Managing Critical State, Distributed Consensus for Reliability

Remember that the most important measure of performance is client perception

Feb 09, 2023

These are notes on Chapter 23, Managing Critical State: Distributed Consensus for Reliability, from the book Site Reliability Engineering: How Google Runs Production Systems.

brown game pieces on white surface — Photo by Markus Spiske on Unsplash

This post is part of a series. The previous post can be seen here:

Bit Maybe Wise

SRE book notes: Addressing Cascading Failures


Hercules Merscher

This post is rather short because the chapter dives into many complex topics, such as the CAP theorem and consensus algorithms like Raft and Paxos. Any attempt to summarize them in a few sentences is doomed to fail. The chapter and its references are a must-read if you intend to manage state in a distributed manner.

In fact, many distributed systems problems turn out to be different versions of distributed consensus, including master election, group membership, all kinds of distributed locking and leasing, reliable distributed queuing and messaging, and maintenance of any kind of critical shared state that must be viewed consistently across a group of processes. All of these problems should be solved only using distributed consensus algorithms that have been proven formally correct, and whose implementations have been tested extensively. Ad hoc means of solving these sorts of problems (such as heartbeats and gossip protocols) will always have reliability problems in practice.
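To make the warning about ad hoc approaches concrete, here is a minimal sketch of one property that consensus algorithms rely on: a leader may only be chosen by a strict majority of the cluster. This is not Paxos or Raft, and the function names and the "lowest node id wins" rule are assumptions made up for illustration. It only shows why a heartbeat-style rule can produce two leaders during a network partition, while a majority rule cannot.

```python
def heartbeat_leaders(partitions):
    """Ad hoc rule (hypothetical): every group of nodes that can still
    exchange heartbeats elects its lowest-numbered member as leader."""
    return [min(group) for group in partitions]


def quorum_leaders(partitions, cluster_size):
    """Majority rule: a group may elect a leader only if it contains
    a strict majority of the whole cluster."""
    majority = cluster_size // 2 + 1
    return [min(group) for group in partitions if len(group) >= majority]


# Five nodes split by a network partition into two groups.
partitions = [{1, 2}, {3, 4, 5}]

# The heartbeat rule yields one leader per side: a split brain.
split_brain = heartbeat_leaders(partitions)

# The majority rule yields at most one leader, because two disjoint
# groups cannot both hold more than half of the nodes.
single_leader = quorum_leaders(partitions, cluster_size=5)
```

With the partition above, the heartbeat rule elects nodes 1 and 3 at the same time, while the majority rule elects only node 3. A real consensus implementation adds much more (terms or ballots, durable logs, membership changes), which is why the chapter insists on formally proven algorithms and well-tested implementations.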

When making decisions about location of replicas, remember that the most important measure of performance is client perception.
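As a rough illustration of what client perception means for replica placement, the sketch below estimates the write latency a client sees under a leader-based protocol. It assumes a simplified model that is not taken from the chapter: one round trip from the client to the leader, plus one round trip from the leader to the slowest follower needed to complete a majority. The region names and round-trip times are hypothetical.

```python
def commit_latency_ms(client, leader, replicas, rtt):
    """Estimate client-perceived write latency in milliseconds.

    Simplified model (an assumption): client -> leader round trip, plus the
    leader's round trip to the slowest follower it needs for a majority.
    """
    majority = len(replicas) // 2 + 1
    acks_needed = majority - 1  # the leader counts as one vote itself
    follower_rtts = sorted(rtt[leader][r] for r in replicas if r != leader)
    quorum_wait = follower_rtts[acks_needed - 1] if acks_needed > 0 else 0
    return rtt[client][leader] + quorum_wait


# Hypothetical round-trip times between regions, in milliseconds.
rtt = {
    "us":   {"us": 1,   "eu": 80,  "asia": 150},
    "eu":   {"us": 80,  "eu": 1,   "asia": 220},
    "asia": {"us": 150, "eu": 220, "asia": 1},
}
replicas = ["us", "eu", "asia"]

# Latency seen by a client in "eu" for each possible leader location.
latencies = {
    leader: commit_latency_ms("eu", leader, replicas, rtt)
    for leader in replicas
}
best_leader = min(latencies, key=latencies.get)
```

Under these made-up numbers, a client in "eu" sees roughly 81 ms with a leader in "eu", 160 ms with a leader in "us", and 370 ms with a leader in "asia". The same cluster can therefore look fast or slow depending on where the clients sit relative to the leader and its quorum.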

Whenever you see leader election, critical shared state, or distributed locking, think about distributed consensus: any lesser approach is a ticking bomb waiting to explode in your systems.
