Name: Seeking SRE: Conversations about Running Production Systems at Scale
Rating: 4.18 (4 reviews)
ISBN: 9781491978825

Oleg

11 reviews, 2 followers

August 21, 2019

After the first book, which described how Google runs its production systems, the term SRE became hyped and mainstream. Thus, it is no surprise that a second book followed, focused on technical implementation with real examples of the topics mentioned in the first one, such as configuration management and proper SLO/SLI definition. Opinions in the industry have been split, from "Wow, let's do it" to "only big companies like Google can implement this grade of production sanity in their processes".
As usual, the truth is in the middle.
The book "Seeking SRE" is a very good attempt to understand what is behind the term SRE and how it is adopted by other key players such as Amazon, Facebook, Netflix, and Dropbox, which run production systems at a global scale. Besides the technical aspects, there are chapters dedicated to the human aspects of being an SRE, which cover skills, burnout, mindset, work-life balance, and role in the organization. Despite the title, those could be applied to any team or person running production systems at scale.
The book consists of a set of interviews with SREs or people with a similar mindset (such as PEs, Production Engineers at Facebook). They tell their stories of successes, failures, and lessons learned from running production systems.
It comes as no surprise that other successful companies have a set of practices that allow them to run production systems within a defined SLA. The key factor is that engineers, managers, business people, and developers all speak one "language" when talking about the reliability of production systems. That makes it possible to understand quickly what can be improved, and this process is continuous and based on data (metrics). An example of such a metrics flow is below:

[Figure: example of a metrics flow (image not available)]
Digging into production metrics is rated very highly and is considered a very important practice. They believe that organizations must not aim for 100% availability but should instead build resilient, scalable, and highly available systems: systems which can recover to the last known state within seconds or minutes.
From this belief arose Chaos Engineering, which tests whether a production service can recover quickly with no or minimal impact on users. To achieve that, there is usually a central engineering department that analyzes business requirements and decides how to implement them. In contrast to other big EE organizations, the design of the implementation is done by SREs (the teams who are RUNNING the production systems and know best how to implement it). Of course, achieving consolidation around processes and tooling similar to Google or Facebook may be considered "overkill". In this light, the transformation story of Spotify is amazing and shows that the company goes through a continuous learning process: from central Ops team(s), to the Ops-in-Squad model, to the "Golden Path" model. Golden Path aims to offer the easiest way to bring new applications and services into production, to remove operational burden, and to bring consistency, while still keeping the dev teams responsible for their code in production. To keep the central organization on track and prevent it from turning into a bottleneck and showstopper, feedback is crucial. A set of metrics is defined and reviewed to understand whether velocity, productivity, people's satisfaction, etc. are increasing over time.

Another big pillar of the SRE philosophy is the definition of SLIs/SLOs (Service Level Indicators and Service Level Objectives) and how they are defined compared to the classical approach: define the desired system state in terms of business needs, with a commitment from all involved teams to maintain it. There is a great chapter dedicated to this from an Amazon SRE.
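To make the SLI/SLO distinction a bit more concrete, here is a minimal sketch of my own; it is not taken from the book. It assumes a request-based availability SLI and a 99.9% target, and the service name, the numbers, and the "error budget" helper are all hypothetical illustrations. The point is only that the SLI is a measured value, the SLO is the agreed target for it, and a target below 100% leaves a small budget of allowed failures.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical availability objective over a window of requests."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing was violated (a convention, not a rule)
    return good_requests / total_requests


def error_budget_remaining(slo: SLO, good_requests: int, total_requests: int) -> float:
    """Fraction of the allowed failures still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no budget at all.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical numbers: one million requests, 500 of them failed.
slo = SLO(name="checkout-availability", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(slo, 999_500, 1_000_000)
print(f"SLI={sli:.4%}, SLO met: {sli >= slo.target}, budget left: {remaining:.0%}")
```

With a shared definition like this, developers, SREs, and business people can all look at the same numbers when they discuss whether a service is reliable enough.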
The relation between DevOps and SRE is carefully studied in the book.
A few quotes are below:
"At PayPal, we believe that site reliability engineers are both the ultimate enablers as well as the ideal practitioners of DevOps. "

"DevOps is for newer teams to Agile that need to improve tooling and culture. SRE is for established Agile teams that are looking to improve uptime, monitoring, sanity, and peace of mind. "

"DevOps can and should be implemented in every organization adopting Agile or an iterative way of working so as to close the feedback loop cycle to product owners or those owning the backlog of features to ensure service management waste is known and addressed. SRE’s are compatible with organizations that have or seek to foster an engineering culture and eliminate waste and find efficiencies with engineering outcomes. Both can coexist at the same time in the same organization when the organization is big enough or old enough. "

I'd like to finish this short book review by sharing two practices, both of which have dedicated chapters, and their potential benefits. The first one is the blameless postmortem. The key factor here is that it is shared organization-wide, so everyone can read and learn from it. Ideally, it follows a single template to simplify writing and reading.
The second one is keeping documentation in the code repository. This makes it faster to find the runbooks or documentation you need. It reduces the number of outdated documents and adds comprehensive versioning and reviewing. In other words, documentation should be treated the same way as code.
Thanks for reading

Seeking SRE: Conversations about Running Production Systems at Scale

David N. Blank-Edelman
