Organizations big and small have started to realize just how crucial system and application reliability is to their business. They've also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace. Site Reliability Engineering (SRE) is a proven approach to this challenge.
SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O'Reilly book that described Google's creation of the discipline and the implementation that's allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.
Listen as engineers and other leaders in the field discuss:
Different ways of implementing SRE and SRE principles in a wide variety of settings
How SRE relates to other approaches such as DevOps
Specialties on the cutting edge that will soon be commonplace in SRE
Best practices and technologies that make practicing SRE easier
The important but rarely explored human side of SRE
David N. Blank-Edelman is the book's curator and editor.
After the first book, which described how Google runs its production systems, the term SRE became hyped and mainstream. It is therefore no surprise that a second book followed, focusing on technical implementation with real examples of the topics mentioned in the first one, such as configuration management and defining the right SLOs/SLIs. Opinions in the industry have been split, from "Wow, let's do it" to "only big companies like Google can implement this grade of production sanity in their processes". As usual, the truth is somewhere in the middle.
The book "Seeking SRE" is a very good attempt to understand what is behind the term SRE and how it has been adopted by other key players such as Amazon, Facebook, Netflix, and Dropbox, which run production systems at a global scale. Besides the technical aspects, there are chapters dedicated to the human aspects of being an SRE, covering skills, burnout, mindset, work-life balance, and role in the organization. Despite the title, these could apply to any team or person running production systems at scale.
The book consists of a set of interviews with SREs or people with a similar mindset, such as Production Engineers (PE) at Facebook. They tell their stories of the successes, failures, and lessons learned from running production systems. It comes as no surprise that other successful companies have a set of practices that allow them to run production systems within a defined SLA. The key factor is that engineers, managers, business people, and developers all speak one "language" when talking about the reliability of production systems. That makes it possible to understand quickly what can be improved, and this process is continuous and based on data (metrics). An example of such a metrics flow is below:
[Figure: Untitled picture.png]
Digging into production metrics is rated very highly and is considered a very important practice. These companies believe that organizations must not aim for 100% availability, but should instead build resilient, scalable, and highly available systems: systems that can recover to the last known state within seconds or minutes. From this belief arose Chaos Engineering, which tests whether a production service can recover quickly with minimal or no impact on users. To achieve that, there is usually a central engineering department, which analyzes business requirements and decides how to implement them. In contrast to other big EE organizations, the implementation of such requirements is designed by SREs (the teams who are RUNNING the production systems and know best how to implement them).
Of course, consolidating around processes and tooling to the degree that Google or Facebook do may be considered "overkill". In this light, the transformation story of Spotify is amazing and shows a company going through a continuous learning process: from central Ops team(s), to the Ops-in-Squad model, to the "Golden Path" model. Golden Path aims to offer the easiest way to bring new applications and services into production, to remove operational burden, and to bring consistency, while still keeping the dev teams responsible for their code in production. To keep the central organization on track and prevent it from turning into a bottleneck and showstopper, feedback is crucial. A set of metrics is defined and reviewed to understand whether velocity, productivity, people's satisfaction, etc. are increasing over time.
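As a rough illustration of why 100% availability is not a practical target, here is a back-of-the-envelope sketch in Python. It is my own example, not something taken from the book; the function name and the 30-day window are assumptions chosen for illustration. It converts an availability target into the downtime that target permits over a period.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over `days` for an availability target.

    Hypothetical helper for illustration only; the 30-day default is an assumption.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    # The share of time that may be unavailable is (100 - target) percent.
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.2f} minutes of downtime")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, while a 100% target leaves none at all, so any single incident would break it. That is the arithmetic behind investing in fast recovery rather than chasing perfect uptime.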
Another big pillar of the SRE philosophy is the definition of SLIs and SLOs (Service Level Indicators and Service Level Objectives), and how that definition differs from the classical approach: the desired system state is defined in terms of business needs, with a commitment from all involved teams to maintain it. There is a great chapter dedicated to this from Amazon SRE. The relation between DevOps and SRE is carefully studied in the book. A few quotes are below:
"At PayPal, we believe that site reliability engineers are both the ultimate enablers as well as the ideal practitioners of DevOps."
"DevOps is for newer teams to Agile that need to improve tooling and culture. SRE is for established Agile teams that are looking to improve uptime, monitoring, sanity, and peace of mind."
"DevOps can and should be implemented in every organization adopting Agile or an iterative way of working so as to close the feedback loop cycle to product owners or those owning the backlog of features to ensure service management waste is known and addressed. SRE’s are compatible with organizations that have or seek to foster an engineering culture and eliminate waste and find efficiencies with engineering outcomes. Both can coexist at the same time in the same organization when the organization is big enough or old enough."
I'd like to finish this short book review by sharing two practices, both of which have dedicated chapters, and their potential benefits. The first is the blameless postmortem. The key factor here is that it is shared organization-wide, so everyone can read and learn from it. Ideally, it follows a single template to simplify writing and reading. The second is keeping documentation in the code repository. This makes it faster to find the runbooks or documentation you need, reduces the number of outdated documents, and adds comprehensive versioning and reviewing. In other words, documentation should be treated the same way as code.
Thanks for reading
Excellent book. Some of the later chapters go too deep into technologies and scaling issues but all the first chapters are totally worth it. I have a lot of notes to start some discussions in the office.
Too messy. The book doesn't have a single flow (because each author writes in their own way), which makes the content horribly inconsistent. Some chapters are good but others are so boring. A lot of repetition, and a lot of side talk.
This book is a collection of over 30 essays on Site Reliability Engineering (SRE) by a collection of different authors.
Some of the essays are very well-informed and genuinely insightful. Others are not. Many essays are just the authors projecting their very narrow experience across a much vaster industry. For instance, some ideas in some of these essays only apply to big tech companies. Some of the more interesting essays come from authors with broader experience.
Part of the reason I picked up this book was to learn more about SRE outside of Google. I learned that there is no common definition of SRE but there are some rhyming themes. There are plenty of ideas to engage with (some better than others) and reading this book has definitely changed (and hopefully improved) the way I think about operating software.
I think the book would have really benefited from some more discerning curation. If you took out the weaker half of these essays, the book would be far better and closer to something I feel I could really recommend to others. As it stands, it's a real mixed bag and a massive drag to read in full.