Site Reliability Engineering: How Google Runs Production Systems

Name: Site Reliability Engineering: How Google Runs Production Systems
Rating: 4.21 (274 reviews)
ISBN: 9781491951170

Betsy Beyer, Chris Jones, Jennifer Petoff

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient--lessons directly applicable to your organization.

This book is divided into four sections:
• Introduction--Learn what site reliability engineering is and why it differs from conventional IT industry practices
• Principles--Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
• Practices--Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
• Management--Explore Google's best practices for training, communication, and meetings that your organization can use

Genres: Technology, Programming, Computer Science, Technical, Software, Nonfiction, Engineering

552 pages, Kindle Edition

First published April 16, 2016

1918 people are currently reading

7919 people want to read

About the author

Betsy Beyer

9 books · 36 followers

Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Community Reviews

5 stars: 1,203 (41%)
4 stars: 1,188 (41%)
3 stars: 430 (14%)
2 stars: 57 (1%)
1 star: 14 (<1%)

Displaying 1 - 30 of 274 reviews

Simon Eskildsen

215 reviews · 1,146 followers

May 7, 2016

Much of the information on running production systems effectively from Google has been extremely important to how I have changed my thinking about the SRE role over the years—finally, there's one piece that has all of what was previously something you'd had to look long and hard for in various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers a classic problem from having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own—building from first principles each time, instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical—this confuses the narrative of the book, and it ends up not excelling at either of them. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, this book is still a goldmine of information to a 5/5—but it is exactly that, a goldmine that you'll have to put a fair amount of effort into dissecting to retrieve the most value from, because the book's structure doesn't hand it to you—that's why we land at a 3/5. When recommending this book to coworkers, which I will, it will be chapters from the book—not the book at large.

Mircea

69 reviews · 11 followers

May 31, 2016

Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. The book has some interesting stories, and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited the chapter? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them actually) whose sole connection is that they wrote a chapter for this book. F the reader, F structure, F focusing on the core of the issue. Let's just dump a stream-of-consciousness kind of junk and after that tell everyone how hard it is and how we care about work-life balance. Again, boring, and in general you're gonna waste your time reading this (unless you want to know what Borg, Chubby and Bigtable are).

Michael Scott

778 reviews · 157 followers

April 23, 2016

Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or used to think, a few years back) about its datacenters. If you're interested in this topic, you have to read this book. Period.

Structure
The book is divided into four main parts, each comprised of several essays. Each essay is authored by what I assume is a Google engineer, and edited by one of Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. (I just hope that what I didn't like about the book can be attributed to the editors, because I really didn't like some stuff in here.)

In Part I, Introduction, the authors introduce Google's Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Truly impressive achievement, no doubt about it!) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a simple service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter.

In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly (why not super-linearly as well?!?!) with services, yet can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the layman but still with enough technical content to be interesting even for the expert (practitioner or academic).

In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering in the SRE team, to load-balancing and overload management (resource management and scheduling 101), communication between SRE engs, etc. etc. etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. Overall, palatable material, but spread too thin and with too much overlap with prior related work of a decade ago, especially academic, and not much new insight.

What I liked

I especially liked Part II, which in my view is one of the best introductions to datacenter management available today to the students of this and related topics (e.g., applied distributed systems, cloud computing, grid computing, etc.)
Some of the topics addressed, such as risk and team practices, are rather new for many in the business. I liked the approach proposed in this book, which seemed to me above and beyond the current state-of-the-art.
Topics in reliability (correlated failures, root-cause analysis) and scheduling (overload management, load balancing, architectural issues, etc.) are currently open in both practice and academia, and this book emphasizes in my view the dearth of good solutions but for the simplest of problems.
Many of the issues related to automated monitoring and incident detection could lead in the future to better technology and much innovation, so I liked the prominence given to these topics in this book.

What I didn't like
I thoroughly disliked the statements claiming by omission that Google has invented most of the concepts presented in the book, which of course in the academic world would have been promptly sent to the reject pile. As an anecdote, consider the sentence "Ben Treynor Sloss, Google’s VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn’t very useful if nobody can use it!" I'll skip the discussion about who is the originator of the term SRE, and focus on the meat of this statement. By omission, it makes the reader think that Google, through its Ben Treynor Sloss, is the first to understand the importance of reliability for datacenter-related systems. In fact, this has been long known in the grid computing community. I found in just a few minutes explicit references from Geoffrey Fox (in 2005, on page 317 of yet another grid computing anthology, "service considers reliable delivery to be more important than timely delivery") and Alexandru Iosup (in 2007, on page 5 of this presentation, and again in 2009, in this course, "In today’s grids, reliability is more important than performance!"). Of course, this notion has been explored for the general case of services much earlier... anyone familiar with air and especially space flight? The list of concepts actually not invented at Goog, but about which the book implies the contrary, goes on and on...

I also did not like some of the exaggerated claims of having found solutions for the general problems. Much remains to be done, as hiring at Google in these areas continues unabated. (There's also something called computer science, whose state-of-the-art indicates the same.)

compsci-tech

Sebastian Gebski

1,217 reviews · 1,400 followers

April 26, 2016

Very uneven. Exactly what you should expect of a book in which each chapter is a separate essay written by a separate group of people :) Chapters can be grouped into the following categories:

a* solid knowledge, not really fascinating, but useful, some Google inside stories
b* fairly solid knowledge, boring due to massive repetitions or being too general
c* exciting stuff that is useless for you, because you're not Google (but still, it's exciting ;>)
d* exciting stuff that you actually may use outside of Google, sometimes with neat warstories

Sadly, it's more b than a & more c than d. But it doesn't change my opinion that this book is actually worth reading - it's one of the few books on the topic, it's based on the actual engineering perspective of a very interesting company that operates at a massive scale, and it's massively influenced by this organization's culture. Even typical Software Engineers (especially junior ones) should read it to learn that software delivery & maintenance is so much more than just simple development.

One last remark to conclude: sorry if I made a false impression, but this is NOT a technical book. It's far more about processes, communication, attitude & mindset than actual technology running under the hood.

Dimitrios Zorbas

28 reviews · 10 followers

July 26, 2017

I have so many bookmarks in this book and consider it an invaluable read. While not every project / company needs to operate at Google scale, it helps streamline the process of defining SLOs / SLAs for the occasion and establishing communication channels and practices to achieve them.

It helped me wrap my head around concepts for which I used to rely on intuition.
I've shaped processes and created template documents (postmortem / launch coordination checklist) for work based on this book.

devops

Michael Koltsov

116 reviews · 70 followers

March 3, 2017

I don’t normally buy paper books, which means that in the course of the last few years I’ve bought only one paper book even though I’ve read hundreds of books during that period of time. This book is the second one I’ve bought so far, which means a lot to me. Not to mention that Google is providing it on the Internet free of charge.

For me, personally, this book is a basis on which a lot of my past assumptions could be argued as viable solutions at the scale of Google. This book is not revealing any of Google’s secrets (do they really have any secrets?). But it’s a great start even if you don’t need the scale of Google but want to write robust and failure-resilient apps.

Technical solutions, dealing with user-facing issues, finding peers, on-call support, post-mortems, incident-tracking systems – this book has it all, though as chapters have been written by different people, some aspects are more emphasized than others. I wish some of the chapters had more gory production-based details than they do now.

My score is 5/5

James Stewart

38 reviews · 6 followers

July 21, 2016

Loads of interesting ideas and thoughts, but a bit of a slog to get through.

The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.

alper

210 reviews · 62 followers

October 14, 2024

“Left alone in production with code that has become a Gordian knot: the drama of an SRE.” Yes, if this book had been written in Turkey, that would have been its title...

I approached the book with an “Accelerate” mindset. In the flow toward production, SRE is the last step of the process. But of course we don't treat it as only the last step. SRE is an organizational unit that joins the process from quite far upstream, and the earlier it gets involved, the healthier the whole structure runs. It is not obliged to take on a responsibility. But determining that responsibility is something too. :) The book is also a nice account of how real DevOps actually lives in Google's SRE setup.

I recommend it to all platform roles: architecture, SRE, DevOps…

I have to say this or it will nag at me: these people have "postmortem reading clubs". What more could you want? :) They do postmortems, fine, everyone does, but they write them up, and then they sit down and read them so that everyone learns from them. And so that it is seen as an important step of the process rather than drudgery, they have everyone write them, not just interns and juniors. (I would make the directors write them, by the way. They have nothing to do anyway, they just lie around.) In general, if you handle all your processes with this level of care, you will go far. No surprise there, of course.

One criticism can be made of the book, again from the Accelerate perspective. It touches on microservices toward the end, but only briefly. Let me add: in the microservice world, the more responsibility you can hand over from here to the stream-aligned teams, the healthier you progress. Be enablers, folks. Raise the team's competence. Otherwise the cognitive load of this new world is too much for all of us. At Google scale I can't even imagine it. After all, the book was published in 2016. Frankly, it would be great if they added a few new chapters. But given its methodology-driven approach, which is the style I like, I can't claim it is out of date.

Let's look at the chapters worth rereading:

Chapter 3 - Embracing Risk: This is the chapter that introduces SRE in general terms.
Chapter 5 - Eliminating Toil: The system should be regularly cleared of toil. That lets SRE teams focus on more strategic work.
Chapter 15 - Postmortem Culture: Learning from Failure: This was handled well at the yellow site too. Greetings to them.

Chapters 20-27 are what I call real engineering:

Chapter 20 - Load Balancing in the Datacenter: A nice explanation of weighted round robin.
Chapter 21 - Handling Overload: I like you, “Lame Duck”.
Chapter 23 - Managing Critical State: Distributed Consensus for Reliability: “Designing Data-Intensive Applications” covers these topics well. It is good here too.
Chapter 24 - Distributed Periodic Scheduling with Cron: Paxos is covered. Frankly, I get by with ZooKeeper. 🙈🙈 Scale matters, of course. A spot-on solution for them may be “over engineering” for us.
Chapter 26 - Data Integrity: What You Read Is What You Wrote: I really liked the backup strategies. They broadened my horizons. I have worked them into my Accelerate processes.
Chapter 27 - Reliable Product Launches at Scale: Canary, feature flags…

Chapter 32 - The Evolving SRE Engagement Model: Here we see once more how flexibly they structured the system. A good chapter. Every team's and every business's SRE needs can take a different shape, and structuring accordingly is the right approach. Otherwise, as in the lessons learned from “Launch Coordination Engineering”, you add one more link to the chain of bureaucracy that blocks releases. What was our goal? To remove the friction in the flow toward production and keep that flow running smoothly and reliably. Not to obstruct it. And certainly not to add new bottlenecks.

Let's finish with this: I opened with how the book would begin if it had been written here. Let's end with how theirs begins:

"Hope is not a strategy."

accelerate

Alexander Yakushev

49 reviews · 38 followers

May 5, 2018

This book is great on multiple levels. First of all, it packs great content: the detailed explanation of how and why Google has internally established what we now call "the DevOps culture." The rationale, coupled with a hands-on implementation guide, provides incredible insight into creating and running an SRE team in your own company.
The text quality is top-notch, the book is written with clarity in mind and thoroughly edited.
I'd rate the content itself at four stars. But the book deserves the fifth star because it is a superb example of material that gives you a precise understanding of how a company (or its division) operates inside. Apparently, Google can afford to expose such secrets while not many other companies can, but we need more low-BS, to-the-point books like this to share and exchange the experience of running the most complex systems (that is, human organizations) efficiently.

management software-engineering

Regis Hattori

150 reviews · 12 followers

December 23, 2019

This book is divided into five parts: Introduction, Principles, Practices, Management, and Conclusions.

I see a lot of value in the first two parts for anyone involved in software development. They convince us of the importance of the subject with very good arguments, no matter if you are a software engineer, a product manager or even a user. This part deserves 5 stars.

After a few chapters of the Practices part, my conclusion is that this part of the book may only be useful if you are facing a specific problem and are looking for some insights, not for reading end to end. Some examples are too specific to Google and similar companies, and less useful to those that do not have the same budget, skills, and prerequisites.

In general, 3 stars is fair, but I will rate as 4 because I really liked the first 2 parts.

devops infrastructure sre

dantelk

220 reviews · 20 followers

June 17, 2025

This one is hard to evaluate. Sometimes, especially in the first quarter, I was thinking "I will give only two stars to this book". It was repetitive and monotonous, and focused on abstract or bureaucratic stuff that made me yawn a lot. The middle chapters are much more technical, and I took a lot of notes. The last sections were also not interesting for a software developer like me. Of course, when you smell the book, it is perfumed with a little bit of vanity. However, it also shows, from a high level, that arguably the world's most prestigious IT company runs its SRE business not very differently from many others.

Some concepts such as Paxos and Raft are already hard to understand, and they are not explained as well here as in some classic books on those topics (Designing Data-Intensive Applications).

I think with a good editor, this book could have delivered much better, in a more focused fashion, with fewer pages. I wouldn't say I hated the book, but there was much room to improve.

Being good engineers is something, and authoring a book is something else.

3-4 stars. I would recommend reading it. But don't worry about skipping chapters.

Alex Palcuie

9 reviews · 51 followers

March 18, 2017

I think this is the best engineering book in the last decade.

favorites

Vlad Romanenko

33 reviews · 4 followers

September 3, 2021

A very useful and fundamental work for the SRE discipline. Unsurprisingly, a chunk of the book is quite Google-specific.

available-to-read tech

Romain

932 reviews · 58 followers

July 30, 2021

This is the reference book in the field, the one that launched and (I believe) gave its name to the discipline of putting software engineering at the service of production, or operations. Before that, there were developers, who designed the applications, and those we called administrators, who deployed and supervised them in production. The problem with this model is that each side knows nothing, or almost nothing, about the other's work; the result was chaotic at best and at worst led to heated quarrels that quickly turned into trench warfare. It seems absurd, but that was the model, and it is this divide that the DevOps movement set out to close. It is hard to distinguish the two movements clearly, but I would say (and this is only my view) that DevOps can be achieved by building mixed teams (developers + administrators), whereas SREs have mixed skills (development and administration) and put one at the service of the other; Infrastructure as Code is the perfect example.

This book is a collection of articles by several authors covering various aspects of their discipline as practiced at Google. For that reason the book lacks some consistency, and the articles are uneven in interest and format. Some are fundamental, such as those on monitoring, while others are valuable mostly as experience reports. To understand a technology choice, it is more than interesting to understand the reasons that led Google to develop Borg[1], whose open source version is none other than one of the most talked-about technologies at the moment, Kubernetes. Going back to the origins of such a tool makes its ultimate goal clear: running code completely independently of the infrastructure. That need seems perfectly commendable when, like Google, you have a colossal number of servers that can sometimes be heterogeneous (I suspect that is at least the case across hardware generations). I found other articles less interesting, but that will not necessarily be true for everyone; it depends on your experience, expectations, and interests. Not everyone is Google or has the same problems to solve, but it is always interesting to learn how they were addressed, to try to understand the approach and perhaps, very modestly, draw inspiration from it. A reference book that you can, and probably should, read in small pieces, all the more so since it is available in full online. Thanks, Google.

----

[1] Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines (source: https://research.google/pubs/pub43438/).

Also published on my blog.

c_s_chaos c_s_ops technique

Tomas

Author · 2 books · 11 followers

December 25, 2017

This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent gang bang of random blog posts and often trivial information. It has roughly 10% valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems that it was a corporate collaborative ego-trip, to show potential employees how cool Google SRE is, and how majestic their scale happens to be. After reading this book, I am absolutely sure I would never ever want to work for Google.

Chris

45 reviews · 23 followers

December 7, 2016

There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style – think collection of essays rather than a unified arc – but it's really worth reading even if it requires some care to transfer to more usual environments.

Mehdi Home

52 reviews · 12 followers

August 20, 2023

A must-read for every software engineer whether interested in SRE and DevOps or not!

computer

Bjoern Rochel

401 reviews · 83 followers

August 27, 2019

A little disclaimer: My review here is more about the conceptual and organizational parts than the purely technical aspects, mostly because I manage engineering teams nowadays and these areas are the more important ones for me. This book also contains a lot of technical information on how to implement SRE that I would highly recommend for interested software engineers.

One aspect I liked in particular about SRE is the Error Budget concept, Google's way to manage the age-old conflict between product and engineering on how to distribute development effort between non-functional requirements (and especially technical debt) on one side and new features on the other. The data-driven approach, and consequently the depersonalization of this debate, seems very sane and professional to me.
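To make the error budget idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from the book: the function names, the 30-day window, and the simple "ship features while budget remains" rule are all assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window.

    slo is a fraction such as 0.999; window_days is a hypothetical default.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_ship_features(slo: float, downtime_minutes: float,
                      window_days: int = 30) -> bool:
    """Assumed policy: keep launching while some budget is left."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(can_ship_features(0.999, downtime_minutes=10.0))
    print(can_ship_features(0.999, downtime_minutes=50.0))
```

The point of the sketch is that the feature-versus-reliability debate becomes a comparison of two numbers rather than a matter of opinion.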

I also liked their emphasis on training, simulation and careful onboarding for SREs. For me this is still an area where the majority of the industry has plenty of room for improvement. Looking at what Google does here makes the rest of us look like f***ing amateurs.

Another thing that I’m almost guaranteed to steal is the idea of establishing a Production Readiness Review to ensure reliability of new products and features from multiple angles (design, security, capacity, etc.).

What I’m still trying to wrap my head around is whether having dedicated SRE teams is a good idea (in contrast to a you-build-it-you-run-it approach where every delivery team effectively owns the responsibility to reach the defined SLAs/SLOs). A principle that I like a lot is to give engineers a lot of freedom but to also make them accountable for their decisions and the software they produce. Separating out production fitness into a separate group/team sounds like it goes in the opposite direction. I can imagine that several factors play into this (standardization, active tech/stack management, skill availability, etc.) and certainly Google has carefully evolved it to where it is now, but my initial reaction to this idea was negative.

Overall a very good resource that I will come back to

2018 2019

SeyedMostafa Meshkati

69 reviews · 28 followers

October 12, 2021

I have to say, this book is my new Engineering Bible!
It has its ups and downs: some chapters will feel relevant and tangible to some readers and not to others, probably depending on experience and situation. But overall, it's a masterpiece.

favorites technical

Anita

13 reviews

March 30, 2024

If I needed to write a review in one sentence, it would be: How do you operate at scale!

Some things in the book could be more varied and exciting, but it's worth reading! It contains Google insider stories, lessons learned, and processes implemented! If you've been in the industry for a long time, many of the things in it are common sense, so you might end up nodding a lot while reading or simply finding it boring because it's nothing new.

Liviu Costea

29 reviews · 2 followers

November 16, 2019

A lot of food for thought, and a book that became a reference in the field. The only problem is the wide coverage: you might find some chapters very niche, since not everybody cares how to build a layer 4 load balancer.
Highly recommended if you are following devops approaches.

devops

Vít Listík

4 reviews · 3 followers

December 25, 2018

I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious but it is so sad to read it because it is not yet an industry standard. A must read for every SRE.

Amir Sarabadani

77 reviews · 53 followers

December 24, 2019

It's basically a looong advertisement for Google with some useful information inside, while it should be the other way around.

best-software-engineering-books

Saurav

155 reviews · 19 followers

October 10, 2025

While there is a lot of insight, the book could benefit from a bit more structure.

The contents must have been ground-breaking when the book was initially released, but they are now well known among SREs across the industry.

It is a decent read, though, for someone wanting an overview of how things are managed at scale.

Jonas Minelga

25 reviews

December 21, 2021

Very long and detailed book. The information in it is extremely valuable, but I think Google is one of maybe 2-3 companies in the world where all of that can be used. I think for a broader audience it is too detailed in some parts, duplicates info in others, and is slightly difficult to read. But overall, the book provides a lot of amazing insights and many ideas.

Ahmad hosseini

326 reviews · 73 followers

April 3, 2017

What is SRE?
Site Reliability Engineering (SRE) is Google’s approach to service management.
An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Typical SRE activities fall into the following approximate categories:
• Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
• System engineering: Involves configuring production systems, modifying configuration, or documenting systems in a way that produces lasting improvements from a one-time effort.
• Toil: Work directly tied to running a service that is repetitive, manual, etc.
• Overhead: Administrative work not tied directly to running a service.

Quotes
“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.” – Brian Redman
“Ways in which things go right are special cases of the ways in which things go wrong.” – John Allspaw

About book
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you.
“Essential reading for anyone running highly available web services at scale.” – Adrian Cockcroft, Battery Ventures, former Netflix Cloud Architect

programming software-engineering

David

93 reviews · 5 followers

July 16, 2018

The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most useful. On the other hand, the chapter on simplicity was indeed simplistic, and the chapter on data integrity was (surprisingly) disappointing.

The good: there's a lot of excellent information in this book. It's a comprehensive, thoughtful overview for anybody entering the world of distributed systems, cloud infrastructure, or network services. Despite a few misgivings, I'm pretty on board with Google's approach to SRE. It's a very thoughtful approach to the problems of operating production services, covering topics ranging from time management, prioritization, onboarding, plus all the technical challenges in distributed systems.

The bad: The book gets religious (about Google) at times, and some of it's pretty smug. This isn't a big deal, but it's likely to turn off people who've seen from experience how frustrating and unproductive it can be when good ideas about building systems become religion.

Somtochiama

17 reviews · 1 follower

March 11, 2024

Okay. It took me a while to get through this book. Good material all round and touches on base SRE principles. Just keep it in mind that some parts might not apply if you are not at Google scale.

Luke Amdor

8 reviews · 9 followers

October 16, 2017

Some really great chapters especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.

Amr

48 reviews · 13 followers

March 8, 2020

The book is great in terms of getting a better understanding of Google’s SRE culture. But I got to a point where continuing the book became irrelevant to me, so I decided to drop it.