Betsy Beyer Quotes (Author of Site Reliability Engineering)

“team size should not scale directly with service growth.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Hope is not a strategy.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

tags: engineering

“When a team must allocate a disproportionate amount of time to resolving tickets at the cost of spending time improving the service, scalability and reliability suffer.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“When standard operating procedures break down, they’ll need to be able to improvise fully.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Antagonistic neighbors: Other processes (often completely unrelated and run by different teams) can have a significant impact on the performance of your processes. We’ve seen differences in performance of this nature of up to 20%. This difference mostly stems from competition for shared resources, such as space in memory caches or bandwidth, in ways that may not be directly obvious.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“For non-Byzantine failures, the minimum number of replicas that can be deployed is three — if two are deployed, then there is no tolerance for failure of any process. Three replicas may tolerate one failure.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
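
The arithmetic behind the quote above can be sketched in a few lines. This is a minimal illustration, assuming a majority-quorum protocol and crash (non-Byzantine) failures only; the function name `tolerated_failures` is hypothetical and does not come from the book.

```python
def tolerated_failures(replicas: int) -> int:
    """Crash failures a majority-quorum group of this size can survive.

    Assumption: progress requires a strict majority of replicas to be alive.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    quorum = replicas // 2 + 1  # smallest strict majority
    return replicas - quorum    # replicas that may fail while a quorum remains


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(f"{n} replicas tolerate {tolerated_failures(n)} failure(s)")
```

Under these assumptions two replicas tolerate no failures and three tolerate one, which matches the quote.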

“One could even claim that without SLOs, there is no need for SREs.”
― Betsy Beyer, The Site Reliability Workbook: Practical Ways to Implement SRE

“In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO (see Chapter 4). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
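
One way to make the error budget in the quote above concrete is a request-based calculation. The sketch below assumes the SLO is expressed as a fraction of successful requests over the quarter; the function names and the sample figures are hypothetical illustrations, not taken from the book.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the period under the SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo, total_requests) - failed_requests


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: allow risky launches only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Hypothetical quarter: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(can_release(0.999, 1_000_000, failed_requests=400))
```

With these sample numbers the budget is roughly 1,000 failed requests for the quarter, so 400 observed failures would leave room for further launches under the assumed policy.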

“The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
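
As a rough illustration of the four signals named above, the sketch below derives them from a window of request records. Everything here is an assumption made for the example: the `Request` type, the choice of median latency, and CPU use as the saturation measure are hypothetical and not prescribed by the book.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    return {
        "latency_ms_median": median(r.latency_ms for r in requests),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "saturation": cpu_used / cpu_capacity,  # fraction of capacity in use
    }


if __name__ == "__main__":
    sample = [Request(100, True), Request(200, True),
              Request(300, False), Request(400, True)]
    print(golden_signals(sample, window_seconds=2, cpu_used=3, cpu_capacity=4))
```

A real monitoring system would typically track latency distributions rather than a single median, and would separate the latency of failed requests from successful ones.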

“Often, sheer force of effort can help a rickety system achieve high availability, but this path is usually short-lived and fraught with burnout and dependence on a small number of heroic team members.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“How can we harness the enthusiasm and curiosity in our new hires to make sure that existing SREs benefit from it?”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“John is the newest member of the FooServer SRE team. Senior SREs on this team are tasked with a lot of grunt work, such as responding to tickets, dealing with alerts, and performing tedious binary rollouts.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Hope is not a strategy”
― Betsy Beyer

tags: strategy

“SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Don’t be afraid to provide white glove customer support for early adopters to help them through the onboarding process. Sometimes automation also entails a host of emotional concerns, such as fear that someone’s job will be replaced by a shell script. By working one-on-one with early users, you can address those fears personally, and demonstrate that rather than owning the toil of performing a tedious task manually, the team instead owns the configurations, processes, and ultimate results of their technical work. Later adopters are convinced by the happy examples of early adopters.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions).”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“And taking the historical view, who, then, looking back, might be the first SRE? We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE. In her own words, “part of the culture was to learn from everyone and everything, including from that which one would least expect.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“To this end, Google always strives to staff its SRE teams with a mix of engineers with traditional software development experience and engineers with systems engineering experience.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Remember that the code path you never use is the code path that (often) doesn’t work.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Making the jump from a previous company or university, while changing job roles (from traditional software engineer or traditional systems administrator) to this nebulous Site Reliability Engineer role is often enough to knock students’ confidence down several times. For more introspective personalities (especially regarding questions #2 and #3), the uncertainties incurred by nebulous or less-than-clear answers can lead to slower development or retention problems.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“A key principle of any effective software engineering, not only reliability-oriented engineering, simplicity is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Chapter 9, Simplicity, goes into this topic in detail.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“When an engineer with years of familiarity in a problem space begins designing a product, it’s easy to imagine a utopian end-state for the work. However, it’s important to differentiate aspirational goals of the product from minimum success criteria (or Minimum Viable Product). Projects can lose credibility and fail by promising too much,”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“An upfront investment in SRE training is absolutely worthwhile, both for the students eager to grasp their production environment and for the teams grateful to welcome students into the ranks of on-call.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be? Does this additional revenue offset the cost of reaching that level of reliability?”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
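
The two questions in the quote above can be turned into a back-of-the-envelope calculation. The sketch below rests on a strong simplifying assumption, that revenue scales linearly with availability, and all names and figures in it are hypothetical.

```python
def incremental_revenue(annual_revenue: float, current: float, proposed: float) -> float:
    """Extra revenue from raising availability, assuming linear scaling."""
    if not 0.0 <= current <= proposed <= 1.0:
        raise ValueError("expected 0 <= current <= proposed <= 1")
    return annual_revenue * (proposed - current)


def worth_it(annual_revenue: float, current: float, proposed: float, cost: float) -> bool:
    """True if the assumed revenue gain exceeds the cost of the extra nine."""
    return incremental_revenue(annual_revenue, current, proposed) > cost


if __name__ == "__main__":
    # Hypothetical service: $1,000,000 per year, moving from 99.9% to 99.99%.
    print(incremental_revenue(1_000_000, 0.999, 0.9999))
    print(worth_it(1_000_000, 0.999, 0.9999, cost=5_000))
```

With these sample numbers the extra nine is worth about $900 per year, so under the linear assumption an improvement costing $5,000 would not pay for itself.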

“Investing up front in the education and technical orientation of new SREs will shape them into better engineers. Such training will accelerate them to a state of proficiency faster, while making their skill set more robust and balanced.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“It’s important to establish credibility by delivering some product of value in a reasonable amount of time. Your first round of products should aim for relatively straightforward and achievable targets — ones without controversy or existing solutions. We”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
