Site Reliability Engineering (SRE) is a relatively young discipline focused on treating operations as a software problem. Because it is so young, the SRE knowledge base is still growing. The goal is to make this book short, light and fun, but most importantly relevant. Each chapter describes one SRE concept in a short and easily digestible way, and aims to give every software engineer information they can use to increase the reliability of the systems they work on.
Topics include observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.
These concepts have been at the core of my personal SRE journey, and my hope is that you will find them valuable too!