Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests. Based on practical experiences at companies that each run more than 20,000 controlled experiments a year, the authors share examples, pitfalls, and advice for students and industry professionals getting started with experiments, plus deeper dives into advanced topics for practitioners who want to improve the way they make data-driven decisions. Learn how to:
- Use the scientific method to evaluate hypotheses using controlled experiments
- Define key metrics and, ideally, an Overall Evaluation Criterion
- Test for trustworthiness of the results and alert experimenters to violated assumptions
- Build a scalable platform that lowers the marginal cost of experiments close to zero
- Avoid pitfalls like carryover effects and Twyman's law
- Understand how statistical issues play out in practice
1. Great introduction to terminology. I've learned and solidified my knowledge of terms like:
- Overall evaluation criterion (OEC): a quantitative measure of the experiment's objectives, which can be a combination of variables or a balanced scorecard
- Randomization unit: although usually the user, there is a variety of other units that can be used
2. Twyman's law and trustworthiness of experiments: Twyman's law comes in different forms but generally holds that any figure that looks interesting or different is usually wrong. There are many reasons for this:
2.1 Misinterpretation of statistical results:
- Lack of statistical power: the experiment is underpowered (i.e., the sample size is too small) to detect the effect size of interest.
- Misinterpretation of p-values: the p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming the null hypothesis is true.
- Peeking at p-values: if you continuously look at p-values during a running experiment and stop when a result looks significant, you are effectively doing multiple hypothesis testing and greatly increasing the risk of a type I error.
- Multiple hypothesis testing: a generalization of the previous point, which also appears once you start looking at multiple metrics, segments of the population, or multiple iterations of an experiment.
2.2 Threats to internal validity of experiments:
- Violations of the stable unit treatment value assumption (SUTVA), which states that experiment units do not interfere with one another and that behavior is affected only by their own variant assignment. This can be violated in social networks, communication tools like Skype, and two-sided marketplaces.
- Survivorship bias: users who have been active for some time can be inherently different from new users.
- Sample ratio mismatch (SRM): if the ratio of users between the variants is not close to the designed ratio, the experiment suffers from a sample ratio mismatch.
2.3 Threats to external validity:
- Day-of-week and seasonality effects: users can behave differently on weekends than on weekdays, or on holidays than on non-holidays, so an experiment should always capture at least one full weekly cycle.
- Primacy effects: when a change is introduced, users primed on the old feature may need time to adapt.
- Novelty effects: a new feature attracts users who want to try it out, so a treatment effect appears at first but fades over time.
2.4 Simpson's paradox: if an experiment has multiple iterations, or two or more periods with different percentages assigned to the variants, naively combining the results can give a directionally incorrect estimate of the treatment effect. For instance, treatment may beat control in both phases yet look worse when the two periods are pooled (see the sketch below).
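To make the Simpson's paradox point concrete, here is a minimal numeric sketch with entirely made-up counts for a two-phase ramp (1% of traffic to treatment in phase 1, 50% in phase 2):

```python
# Made-up two-phase experiment: treatment wins within each phase,
# yet looks worse than control when the phases are naively pooled.
phases = {
    # phase: (treat_conversions, treat_users, control_conversions, control_users)
    "phase 1 (1% treatment)":  (110, 1_000, 9_900, 99_000),    # T 11.0% vs C 10.0%
    "phase 2 (50% treatment)": (1_100, 50_000, 1_000, 50_000), # T  2.2% vs C  2.0%
}

t_conv = t_users = c_conv = c_users = 0
for name, (tc, tn, cc, cn) in phases.items():
    print(f"{name}: treatment {tc / tn:.1%} vs control {cc / cn:.1%}")
    t_conv, t_users = t_conv + tc, t_users + tn
    c_conv, c_users = c_conv + cc, c_users + cn

# Pooling ignores the changing assignment ratio and flips the comparison.
print(f"pooled: treatment {t_conv / t_users:.1%} vs control {c_conv / c_users:.1%}")
```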
3. Interesting content on how to run concurrent experiments. One way to do so is to create multiple experiment layers on top of each other and ensure orthogonality of experiments across layers by assigning users to buckets using the layer ID (see the sketch below). You can take this further with a full factorial platform design. However, this does not avoid potential collisions, where certain treatments from two different experiments give users a poor experience if they coexist. One way to solve this is to introduce a nested or constraint-based platform design. In a nested design, system parameters are partitioned into layers so that a user is never exposed to two experiments from the same layer at the same time.
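Here is a minimal sketch of that layered bucketing idea, assuming hash-based assignment salted with a layer ID (function names, bucket counts, and the 50/50 split are illustrative, not the specific platform design from the book):

```python
import hashlib

def bucket(user_id: str, layer_id: str, num_buckets: int = 1000) -> int:
    """Deterministically map a user to a bucket within a layer.

    Salting the hash with the layer ID makes bucket assignments in different
    layers (approximately) independent, keeping concurrent experiments orthogonal.
    """
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign_variant(user_id: str, layer_id: str) -> str:
    """Illustrative 50/50 split: the lower half of the buckets gets treatment."""
    return "treatment" if bucket(user_id, layer_id) < 500 else "control"

# The same user can be in treatment in one layer and control in another.
print(assign_variant("user_42", layer_id="ui_layer"))
print(assign_variant("user_42", layer_id="ranking_layer"))
```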
4. Good content on metric taxonomies: goal (success, north star) metrics vs. driver metrics vs. guardrail metrics (both metrics that protect the business and metrics that assess the trustworthiness and validity of results). Apart from these, you can also use diagnosis or debug metrics to indicate whether there's a problem.
5. Institutional memory and meta-analysis: by storing and mining the results of all previously run experiments, you can strengthen the experimentation culture in the company, establish best practices, help guide future innovation, and use the results for metric innovation. For instance, a catalog of what worked and what didn't is highly valuable for avoiding repeated mistakes, and you can also distill from it which kinds of experiments are most effective at moving key metrics.
6. Observational causal studies when controlled experiments are not possible. A variety of methodologies is available, such as interrupted time series (e.g., using Bayesian structural time series), regression discontinuity design, instrumental variables and natural experiments (e.g., the Vietnam war draft lottery), and difference-in-differences (a minimal sketch follows below).
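As a small illustration of one of these methods, here is a difference-in-differences sketch with made-up aggregate numbers, assuming a simple two-group, two-period setup (real applications also need the parallel-trends assumption and proper standard errors):

```python
# Average metric value before and after a change, for an exposed group
# (got the change) and a comparison group (did not).
exposed_pre, exposed_post = 10.0, 13.0
comparison_pre, comparison_post = 9.5, 11.0

# DiD = (exposed_post - exposed_pre) - (comparison_post - comparison_pre)
did_estimate = (exposed_post - exposed_pre) - (comparison_post - comparison_pre)
print(f"difference-in-differences estimate: {did_estimate:+.2f}")  # +1.50
```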
7. Variance estimation: it's pivotal to estimate the variance of experiment metrics properly, since overestimated variance leads to false negatives and underestimated variance leads to false positives. There are a few situations in which the variance formula changes:
7.1 Absolute delta vs. percent delta: the percent delta is a ratio of means, so the plain variance formula does not apply; by the delta method (treating the variants as independent), var(Y_t/Y_c) ≈ (Y_t/Y_c)^2 * (var(Y_t)/Y_t^2 + var(Y_c)/Y_c^2), where Y_t and Y_c are the sample means of the treatment and control metric.
7.2 Outliers have a big impact on both the mean and the variance; however, the impact on the variance tends to outweigh the impact on the mean, so outliers mainly increase the risk of false negatives.
7.3 We want to improve our ability to detect a treatment effect, also called power or sensitivity. There are various ways to reduce variance and thereby increase power:
- Create evaluation metrics with smaller variance: the number of searches has higher variance than the number of searchers; purchase amount has higher variance than a boolean purchase indicator.
- Transform a metric through capping, binarization, or a log transformation.
- Use triggered analysis.
- Use stratification, control variates, or CUPED (see the sketch below).
- Randomize at a more granular unit, for instance moving from user to session or page.
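As a rough illustration of CUPED, here is a minimal sketch assuming the pre-experiment value of the same metric is used as the covariate, with simulated data and illustrative names (not the book's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-user data: X is the pre-experiment value of a metric,
# Y is the in-experiment value (correlated, since user behavior persists).
n = 10_000
x = rng.normal(10, 3, n)                 # pre-experiment metric
y = 0.8 * x + rng.normal(0, 2, n) + 0.1  # in-experiment metric with a small shift

theta = np.cov(x, y)[0, 1] / np.var(x)   # CUPED coefficient: theta = cov(X, Y) / var(X)
y_cuped = y - theta * (x - x.mean())     # adjusted metric: same mean, smaller variance

print(f"var(Y)       = {y.var():.2f}")
print(f"var(Y_cuped) = {y_cuped.var():.2f}")  # lower variance -> higher power
```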
8. The A/A test: the idea is simple: you split users into two groups as in a regular A/B test, but make B identical to A. If the system is operating correctly, then in repeated trials a given metric should be statistically significant with a p-value below 0.05 about 5% of the time (see the simulation below). You can use A/A tests for several purposes:
- Ensure that type I errors are controlled.
- Check whether any bias exists between control and treatment users.
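A minimal A/A simulation along these lines (simulated data and illustrative sample sizes; not code from the book):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

replications, significant = 2_000, 0
for _ in range(replications):
    a = rng.normal(10, 3, 1_000)  # "control"
    b = rng.normal(10, 3, 1_000)  # identical "treatment"
    _, p = stats.ttest_ind(a, b)
    significant += p < 0.05

# With a correctly operating system, this should be close to 0.05.
print(f"fraction of A/A tests significant at 0.05: {significant / replications:.3f}")
```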
9. Sample ratio mismatch and trust-related guardrail metrics: an SRM occurs when treatment and control end up with a different ratio of users than specified in the design (a minimal chi-square check is sketched after the interference solutions below). Common causes:
- Buggy randomization: think of ramp-up procedures, exclusions, etc.
- Data pipeline issues.
- Residual effects: after fixing a bug you may fail to re-randomize the experiment, and the bug may have been serious enough for users to abandon.
- Bad trigger conditions: think of redirects, which usually have a substantial impact on the user experience and may cause some loss of users.
- Trigger conditions based on attributes that are likely to change because of the experiment.
10. Leakage and interference between variants: interference or leakage can arise through direct and indirect connections:
- Direct connections: as in a social network, units can be directly connected as friends, or they may visit the same physical space at the same time.
- Indirect connections: two units can be indirectly connected through latent variables. Think of Airbnb: if you improve conversion for treatment users, you reduce the shared inventory, which affects control users.
Solutions here revolve around isolating treatment units from control units:
1. Splitting shared resources: for example, split the ad budget between the variants in proportion to their traffic.
2. Geo-based randomization: randomize across regions to isolate interference between treatment and control; this can reduce the effective sample size and lower power.
3. Time-based randomization: randomize over time intervals so that, within an interval, all units receive the same variant.
4. Network-cluster randomization: construct clusters of nodes that are close to each other, treat each cluster as a mega-unit, and randomize these units independently into treatment or control.
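For the SRM guardrail mentioned in point 9 above, a minimal check is a chi-square goodness-of-fit test of the observed user counts against the designed split (made-up counts; the p-value threshold is a common convention, not a prescription from the book):

```python
from scipy import stats

# Observed users per variant vs. the designed 50/50 split (made-up numbers).
observed = [50_600, 49_400]
expected = [sum(observed) / 2, sum(observed) / 2]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.1f}, p = {p_value:.5f}")

# A tiny p-value (e.g. < 0.001) signals a sample ratio mismatch:
# stop and debug before trusting any other metric from the experiment.
if p_value < 0.001:
    print("SRM detected")
```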
11. Measuring long-term treatment effects: generally we want to know whether the effect measured in a short experiment generalizes, so that the long-term effect can be taken into account in decision making. Reasons why treatment effects may differ between the short and the long term:
- User-learned effects: users learn and adapt to a change; for instance, repeated product crashes are a terrible user experience whose impact accumulates. Novelty and primacy effects also belong here.
- Network effects.
- Ecosystem changes:
  - Competitive landscape: if a competitor launches the same feature, its value may decline.
  - Concept drift: relates to machine learning models whose training data becomes stale over time.
Solutions here include:
1. Long-running experiments
2. Cohort analysis
3. Post-period analysis
4. Time-staggered treatments
5. Holdback and reverse experiments
This book is well organized and easy to read. It gives a thorough, end-to-end view of how to design a controlled experiment and avoid statistical pitfalls.
At its best, this is a semi-organized dump of tech-specific domain knowledge for people who want to run online experiments. There is some particularly useful information about good practices like A/A tests and ramp-up periods -- it would be interesting to try to port some of this knowledge back to domains where experimentation is harder and people don't have quite as many reps under their belt.
That said, the book reads like a list of bullet points that was expanded half-heartedly into a book. The examples in particular are numerous but often appear out of nowhere and go largely unexplained as to their importance. There are a lot of typos, and while I trusted the authors when they highlighted statistical issues as important to deal with, I was much less convinced that their recommended solutions were appropriate. A lot of the recommendations in this book are backed heavily by citations (to the point of excess at times), but many of these are self-citations back to business writing by Kohavi or others, which makes them a bit hard to trust.
This is a very in-depth book that explains almost every corner of how to run trustworthy A/B tests. The book illustrates the ideal approach in various contexts, and it also lists many common misconceptions and issues that might lead to incorrect A/B testing results.
I've run A/B tests for more than eight years. I still learned a lot and hated myself for not reading this book earlier. I think I might need to re-evaluate some experiments I conducted before.
It’s not an easy book, in my opinion, especially the last few chapters. The book doesn't explain many details of the most advanced topics; it just lists keywords and references that point you in the right direction. It’s better to have a book club if you want to understand the content thoroughly. Some parts of the book I read many times, checked some references, or discussed with friends in the book club before I finally understood them.
Some topics covered in this book about running an experiment to verify the lift of a new feature: • Which randomization unit (and level of stickiness) is the best choice: page view, session, or user, and what the limitations or caveats of these choices are in different contexts. • When running an A/B test on ad campaigns or an e-commerce platform, there are resources shared between variants, such as campaign budgets or commodity stock. One variant might consume all of the resource so that the other variants cannot use it. This leads to interference between variants and can produce an overestimated lift. • When it’s hard to run an A/B test, what are the alternatives, and how do you reduce bias in those methods?
Read for a work book club. Pretty solid primer on online experimentation: it covers a good deal of statistics in the beginning and then peppers case studies (mostly Google and Microsoft) throughout the text. The subject matter is inherently kind of dry, so I found myself skimming or getting bored at times. But if you're looking to learn the basics of A/B testing, this is a good start. More important, though, is applying this in a practical setting; I think I wouldn't have learned anything from this book if I had not already done some of the applied work before/during.
The HiPPO on the cover represents the “Highest Paid Person's Opinion” and the book is about making decisions in industry based on experiments instead. It includes Twyman's Law and much more and is the text to read for online A/B tests. It does have some of the character of class notes from a busy prof, with details left in other papers, and the occasional misattribution (quote page 153) and error (“due to the increase in power that comes from the increase in power” page 170). Indispensable, but some assembly required.
Excellent book covering all aspects of A/B testing. It includes discussions for both product managers and specialists. The last part delves into the mathematical/statistical concepts.
# Review
Easily the best book I've read this year, and I can't imagine anything topping it. The book distills ~20 years of hard-earned knowledge on running large experiments at internet companies.
The authors are pioneers of the internet age who were instrumental in building the large-scale experimentation platforms at Microsoft, Google, and LinkedIn. A big reason the book is so valuable is that the authors worked in industry, which constrains their approach to be practical.
The practicality is really nice, because the authors hit the main points, telling you things like:
- What's the deal with online experiments? - What are the major components? - What are the big problems in practice? - How can I approach solving them? - What are some references if I want to learn more?
Much appreciated!
## Other notes
- Prediction gets more emphasis than experiments, "especially now". Both have value, but this book helped me conceptualize the differences. I now see machine learning as more of a treatment, or a new product, or even a technique to enhance existing products. Getting trustworthy inferences is a separate endeavor.
- The skill it takes to build a large-scale experiment system, get it working properly, and scale it across a large company is phenomenal. As a statistician, I'm so happy companies like Google and Microsoft brought in talented folks like this to carry the torch of statistical thinking into the 21st century.
This is the best book on A/B testing out there! It is a comprehensive overview of all the major challenges and concepts a practitioner will encounter in the real world. It offers solutions, well developed frameworks and references for follow up reading.
The first half is a must read for anyone (PMs, designers, engineers, analysts) on teams that use (or want to start) A/B testing. The second half is a must read for analysts, engineers and data scientists building the systems and running analyses.
Even with five years of experience, I still learned a lot from this book. Highly recommend!
The book is terrible; it has absolutely no practical value.
1) The writing style is bad; it reads like a collection of bullet points 2) there is no hard data to illustrate what the authors preach about 3) 30-40% of the book is references
Seriously, basic how-to articles on towardsdatascience or stackoverflow are way more practical than this piece of shit
It seems the authors just wanted some extra money. Otherwise, I have no idea how they were able to put their names on this crap.
The book gives a short math description of A/B testing and provides a lot of examples with different pitfalls. Most of them are really interesting and difficult to catch.
This book is excellent. While plenty has been written around RCTs in general, there are fewer resources specific to online experimentation. I also thought they provided the right amount of technical depth. They weren’t afraid to discuss concepts like variance and power but it did not feel like I was reading a statistics textbook. They also covered a lot of ground in a short amount of space, including instrumentation, web performance, sample ratio mismatch (SRM), observational causal studies, spillover, and long-term effects.
Arguably the most valuable contribution is the guidance in selecting the right metric or combination of metrics, which they call the OEC (p. 103). The metric must be movable enough that a change could be detected, but also have a causal impact on the desired outcome. Your company’s total revenue is not movable, because it’s highly unlikely that a small usability change would produce a significant difference, but has a direct causal link to the desired outcome (or in this case, *is* the desired outcome). Stock price is an even less movable metric. On the flipside, “number of listing photos viewed” is very movable based on one UX change but is unlikely to have a large causal impact on revenue. It’s important to use a metric like booking conversion rate that is somewhere in the sweet spot. As an example of an OEC, you could use a weighted combination of engagement (e.g., # of listing views), revenue, and an ease-of-use metric like time-to-checkout.
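As a toy illustration of an OEC built from several metrics, here is a sketch following the review's example; the metric names, the normalization to relative changes, and the weights are entirely made up:

```python
# Toy OEC: a weighted combination of component metrics, each expressed as a
# relative change vs. control so the units are comparable. Weights are illustrative;
# time-to-checkout gets a negative weight because lower is better.
weights = {"listing_views": 0.3, "revenue": 0.5, "time_to_checkout": -0.2}

def oec_score(relative_changes: dict) -> float:
    return sum(weights[metric] * relative_changes[metric] for metric in weights)

# Hypothetical treatment effects: +2% listing views, +0.5% revenue, -3% time to checkout.
effects = {"listing_views": 0.02, "revenue": 0.005, "time_to_checkout": -0.03}
print(f"OEC score: {oec_score(effects):+.4f}")  # positive -> candidate for shipping
```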
Few books distill so much information and experience into 245 pages. I will definitely be keeping this on my desk as a reference in future roles.
Generally enjoyed this book. The authors clearly have deep expertise in their areas. I think it's a "must read" if you're responsible for analytics in your product(s), whether you're an individual contributor or a senior manager with analytics or data science team members as direct reports.
Where it might fall down a little...
* While this book certainly isn't intended to be a statistics primer, where the authors do go deep in that area they could offer just a few more sentences of background on concepts (e.g., the common t-statistics and so on). Likewise, while they do offer examples in all manner of things, they could offer a few more when they go into depth on certain theories. That being said, they mostly do this, and it's hugely helpful for understanding the concepts.
* This seems like a "pro tips" book that offers depth a lot of teams just won't have. If you're at a large corp with major resources and tons of data, this can likely help you play the game at a much higher level. What's missing, though, is more "how to" at a practical level. I'm not saying they should tell you how to set up your basic Google Analytics or Tableau or whatever; that likely wouldn't be practical. But I would have appreciated a bit more depth on practical setup, as opposed to theory, rather than jumping straight to the math.
Again, in spite of my few minor criticisms, I found a lot of value in it. Worst case, in the few areas I struggled with it, there was enough here that I was able to search or use AI to fill in any blanks or more fully explain concepts I struggled with.
This book offers a huge amount of practical insight into the practice of running A/B tests to manage and improve a web service, from practitioners who have been doing it for years. However, it is expressed through a set of loosely organized bulleted lists, with many points described imprecisely, in a few elliptic phrases or references. There is no unified notation or mathematical framework, though one can piece things together with some knowledge of statistics. The benefit, of course, is that this book tells the stories about the actual domain that more organized and mathematically unified statistics treatments don't: the many ways in which even a relatively simple web environment differs from the idealized randomized controlled trial, the sources of bugs or failures that can invalidate what should be one of the simplest possible statistical analyses, practical methods for detecting and avoiding them, plus the looser management context of deciding what to measure, why, and how. Reading through this, it felt like a lot of the difficulties stem from the fact that web traffic arrives randomly in time, a fact which the A/B framing abstracts away. I suspect that many of the situations given clever names here are in fact documented in the survival analysis and point process literature, which seems like a fruitful source of methods and formalisms for the area.
Really great, practical, and interesting overview of A/B testing. A good refresher for data scientists, with some fun stories about business decisions and real experiments. Minus one star just for the sheer number of academic citations; I understand that adds some legitimacy, but I prefer reading a more naturally flowing book than an academic paper.
Quotes: - "Shipping code is, in fact, an experiment. It may not be a controlled experiment, but rather an inefficient sequential test where one looks at the time series; if key metrics (e.g. revenue, user feedback) are negative, the feature is rolled back…. If you are willing to ship something to 100% of users, shipping to 50% with the intent of going to 100% as an experiment should be fine" - "If you incorrectly estimate the variance, then the p-values and confidence interval will be incorrect, making your conclusions from the hypothesis test wrong. Overestimated variance leads to false negatives and underestimated variance leads to false positives."
Awesome book on large-scale experimentation. Covers a lot of ground, starting with the basics of conducting a standard A/B test and going through measuring long-term impacts such as learning effects. The authors are applied statisticians who led experimentation at Microsoft, Google, and LinkedIn. There's no other book like this, and it's a must for anyone involved in digital product experimentation: product managers, analysts, and data scientists. The writing style is terse and to the point. There's a ton of info packed into the 250 pages, plus a large references section. Some sections are pretty technical, but only the most important results are highlighted. I can see myself returning to this book and the references within often.
While the beginning was frenetic and somewhat puzzling, as it was obviously written by a group of people, the book slowly starts to raise questions you normally don't ask: long-term effects, learning effects, reverse testing, A/A testing to make sure your setup isn't leaking noise and errors. I like it a lot.
The only thing I would ask to improve is the math part; some of it could be explained more. I had some algebra 10 years ago, but I'd appreciate a more down-to-earth approach, to say the least. Nevertheless, this book is a solid milestone; you can only dive deeper from here on. Just ask the right questions and this book will guide you.
"Trustworthy Online Controlled Experiments" touches on pretty much everything you'd need to be aware of to design and analyze online experiments (power analysis, biases, metrics). Worth reading for anyone working in or trying to find a job in experimentation, but I'd also recommend reading some of the cited materials in the references if you're curious to understand anything more deeply. The book is mostly high-level -- so if you really want to understand experimentation beyond the baseline you need to be useful, you should also explore other resources.
A good book about experiments for your apps and services, what you can learn from them, and how to do them.
With a focus on how to start doing them in your company and how to build the necessary culture and tools.
Some details for PMs, some for engineers, most for both. Read it all together (organize a bookclub) to get inspired and try it yourself.
Introduces different types of biases and how to avoid them. Explains that experiments are not silver bullets: you should still do them, but you still might (and will) be wrong. Get used to it, learn from it, and move forward.
Excellent introduction to running A/B tests on web sites and other consumer internet applications, covering several important pitfalls that the inexperienced would otherwise be likely to make. Anyone who is responsible for either working on experiment systems or designing experiments and interpreting the results should read this unless you already have extensive experience.
I wish this book had been available back when I was creating new A/B test systems... but of course the research it depends on hadn't been done yet.
I may have been somewhat bored because I kinda knew most of the info from work, but I also think the book itself is a bit dry.
There were some interesting chapters, specifically the one on observational causal studies, the two chapters on the actual statistics behind A/B tests (though they don’t go very deep at all), and the section about alternatives to holdouts.
Wish they went more in depth on the stats and made the book more interesting overall, but I guess it’s supposed to be more like a manual than a casual read.
Excellent summary of all the important aspects of running A/B tests. My team runs more than a hundred experiments per quarter, and we hit pretty much every corner of A/B test scenarios. The book covers almost all of them and provides practical insight into what can be done to improve things. Strongly recommended to anyone running a large number of experiments; this valuable book can save you a lot of time learning all the tricks and pitfalls.
The basics are excellent and the first parts of the book were enjoyable and catchy to read. The more technical aspects of it were harder to follow for me (but I don't have enough statistics knowledge). There is definitely something to learn from this book if you're running online experiments.
I can't say that the book is not well written; it might be my lack of knowledge that made the last parts arduous to read, hence the four subjective stars.
This is a book packed with lots of information. It discusses details I never thought would matter, such as how to code the assignment of users to treatment and control groups. I would have thought it's a one-liner, but the authors show that when it comes to online experiments, even such a seemingly trivial thing matters. I only wish the book were a little more structured between chapters.
Great book. If your company does A/B testing (and if it doesn't, why not?), this is a very important read to understand how it works and, more importantly, how it can break. The first chapters are quite accessible to different readers, though obviously the more experience with statistics and probability theory you have, the better. I can't genuinely evaluate the last part, aimed at advanced readers; I just understand too little of it.
A book about A/B testing from leading tech giants' specialists, with a focus on web solutions.
It is a bible of A/B testing, the most practical book on the market, and a reference guide for any practical issue with testing. Almost every paragraph contains a reference to a scientific article or book for deeper study of an issue.
Moreover, it covers product analytics in general brilliantly as well, so one might say this is also one of the best books on product analytics.
This is a very good and informative book. I regularly encounter some of these difficulties and pitfalls when running online experiments, so it's definitely a good reference for what the industry practices are. However, it is a very technical, fact-heavy book grounded in academic research; I wouldn't say it is particularly inspiring or engrossing to read.