Great book covering many aspects of delivering software to production. Many software architects are "short-sighted": they think about building software that passes unit/integration tests, only to watch it fail shortly after release. The book describes such cases and shows good approaches and different ways to deal with the problem, including testing techniques like Chaos Monkey.
quotes:
Armed with a thread dump, the application is an open book, if you know how to read it. You can deduce a great deal about applications for which you’ve never seen the source code. You can tell:
- What third-party libraries an application uses
- What kind of thread pools it has
- How many threads are in each
- What background processing the application uses
- What protocols the application uses (by looking at the classes and methods in each thread’s stack trace)
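To make the thread-dump point concrete, here is a minimal Java sketch (my own, not from the book) that lists live threads and their top stack frames using the standard Thread.getAllStackTraces(); reading a jstack dump works the same way, by scanning thread names and frames:

```java
import java.util.Map;

// Prints each thread's name plus the top of its stack: the same clues
// (pool names, libraries, protocols) you'd look for in a jstack dump.
public class ThreadPeek {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] stack = e.getValue();
            System.out.printf("%s [%s]%n", t.getName(), t.getState());
            if (stack.length > 0) {
                // The topmost frame often names the library or protocol in use.
                System.out.println("    at " + stack[0]);
            }
        }
    }
}
```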
As much as RMI made cross-machine communication feel like local programming, it can be dangerous because calls cannot be made to time out. As a result, the caller is vulnerable to problems in the remote server.
The amazing thing is that the highly stable design usually costs the same to implement as the unstable one.
A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing. This is what most people mean by “stability.” It’s not just that your individual servers or applications stay up and running but rather that the user can still get work done.
The more tightly coupled the architecture, the greater the chance this coding error can propagate. Conversely, the less-coupled architectures act as shock absorbers, diminishing the effects of this error instead of amplifying them.
The events that caused the failure are not independent. A failure in one point or layer actually increases the probability of other failures. If the database gets slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
precise about these chains of events:
Fault: A condition that creates an incorrect internal state in your software. A fault may be due to a latent bug that gets triggered, or it may be due to an unchecked condition at a boundary or external interface.
Error: Visibly incorrect behavior. When your trading system suddenly buys ten billion dollars of Pokemon futures, that is an error.
Failure: An unresponsive system. When a system doesn’t respond, we say it has failed. Failure is in the eye of the beholder... a computer may have the power on but not respond to any requests.
Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate. At each step in the chain of failure, the crack from a fault may accelerate, slow, or stop.
caused a remote problem to turn into downtime. One way to prepare for every possible failure is to look at every external call, every I/O, every use of resources, and every expected outcome and ask, “What are all the ways this can go wrong?” Think about the different types of impulse and stress that can be applied:
Tight coupling allows cracks in one part of the system to propagate themselves—or multiply themselves—across layer or system boundaries. A failure in one component causes load to be redistributed to its peers and introduces delays and stress to its callers. This increased stress makes it extremely likely that another component in the system will fail. That in turn makes the next failure more likely, eventually resulting in total collapse. In your systems, tight coupling can appear within application code, in calls between systems, or any place a resource has multiple consumers.
A butterfly style has 2N connections, a spiderweb might have up to N(N−1)/2, and yours falls somewhere in between.
One wrinkle to watch out for, though, is that it can take a long time to discover that you can’t connect. Hang on for a quick dip into the details of TCP/IP networking. Every architecture diagram ever drawn has boxes and arrows, similar to the ones in the following figure. (A new architect will focus on the boxes; an experienced one is more interested in the arrows.)
You have to set the socket timeout if you want to break out of the blocking call. In that case, be prepared for an exception when the timeout occurs.
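A small sketch of that advice, assuming java.net.Socket; the host, port, and 5-second limits are placeholder values:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch: bound both the connect and the read so neither blocks forever.
public class BoundedSocketCall {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("example.com", 80), 5_000); // connect timeout
            socket.setSoTimeout(5_000); // read timeout: read() throws instead of blocking forever
            socket.getOutputStream().write("HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n".getBytes());
            try (InputStream in = socket.getInputStream()) {
                System.out.println("first byte: " + in.read());
            } catch (SocketTimeoutException timedOut) {
                // Be prepared for this exception when the timeout occurs.
                System.err.println("read timed out: " + timedOut.getMessage());
            }
        }
    }
}
```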
Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out. It checked validity by executing a SQL query like “SELECT SYSDATE FROM DUAL.”
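A rough sketch of such a validity check (my own illustration, not the book’s code); the Oracle test query is the one quoted, the 2-second limits are assumptions, and Connection.isValid is the built-in JDBC 4 alternative:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of a pool-style validity check before handing out a connection.
public class ConnectionCheck {
    static boolean looksAlive(Connection conn) {
        try (Statement stmt = conn.createStatement()) {
            stmt.setQueryTimeout(2);                 // don't let the check itself hang
            stmt.execute("SELECT SYSDATE FROM DUAL");
            return true;
        } catch (SQLException e) {
            return false;                            // treat any failure as a dead connection
        }
    }

    static boolean looksAliveJdbc4(Connection conn) throws SQLException {
        return conn.isValid(2);                      // built-in check since JDBC 4.0
    }
}
```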
Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it’s still alive. If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection.
The most effective stability patterns to combat integration point failures are Circuit Breaker and Decoupling Middleware.
Hunt for resource leaks. Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.
Stop cracks from jumping the gap. A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.
Scrutinize resource pools. A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
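A minimal sketch of a pool whose checkout is time-bounded, using a Semaphore; the generic resource type and timeout handling are illustrative, and real pools (for example, a JDBC pool’s connection-timeout setting) expose the same idea as configuration:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a resource pool whose checkout never blocks forever.
public class BoundedPool<R> {
    private final Queue<R> idle = new ConcurrentLinkedQueue<>();
    private final Semaphore available;

    public BoundedPool(Iterable<R> resources) {
        int count = 0;
        for (R r : resources) { idle.add(r); count++; }
        this.available = new Semaphore(count);
    }

    public R checkOut(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        // Wait at most `timeout`; an exhausted pool surfaces as an exception, not a hung thread.
        if (!available.tryAcquire(timeout, unit)) {
            throw new TimeoutException("no resource available within " + timeout + " " + unit);
        }
        return idle.poll();
    }

    public void checkIn(R resource) {
        idle.add(resource);
        available.release();
    }
}
```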
If you are running in the cloud, then autoscaling is your friend. But beware! It’s not hard to run up a huge bill by autoscaling buggy applications.
Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated.
That’s why I advocate supplementing internal monitors (such as log file scraping, process monitoring, and port monitoring) with external monitoring. A mock client somewhere (not in the same data center) can run synthetic transactions on a regular basis. That client experiences the same view of the system that real users experience. If that client cannot process the synthetic transactions, then there is a problem, whether or not the server process is running.
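A small sketch of such a synthetic-transaction probe using Java’s built-in HttpClient; the URL, the status/body check, and the timeouts are placeholders for a real scripted transaction:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch of an external synthetic-transaction probe (run it from outside the data center).
public class SyntheticProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/health"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        boolean healthy = response.statusCode() == 200 && response.body().contains("OK");
        System.out.println(healthy ? "synthetic transaction passed" : "ALERT: synthetic transaction failed");
    }
}
```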
If you find yourself synchronizing methods on your domain objects, you should probably rethink the design. Find a way that each thread can get its own copy of the object in question. This is important for two reasons. First, if you are synchronizing the methods to ensure data integrity, then your application will break when it runs on more than one server. In-memory coherence doesn’t matter if there’s another server out there changing the data. Second, your application will scale better if request-handling threads never block each other.
One elegant way to avoid synchronization on domain objects is to make your domain objects immutable.
When the time comes to alter their state, do it by constructing and issuing a “command object.” This style is called “Command Query Responsibility Segregation,” and it nicely avoids a large number of concurrency issues.
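A minimal sketch of the immutable-object-plus-command style; the Account and DebitAccount names are invented for illustration:

```java
import java.math.BigDecimal;

// Sketch: an immutable domain object plus a command object that expresses a state change.
public final class Account {
    private final String id;
    private final BigDecimal balance;

    public Account(String id, BigDecimal balance) {
        this.id = id;
        this.balance = balance;
    }

    // "Changing" state returns a new instance; threads never need to synchronize.
    public Account debit(BigDecimal amount) {
        return new Account(id, balance.subtract(amount));
    }

    public String id() { return id; }
    public BigDecimal balance() { return balance; }
}

// The command is just data; some handler applies it and publishes the new state.
record DebitAccount(String accountId, BigDecimal amount) { }
```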
In object theory, the Liskov substitution principle states that any property that is true about objects of a type T should also be true for objects of any subtype of T. In other words, a method without side effects in a base class should also be free of side effects in derived classes. A method that throws the exception E in base classes should throw only exceptions of type E (or subtypes of E) in derived classes.
Libraries are notorious sources of blocking threads, whether they are open-source packages or vendor code. Many libraries that work as service clients do their own resource pooling inside the library. These often make request threads block forever when a problem occurs. Of course, these never allow you to configure their failure modes, like what to do when all connections are tied up waiting for replies that’ll never come.
A blocked thread is often found near an integration point. These blocked threads can quickly lead to chain reactions if the remote end of the integration fails. Blocked threads and slow responses can create a positive feedback loop, amplifying a minor problem into a total failure.
Remember This: Recall that the Blocked Threads antipattern is the proximate cause of most failures.
Use proven primitives. Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue: it isn’t. Any library of concurrency utilities has more testing than your newborn queue.
Defend with Timeouts. You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid infinite waits in function calls; use a version that takes a timeout parameter. Always use timeouts, even though it means you need more error-handling code.
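A short sketch of the “version that takes a timeout parameter” advice using standard java.util.concurrent primitives; the 2-second limits are arbitrary:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: prefer the timeout-taking variants over the versions that can block forever.
public class TimeoutsEverywhere {
    private final BlockingQueue<String> work = new ArrayBlockingQueue<>(100);
    private final ReentrantLock lock = new ReentrantLock();

    void consumeOne() throws InterruptedException {
        String item = work.poll(2, TimeUnit.SECONDS);   // not take(), which waits forever
        if (item == null) {
            return;                                     // timed out; decide what "give up" means here
        }
        if (lock.tryLock(2, TimeUnit.SECONDS)) {        // not lock(), which can block indefinitely
            try {
                System.out.println("processing " + item);
            } finally {
                lock.unlock();
            }
        }
    }
}
```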
Autoscaling can help when the traffic surge does arrive, but watch out for the lag time. Spinning up new virtual machines takes precious minutes. My advice is to “pre-autoscale” by upping the configuration before the marketing event goes out.
Self-denial attacks originate inside your own organization, when people cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Send mass emails in waves to spread out the peak load. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.
Too often, though, the shared resource will be allocated for exclusive use while a client is processing some unit of work. In these cases, the probability of contention scales with the number of transactions processed by the layer and the number of clients in that layer. When the shared resource saturates, you get a connection backlog. When the backlog exceeds the listen queue, you get failed transactions. At that point, nearly anything can happen. It depends on what function the caller needs the shared resource to provide. Particularly in the case of cache managers (providing coherency for distributed caches), failed transactions lead to stale data or—worse—loss of data integrity.
When a bunch of servers impose this transient load all at once, it’s called a dogpile. (“Dogpile” is a term from American football in which the ball-carrier gets compressed at the base of a giant pyramid of steroid-infused flesh.)
A pulse can develop during load tests, if the virtual user scripts have fixed-time waits in them. Instead, every pause in a script should have a small random delta applied.
Dogpiles force you to spend too much to handle peak demand. A dogpile concentrates demand. It requires a higher peak capacity than you’d need if you spread the surge out.
Use random clock slew to diffuse the demand. Don’t set all your cron jobs for midnight or any other on-the-hour time. Mix them up to spread the load out.
Use increasing backoff times to avoid pulsing. A fixed retry interval will concentrate demand from callers on that period. Instead, use a backoff algorithm so different callers will be at different points in their backoff periods.
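A small sketch of increasing backoff with a random component (which also covers the randomized script pauses mentioned above); the base delay, cap, and jitter range are illustrative choices:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with jitter so retrying callers spread out
// instead of pulsing in lockstep.
public class BackoffWithJitter {
    static long delayMillis(int attempt) {
        long base = 100;                                   // first retry after ~100 ms
        long cap = 30_000;                                 // never wait more than 30 s
        long exp = Math.min(cap, base * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp / 2, exp + 1);  // random slew
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.printf("attempt %d: sleep ~%d ms%n", attempt, delayMillis(attempt));
        }
    }
}
```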
We can implement similar safeguards in our control plane software:
- If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.
- Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off.
- When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot.
- Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances.
- Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines because the high load persists.
A quick failure allows the calling system to finish processing the transaction rapidly. Whether that is ultimately a success or a failure depends on the application logic. A slow response, on the other hand, ties up resources in the calling system and the called system.
Memory leaks often manifest via Slow Responses as the virtual machine works harder and harder to reclaim enough space to process a transaction.
More frequently, however, I see applications letting their sockets’ send buffers get drained and their receive buffers fill up, causing a TCP stall. This usually happens in a hand-rolled, low-level socket protocol, in which the read routine does not loop until the receive buffer is drained.
Many APIs offer both a call with a timeout and a simpler, easier call that blocks forever. It would be better if, instead of overloading a single function, the no-timeout version were labeled “CheckoutAndMaybeKillMySystem.”
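Future.get() versus Future.get(timeout, unit) is a handy everyday example of that pair; a quick sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: Future.get() is the "CheckoutAndMaybeKillMySystem" flavor;
// Future.get(timeout, unit) is the one to reach for.
public class PreferTheTimeoutOverload {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> reply = pool.submit(() -> {
            Thread.sleep(10_000);                        // simulate a slow integration point
            return "too late";
        });
        try {
            String value = reply.get(2, TimeUnit.SECONDS);   // bounded wait
            System.out.println(value);
        } catch (TimeoutException e) {
            reply.cancel(true);                          // give up and keep moving
            System.err.println("gave up after 2 seconds");
        } finally {
            pool.shutdownNow();
        }
    }
}
```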
Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses. The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.
Apply Timeouts to recover from unexpected failures. When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.
Consider delayed retries. Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
Circuit Breaker: Not too long ago, when electrical wiring was first being built into houses, many people fell victim to physics.
Now, circuit breakers protect overeager gadget hounds from burning their houses down. The principle is the same: detect excess usage, fail first, and open the circuit.
More abstractly, the circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system (the house). Furthermore, once the danger has passed, the circuit breaker can be reset to restore full function to the system.
Leaky Bucket pattern from Pattern Languages of Program Design: It’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course). If the count exceeds a threshold, then you know that faults are arriving quickly.
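A minimal sketch of that leaky-bucket counter; the one-per-second drain and the threshold of 5 are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the Leaky Bucket fault counter: increment on each observed fault,
// drain one unit per second in the background, alarm past a threshold.
public class LeakyBucket {
    private final AtomicInteger level = new AtomicInteger();
    private final int threshold;

    public LeakyBucket(int threshold, ScheduledExecutorService scheduler) {
        this.threshold = threshold;
        // Background "leak": decrement periodically, never below zero.
        scheduler.scheduleAtFixedRate(
                () -> level.updateAndGet(n -> Math.max(0, n - 1)),
                1, 1, TimeUnit.SECONDS);
    }

    /** Record one fault; returns true if faults are arriving faster than they drain. */
    public boolean recordFault() {
        return level.incrementAndGet() > threshold;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        LeakyBucket bucket = new LeakyBucket(5, scheduler);
        for (int i = 0; i < 8; i++) {
            System.out.println("fault " + i + " -> tripping=" + bucket.recordFault());
        }
        scheduler.shutdown();
    }
}
```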
Operations needs some way to directly trip or reset the circuit breaker. The circuit breaker is also a convenient place to gather metrics about call volumes and response times.
Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses. They work so closely with timeouts that they often track timeout failures separately from execution failures.
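For flavor, a deliberately minimal circuit-breaker sketch (not the book’s implementation): closed until a run of consecutive failures, then open for a cooldown, then a single trial call. The threshold and cooldown are illustrative, and a production breaker would not hold a lock across the protected call and would track timeouts separately from execution failures, as noted above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// Minimal, non-production circuit breaker sketch.
public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold = 5;
    private final Duration cooldown = Duration.ofSeconds(30);

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(cooldown) < 0) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN;                // cooldown elapsed: allow one trial call
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures = 0;                // success closes the circuit again
            state = State.CLOSED;
            return result;
        } catch (Exception failure) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            throw failure;
        }
    }
}
```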
The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.
Pick a useful granularity. You can partition thread pools inside an application, CPUs in a server, or servers in a cluster.
Consider Bulkheads particularly with shared services models. Failures in service-oriented or microservice architectures can propagate very quickly. If your service goes down because of a Chain Reaction, does the entire company come to a halt? Then you’d better put in some Bulkheads.
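A tiny sketch of Bulkheads at thread-pool granularity; the pool sizes and the “inventory”/“pricing” service names are invented:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of Bulkheads: each downstream dependency gets its own bounded pool,
// so a hang in one cannot starve threads needed for the other.
public class Bulkheads {
    private final ExecutorService inventoryPool = Executors.newFixedThreadPool(10);
    private final ExecutorService pricingPool = Executors.newFixedThreadPool(10);

    Future<String> lookUpInventory(String sku) {
        return inventoryPool.submit(() -> "inventory for " + sku);   // call inventory service here
    }

    Future<String> lookUpPrice(String sku) {
        return pricingPool.submit(() -> "price for " + sku);         // call pricing service here
    }
}
```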
Nevertheless, someday your little database will grow up. When it hits the teenage years—about two in human years—it’ll get moody, sullen, and resentful. In the worst case, it’ll start undermining the whole system (and it will probably complain that nobody understands it, too).
There are few general rules here. Much depends on the database and libraries in use. RDBMS plus ORM tends to deal badly with dangling references, for example, whereas a document-oriented database won’t even notice.
One log file is like one pile of cow dung—not very valuable, and you’d rather not dig through it. Collect tons of cow dung and it becomes “fertilizer.” Likewise, if you collect enough log files you can discover value.
Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.
To a long-running server, memory is like oxygen. Cache, left untended, will suck up all the oxygen. Low memory conditions are a threat to both stability and capacity.
Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts. Nothing gets administrators in the habit of being logged onto production like daily (or nightly) chores.
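One common way to keep a cache from suffocating the process (my illustration, not a prescription from the quote) is to give it a hard size bound with LRU eviction, for example via LinkedHashMap; the 10,000-entry limit is arbitrary:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A size-bounded LRU cache: old entries are evicted instead of accumulating forever.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder=true gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the least recently used entry
    }
}
```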
Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.
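A small sketch of that distinction using two hypothetical exception types, so callers and circuit breakers can treat resource failures differently from bad input:

```java
// Sketch: distinguish "the system is broken" from "the caller sent bad data"
// so upstream circuit breakers only trip on the former. Both exception
// classes are hypothetical names for illustration.
class SystemUnavailableException extends RuntimeException {
    SystemUnavailableException(String message) { super(message); }
}

class InvalidRequestException extends RuntimeException {
    InvalidRequestException(String message) { super(message); }
}

class OrderService {
    void placeOrder(String customerId, int quantity) {
        if (quantity <= 0) {
            // Application failure: the caller's problem; don't count it against the breaker.
            throw new InvalidRequestException("quantity must be positive");
        }
        if (!databaseReachable()) {
            // System failure: fail fast before doing any work; breakers may trip on this.
            throw new SystemUnavailableException("order database unavailable");
        }
        // ... proceed with the transaction ...
    }

    private boolean databaseReachable() { return true; }   // placeholder check
}
```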
Avoid Slow Responses and Fail Fast. If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.
Reserve resources, verify Integration Points early. In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.
Sometimes the best thing you can do to create system-level stability is to abandon component-level stability. In the Erlang world, this is called the “let it crash” philosophy.
We must be able to get back into that clean state and resume normal operation as quickly as possible. Otherwise, we’ll see performance degrade when too many of our instances are restarting at the same time. In the limit, we could have loss of service because all of our instances are busy restarting. With in-process components like actors, the restart time is measured in microseconds. Callers are unlikely to really notice that kind of disruption. You’d have to set up a special test case just to measure it.
Actor systems use a hierarchical tree of supervisors to manage the restarts. Whenever an actor terminates, the runtime notifies the supervisor. The supervisor can then decide to restart the child actor, restart all of its children, or crash itself. If the supervisor crashes, the runtime will terminate all its children and notify the supervisor’s supervisor. Ultimately you can get whole branches of the supervision tree to restart with a clean state. The design of the supervision tree is integral to the system design.
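As a very rough Java analogue (actor runtimes such as Erlang/OTP or Akka do this hierarchically and far more cheaply), a supervisor can simply rebuild a crashed worker from scratch instead of trying to repair its state:

```java
// Crude analogue of "let it crash" supervision in plain Java: the supervisor
// loop rebuilds the worker whenever it dies, so each restart begins from a clean state.
public class TinySupervisor {
    interface Worker { void run() throws Exception; }

    public static void supervise(java.util.function.Supplier<Worker> freshWorker) {
        while (!Thread.currentThread().isInterrupted()) {
            Worker worker = freshWorker.get();       // clean state on every restart
            try {
                worker.run();
                return;                              // normal completion: stop supervising
            } catch (Exception crash) {
                if (crash instanceof InterruptedException) {
                    Thread.currentThread().interrupt();   // stop supervising on interruption
                }
                System.err.println("worker crashed (" + crash + "); restarting with clean state");
            }
        }
    }
}
```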
Crash components to save systems. It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.