Sometimes Kill -9 Isn’t Enough

If there’s one thing to know about distributed systems, it’s that they have to be designed with the expectation of failure. It’s also safe to say that most software these days is, in some form, distributed—whether it’s a database, mobile app, or enterprise SaaS. If you have two different processes talking to each other, you have a distributed system, and it doesn’t matter if those processes are local or intergalactically displaced.

Marc Hedlund recently had a great post on Stripe’s game-day exercises where they block off an afternoon, take a blunt instrument to their servers, and see what happens. We’re talking like abruptly killing instances here—kill -9, ec2-terminate-instances, yanking on the damn power cord—that sort of thing. Everyone should be doing this type of stuff. You really don’t know how your system behaves until you see it under failure conditions.

Netflix uses Chaos Monkey to randomly terminate instances, and they do it in production. That takes some balls, but you know you have a pretty solid system when you’re comfortable killing live production servers. At Workiva, we have a middleware we use to inject datastore and other RPC errors into Google App Engine. Building resilient systems is an objective concern, but we still have a ways to go.

We need to be pessimists and design for failure, but injecting failure isn’t enough. Sure, every so often shit hits the proverbial fan, and we need to be tolerant of that. But more often than not, that fan is just a strong headwind.

Simulating failure is a necessary element for building reliable distributed systems, but system behavior isn’t black and white, it’s a continuum. We build our system in a vacuum and (hopefully) test it under failure, but we should also be observing it in this gray area. How does it perform with unreliable network connections? Low bandwidth? High latency? Dropped packets? Out-of-order packets? Duplicate packets? Not only do our systems need to be fault-tolerant, they need to be pressure-tolerant.

Simulating Pressure

There are a lot of options to do these types of “pressure” simulations. On Linux, we can use iptables to accomplish this.

This will drop incoming and outgoing packets with a 10% probability. Alternatively, we can use tc to simulate network latency, limited bandwidth, and packet loss.

The above adds an additional 250ms of latency with 10% packet loss and a bandwidth limit of 1Mbps. Likewise, on OSX and BSD we can use ipfw or pfctl.

Here we inject 500ms of latency while limiting bandwidth to 1Mbps and dropping 10% of packets.

These are just some very simple traffic-shaping examples. Several of these tools allow you to perform even more advanced testing, like adding variation and correlation values. This would allow you to emulate burst packet loss and other situations we often encounter. For instance, with tc, we can add jitter to the network latency.

This adds 50±20ms of latency. Since network latency typically isn’t uniform, we can apply a normal distribution to achieve a more realistic simulation.

Now we get a nice bell curve which is probably more representative of what we see in practice. We can also use tc to re-order, duplicate, and corrupt packets.

I’ve been working on an open-source tool which attempts to wrap these controls up so you don’t have to memorize the options or worry about portability. It’s pretty primitive and doesn’t support much yet, but it provides a thin layer of abstraction.

Conclusion

Injecting failure is crucial to understanding systems and building confidence, but like good test coverage, it’s important to examine suboptimal-but-operating scenarios. This isn’t even 99th-percentile stuff—this is the type of shit your users deal with every single day. If you can’t handle sustained latency and sporadic network partitions, who cares if you tolerate instance failure? The tools are at our disposal, they just need to be leveraged.

He Sed, She Sed

Shortly after switching to GitHub, I decided to relicense Infinitum from GNU LGPL to Apache License 2.0. There aren’t really any implications except one: replacing the license and copyright header in every source file.

I’m far from being a Unix expert (more like amateur at best), but I figured sed would be the quickest and easiest way to do this. Sed is a Unix utility for processing text streams, and it allows you to replace string patterns in files. A simple string replacement using sed is quite easy:

This will replace “foo” with “bar”. The “g” indicates that every matching occurrence in file.txt will be replaced, and “-i” means it will do the replacement in place.

In my case, I wanted to find every occurrence of the following string in every Java source file:

And I wanted to replace it with this:

I needed to do a multi-line replacement across a couple hundred files. Feeding lots of files to sed is actually pretty simple:

This command will pass all of the Java files in the current directory (and all sub-directories)  to sed. The reason xargs is needed is because it lets us avoid the “Argument list too long” problem.

In order to replace multiple lines, I needed to use an additional feature of sed. The “c” command lets you replace a range of lines:

There’s a caveat that I have so far ignored. Many Unix utilities have idiosyncrasies or differences between platforms, and sed is no exception. I failed to mention that I was doing this on Mac OSX, whose implementation of sed, as I encountered, had some peculiar quirks. The “-e” in the above command is one such quirk as it’s needed to perform an in-place pattern replacement on OSX.

So, I had a way to process a bunch of files at once and a way to replace multiple lines in a file. Now I just needed to combine these two techniques to replace the license header in all of my project files.

This replaces the range of lines covering the original license with the new license. It works, but the formatting becomes slightly off. That’s because OSX’s sed does not preserve leading whitespace, so the space before each asterisk is stripped. Fortunately, GNU sed does preserve leading whitespace, so building that and using it in place of OSX’s sed solved the problem. Also, GNU sed doesn’t require “-e” for in-place replacement.

Sed is a very handy little tool that every developer should have in his or her toolbelt. Admittedly, I don’t leverage Unix’s utilities nearly enough (although I’m working on it!), but tools like grep, sed, find, and xargs are immensely powerful and pretty simple to use. I think some developers have a tendency to over-engineer solutions for problems that could otherwise be solved using a trivial combination of these tools — I know I have! Of course, it’s helped that I’ve started to do all my programming, both work and play, on Mac. It’s my goal to become a better Unix developer!