Are We There Yet: The Go Generics Debate

At GopherCon a couple weeks ago, Russ Cox gave a talk titled The Future of Go, in which he discussed what the Go community might want to change about the language—particularly for the so-called Go 2.0 milestone—and the process for realizing those changes. Part of that process is identifying real-world use cases through experience reports, which turn an abstract problem into a concrete one and help the core team to understand its significance. Also mentioned in the talk, of course, were generics. Over the weekend, Dave Cheney posted Should Go 2.0 support generics? Allow me to add to the noise.

First, I agree with Dave’s point that the Go team has two choices: implement templated types and parameterized functions (or some equivalent thereof) or don’t and own that decision. Though I think his example of Haskell programmers owning their differences with imperative programming is a bit of a false equivalence. Haskell is firmly in the functional camp, while Go is firmly in the imperative one. While there is overlap between the two, I think most programmers understand the difference. Few people are asking Haskell to be more PHP-like, but I bet a lot more are asking Elm to be more Haskell-like. Likewise, lots of people are asking Go to be more Java-like—if only a little. This is why so many are asking for generics, and Dave says it himself: “Mainstream programmers expect some form of templated types because they’re used to it in the other languages they interact with alongside Go.”

But I digress. The point is that the Go team should not pay any lip service to the generics discussion if they are not going to fully commit to addressing the problem. There is plenty of type theory and prior art. It’s not a technical problem, it’s a philosophical one.

That said, if the Go team decides to say “no” to generics and own that decision, I suspect they will never fully put an end to the discussion. The similarities to other mainstream languages are too close, unlike the PHP/Haskell example, and the case for generics too compelling, unlike the composition-over-inheritance decision. The lack of generics in Go has already become a meme at this point.

Ian Lance Taylor’s recent post shows there is a divide within the Go team regarding generics. He tells the story of how the copy() and append() functions came about, noting that they were added only because of the lack of generics.

The point I want to make here is that because we had no way to write a generic Vector type with an Append method, we wound up adding a special purpose language feature to implement it. A language that supported parameterized types with methods would not have required a special built-in function that only works with slices. An append operation makes sense for other sorts of data structures, such as various kinds of linked lists. The built-in append function can not be used for them.

My biggest complaint on the matter at this point in the language’s life is the doublethink. That is, the mere existence of channels, maps, and slices seems like a contradiction to the argument against generics. The experience reports supporting generics are trivial: every time someone instantiates one of these things. The argument is that having a small number of built-in generic types is substantially different from allowing them to be user-defined—that letting generic APIs be strewn about will increase the cognitive load on developers. Yet, at the same time, we’re okay with adding new APIs to the standard library that provide no compile-time type safety because they subvert the type system with interface{}. That pushes the responsibility for type safety onto users, who must handle it at runtime in precarious fashion, for something the compiler is well-equipped to handle. So which is the greater cognitive load? More generally, where should the responsibility lie? Considering Go’s lineage and its aim at the “ordinary” programmer, I’d argue safety—in this case—should be the language’s responsibility.

However, we haven’t fully addressed the “complex generic APIs strewn about” argument. Generics are not the source of complexity in API design, poorly designed APIs are. For every bad generic API in Java, I’ll show you a good one. In Go, I would argue more complexity comes from the workarounds to the lack of generics—interface{}, reflection, code duplication, code generation, type assertions—than from introducing them into the language, not to mention the performance cost with some of these workarounds. The heap and list packages are some of my least favorite packages in the language, largely due to their use of interface{}.
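
To make that concrete, here’s a trivial use of container/list. Because everything is stored as interface{}, the type check is deferred to a runtime assertion—exactly the kind of thing a parameterized type would catch at compile time.

package main

import (
    "container/list"
    "fmt"
)

func main() {
    l := list.New()
    l.PushBack(42)

    v := l.Front().Value // static type is interface{}
    n, ok := v.(int)     // the type check happens at runtime, not compile time
    if !ok {
        panic("not an int")
    }
    fmt.Println(n + 1)
}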

Another frustration I have with the argument against generics is the anecdotal evidence—“I’ve never experienced a need for generics, so anyone who has must be wrong.” It’s a weird rationalization. It’s like a C programmer arguing against garbage collection to build a web app because free() works fine. Someone pointed out to me that what’s actually going on is the Blub paradox, a term coined by Paul Graham.

Graham considers the hierarchy of programming languages with the example of “Blub”, a hypothetically average language “right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.” It was used by Graham to illustrate a comparison, beyond Turing completeness, of programming language power, and more specifically to illustrate the difficulty of comparing a programming language one knows to one that one does not.

Graham considers a hypothetical Blub programmer. When the programmer looks down the “power continuum”, he considers the lower languages to be less powerful because they miss some feature that a Blub programmer is used to. But when he looks up, he fails to realise that he is looking up: he merely sees “weird languages” with unnecessary features and assumes they are equivalent in power, but with “other hairy stuff thrown in as well”. When Graham considers the point of view of a programmer using a language higher than Blub, he describes that programmer as looking down on Blub and noting its “missing” features from the point of view of the higher language.

Graham describes this as the “Blub paradox” and concludes that “By induction, the only programmers in a position to see all the differences in power between the various languages are those who understand the most powerful one.”

I sympathize with the Go team’s desire to keep the overall surface area of the language small and the complexity low, but I have a hard time reconciling this with the existing built-in generics and continued use of interface{} in the standard library. At the very least, I think Go would be better off with continued use of built-in generics for certain types instead of adding more APIs using interface{} like sync.Map, sync.Pool, and atomic.Value. I certainly think the debate is worth having, but the hero-worship and genetic-fallacy style arguments do not further the discussion. My gut feeling is that Go will eventually have generics. It’s just a question of when and how.

So You Wanna Go Fast?

I originally proposed this as a GopherCon talk on writing “high-performance Go”, which is why it may seem rambling, incoherent, and—at times—not at all related to Go. The talk was rejected (probably because of the rambling and incoherence), but I still think it’s a subject worth exploring. The good news is, since it was rejected, I can take this where I want. The remainder of this piece is mostly the outline of that talk with some parts filled in, some meandering stories which may or may not pertain to the topic, and some lessons learned along the way. I think it might make a good talk one day, but this will have to do for now.

We work on some interesting things at Workiva—graph traversal, distributed and in-memory calculation engines, low-latency messaging systems, databases optimized for two-dimensional data computation. It turns out, when you want to build a complicated financial-reporting suite with the simplicity and speed of Microsoft Office, and put it entirely in the cloud, you can’t really just plumb some crap together and call it good. It also turns out that when you try to do this, performance becomes kind of important, not because of the complexity of the data—after all, it’s mostly just numbers and formulas—but because of the scale of it. Now, distribute that data in the cloud, consider the security and compliance implications associated with it, add in some collaboration and control mechanisms, and you’ve got yourself some pretty monumental engineering problems.

As I hinted at, performance starts to be really important, whether it’s performing a formula evaluation, publishing a data-change event, or opening up a workbook containing a million rows of data (accountants are weird). A lot of the backend systems powering all of this are, for better or worse, written in Go. Go is, of course, a garbage-collected language, and it compares closely to Java (though the latter has over 20 years invested in it, while the former has about seven).

At this point, you might be asking, “why not C?” It’s honestly a good question to ask, but the reality is there is always history. The first solution was written in Python on Google App Engine (something about MVPs, setting your customers’ expectations low, and giving yourself room to improve?). This was before Go was even a thing, though Java and C were definitely things, but this was a startup. And it was Python. And it was on App Engine. I don’t know exactly what led to that combination of things—I wasn’t there—but, truthfully, App Engine probably played a large role in the company’s early success. Python and App Engine were fast. Not like “this code is fucking fast” fast—what we call performance—more like “we need to get this shit working so we have jobs tomorrow” fast—what we call delivery. I don’t envy that kind of fast, but when you’re a startup trying to disrupt, speed to market matters a hell of a lot more than the speed of your software.

I’ve talked about App Engine at length before. Ultimately, you hit the ceiling of what you can do with it, and you have to migrate off (if you’re a business that is trying to grow, anyway). We hit that migration point at a really weird, uncomfortable time. This was right when Docker was starting to become a thing, and microservices were this thing that everybody was talking about but nobody was doing. Google had been successfully using containers for years, and Netflix was all about microservices. Everybody wanted to be like them, but no one really knew how—but it was the future (unikernels are the new future, by the way).

The problem is—coming from a PaaS like App Engine that does your own laundry—you don’t have the tools, skills, or experience needed to hit the ground running, so you kind of drunkenly stumble your way there. You don’t even have a DevOps team because you didn’t need one! Nobody knew how to use Docker, which is why at the first Dockercon, five people got on stage and presented five solutions to the same problem. It was the blind leading the blind. I love this article by Jesper L. Andersen, How to build stable systems, which contains a treasure trove of practical engineering tips. The very last paragraph of the article reads:

Docker is not mature (Feb 2016). Avoid it in production for now until it matures. Currently Docker is a time sink not fulfilling its promises. This will change over time, so know when to adopt it.

Trying to build microservices using Docker while everyone is stumbling over themselves was, and continues to be, a painful process, exacerbated by all the weight App Engine had been carrying for us suddenly landing back on our shoulders. It’s not great if you want to go fast. App Engine made scaling easy by restricting you in what you could do, but once that burden was removed, it was off to the races. What people might not have realized, however, was that App Engine also made distributed systems easy by restricting you in what you could do. Some seem to think the limitations enforced by App Engine are there to make their lives harder or make Google richer (trust me, they’d bill you more if they could), so why would we have similar limitations in our own infrastructure? App Engine imposes these limitations, of course, so that it can actually scale. Don’t take that for granted.

App Engine was stateless, so the natural tendency once you’re off it was to make everything stateful. And we did. What I don’t think we realized was that we were, in effect, trading one type of fast for the other—performance for delivery. We can build software that’s fast and runs on your desktop PC like in the 90’s, but now you want to put that in the cloud and make it scale? It takes a big infrastructure investment. It also takes a big time investment. Neither of which are good if you want to go fast, especially when you’re using enough microservices, Docker, and Go to rattle the Hacker News fart chamber. You get so caught up in this endless churn of innovation that you almost lose your balance. Leaving the statelessness of App Engine for more stateful pastures was sort of like an infant learning to walk. You look down and it dawns on you—you have legs! So you run with it, because that’s amazing, and you stumble spectacularly a few times along the way. Finally, you realize maybe running full speed isn’t the best idea for someone who just learned to walk.

We were also making this transition while Go had started reaching critical mass. Every other headline in the tech aggregators was “why we switched to Go and you should too.” And we did. I swear this post has a point.

Tips for Writing High-Performance Go

By now, I’ve forgotten what I was writing about, but I promised this post was about Go. It is, and it’s largely about performance fast, not delivery fast—the two are often at odds with each other. Everything up until this point was mostly just useless context and ranting. But it also shows you that we are solving some hard problems and why we are where we are. There is always history.

I work with a lot of smart people. Many of us have a near obsession with performance, but the point I was attempting to make earlier is we’re trying to push the boundaries of what you can expect from cloud software. App Engine had some rigid boundaries, so we made a change. Since adopting Go, we’ve learned a lot about how to make things fast and how to make Go work in the world of systems programming.

Go’s simplicity and concurrency model make it an appealing choice for backend systems, but the larger question is how does it fare for latency-sensitive applications? Is it worth sacrificing the simplicity of the language to make it faster? Let’s walk through a few areas of performance optimization in Go—namely language features, memory management, and concurrency—and try to make that determination. All of the code for the benchmarks presented here is available on GitHub.

Channels

Channels in Go get a lot of attention because they are a convenient concurrency primitive, but it’s important to be aware of their performance implications. Usually the performance is “good enough” for most cases, but in certain latency-critical situations, they can pose a bottleneck. Channels are not magic. Under the hood, they are just doing locking. This works great in a single-threaded application where there is no lock contention, but in a multithreaded environment, performance significantly degrades. We can mimic a channel’s semantics quite easily using a lock-free ring buffer.

The first benchmark looks at the performance of a single-item-buffered channel and ring buffer with a single producer and single consumer. First, we look at the performance in the single-threaded case (GOMAXPROCS=1).
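
For reference, the channel side of this benchmark looks roughly like the following (the ring-buffer variant has the same shape, with the send and receive replaced by Put and Get calls on the buffer):

package benchmarks

import "testing"

func BenchmarkChannel(b *testing.B) {
    ch := make(chan struct{}, 1) // single-item buffer
    go func() {
        // Single producer feeding the channel.
        for i := 0; i < b.N; i++ {
            ch <- struct{}{}
        }
    }()
    // Single consumer draining it.
    for i := 0; i < b.N; i++ {
        <-ch
    }
}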

BenchmarkChannel 3000000 512 ns/op
BenchmarkRingBuffer 20000000 80.9 ns/op

As you can see, the ring buffer is roughly six times faster (if you’re unfamiliar with Go’s benchmarking tool, the first number next to the benchmark name is the number of iterations the benchmark ran, and the second is the average time per operation). Next, we look at the same benchmark with GOMAXPROCS=8.

BenchmarkChannel-8 3000000 542 ns/op
BenchmarkRingBuffer-8 10000000 182 ns/op

The ring buffer is almost three times faster.

Channels are often used to distribute work across a pool of workers. In this benchmark, we look at performance with high read contention on a buffered channel and ring buffer. The GOMAXPROCS=1 test shows how channels are decidedly better for single-threaded systems.

BenchmarkChannelReadContention 10000000 148 ns/op
BenchmarkRingBufferReadContention 10000 390195 ns/op

However, the ring buffer is faster in the multithreaded case:

BenchmarkChannelReadContention-8 1000000 3105 ns/op
BenchmarkRingBufferReadContention-8 3000000 411 ns/op

Lastly, we look at performance with contention on both the reader and writer. Again, the ring buffer’s performance is much worse in the single-threaded case but better in the multithreaded case.

BenchmarkChannelContention 10000 160892 ns/op
BenchmarkRingBufferContention 2 806834344 ns/op
BenchmarkChannelContention-8 5000 314428 ns/op
BenchmarkRingBufferContention-8 10000 182557 ns/op

The lock-free ring buffer achieves thread safety using only CAS operations. We can see that deciding to use it over the channel depends largely on the number of OS threads available to the program. For most systems, GOMAXPROCS > 1, so the lock-free ring buffer tends to be a better option when performance matters. Channels are a rather poor choice for performant access to shared state in a multithreaded system.
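
To give a feel for the approach, here is a stripped-down, single-producer/single-consumer sketch that uses atomic loads and stores of the positions. The buffer used in these benchmarks is a multi-producer, multi-consumer design built around CAS loops, so treat this only as an illustration of the idea, not the measured implementation.

package ringbuffer

import "sync/atomic"

// RingBuffer is a single-producer, single-consumer ring buffer. The queue and
// dequeue counters only ever increase; indexing uses a power-of-two mask.
type RingBuffer struct {
    queue   uint64 // producer position
    dequeue uint64 // consumer position
    mask    uint64
    items   []interface{}
}

func New(size uint64) *RingBuffer {
    // size must be a power of two so we can mask instead of mod.
    return &RingBuffer{items: make([]interface{}, size), mask: size - 1}
}

// Put adds an item, returning false if the buffer is full.
func (r *RingBuffer) Put(v interface{}) bool {
    q := atomic.LoadUint64(&r.queue)
    if q-atomic.LoadUint64(&r.dequeue) == uint64(len(r.items)) {
        return false // full
    }
    r.items[q&r.mask] = v
    atomic.StoreUint64(&r.queue, q+1)
    return true
}

// Get removes an item, returning false if the buffer is empty.
func (r *RingBuffer) Get() (interface{}, bool) {
    d := atomic.LoadUint64(&r.dequeue)
    if d == atomic.LoadUint64(&r.queue) {
        return nil, false // empty
    }
    v := r.items[d&r.mask]
    atomic.StoreUint64(&r.dequeue, d+1)
    return v, true
}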

Defer

Defer is a useful language feature in Go for readability and avoiding bugs related to releasing resources. For example, when we open a file to read, we need to be careful to close it when we’re done. Without defer, we need to ensure the file is closed at each exit point of the function.

This is really error-prone since it’s easy to miss a return point. Defer solves this problem by effectively adding the cleanup code to the stack and invoking it when the enclosing function returns.
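
A minimal example of the pattern (readConfig is just an illustrative name):

package config

import (
    "io/ioutil"
    "os"
)

// The deferred Close runs when readConfig returns, no matter which return
// statement fires, so no exit point can forget the cleanup.
func readConfig(path string) ([]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    return ioutil.ReadAll(f)
}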

At first glance, one would think defer statements could be completely optimized away by the compiler. If I defer something at the beginning of a function, simply insert the closure at each point the function returns. However, it’s more complicated than this. For example, we can defer a call within a conditional statement or a loop. The first case might require the compiler to track the condition leading to the defer. The compiler would also need to be able to determine if a statement can panic since this is another exit point for a function. Statically proving this seems to be, at least on the surface, an undecidable problem.

The point is defer is not a zero-cost abstraction. We can benchmark it to show the performance overhead. In this benchmark, we compare locking a mutex and unlocking it with a defer in a loop to locking a mutex and unlocking it without defer.
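
An approximation of that benchmark—the deferred version wraps the lock/unlock pair in a closure so the defer fires on every iteration:

package benchmarks

import (
    "sync"
    "testing"
)

func BenchmarkMutexDeferUnlock(b *testing.B) {
    var mu sync.Mutex
    for i := 0; i < b.N; i++ {
        func() {
            mu.Lock()
            defer mu.Unlock()
        }()
    }
}

func BenchmarkMutexUnlock(b *testing.B) {
    var mu sync.Mutex
    for i := 0; i < b.N; i++ {
        mu.Lock()
        mu.Unlock()
    }
}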

BenchmarkMutexDeferUnlock-8 20000000 96.6 ns/op
BenchmarkMutexUnlock-8 100000000 19.5 ns/op

Using defer is almost five times slower in this test. To be fair, we’re looking at a difference of 77 nanoseconds, but in a tight loop on a critical path, this adds up. One trend you’ll notice with these optimizations is it’s usually up to the developer to make a trade-off between performance and readability. Optimization rarely comes free.

Reflection and JSON

Reflection is generally slow and should be avoided for latency-sensitive applications. JSON is a common data-interchange format, but Go’s encoding/json package relies on reflection to marshal and unmarshal structs. With ffjson, we can use code generation to avoid reflection and benchmark the difference.
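
The workflow looks roughly like this: mark the file with a go:generate directive and let ffjson emit MarshalJSON and UnmarshalJSON methods for the struct (the Event type here is illustrative, not the struct used in the benchmarks):

package model

//go:generate ffjson $GOFILE

// Event gets MarshalJSON and UnmarshalJSON methods in the generated
// *_ffjson.go file once `go generate` is run, bypassing encoding/json's
// reflection path.
type Event struct {
    ID        int64  `json:"id"`
    Name      string `json:"name"`
    Timestamp int64  `json:"timestamp"`
}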

BenchmarkJSONReflectionMarshal-8 200000 7063 ns/op
BenchmarkJSONMarshal-8 500000 3981 ns/op

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op
BenchmarkJSONUnmarshal-8 300000 5839 ns/op

Code-generated JSON is about 38% faster than the standard library’s reflection-based implementation. Of course, if we’re concerned about performance, we should really avoid JSON altogether. MessagePack is a better option with serialization code that can also be generated. In this benchmark, we use the msgp library and compare its performance to JSON.

BenchmarkMsgpackMarshal-8 3000000 555 ns/op
BenchmarkJSONReflectionMarshal-8 200000 7063 ns/op
BenchmarkJSONMarshal-8 500000 3981 ns/op

BenchmarkMsgpackUnmarshal-8 20000000 94.6 ns/op
BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op
BenchmarkJSONUnmarshal-8 300000 5839 ns/op

The difference here is dramatic. Even when compared to the generated JSON serialization code, MessagePack is significantly faster.

If we’re really trying to micro-optimize, we should also be careful to avoid using interfaces, which have some overhead not just with marshaling but also method invocations. As with other kinds of dynamic dispatch, there is a cost of indirection when performing a lookup for the method call at runtime. The compiler is unable to inline these calls.

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op
BenchmarkJSONReflectionUnmarshalIface-8 200000 10099 ns/op

We can also look at the overhead of the invocation lookup, I2T, which converts an interface to its backing concrete type. This benchmark calls the same method on the same struct. The difference is the second one holds a reference to an interface which the struct implements.
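
The shape of that benchmark is roughly the following (type names are mine):

package benchmarks

import "testing"

type adder struct{ n int }

func (a adder) Add(x int) int { return a.n + x }

type addable interface {
    Add(x int) int
}

func BenchmarkStructMethodCall(b *testing.B) {
    a := adder{n: 1}
    for i := 0; i < b.N; i++ {
        a.Add(i) // static dispatch; the compiler can inline this
    }
}

func BenchmarkIfaceMethodCall(b *testing.B) {
    var a addable = adder{n: 1}
    for i := 0; i < b.N; i++ {
        a.Add(i) // dynamic dispatch through the interface's method table
    }
}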

BenchmarkStructMethodCall-8 2000000000 0.44 ns/op
BenchmarkIfaceMethodCall-8 1000000000 2.97 ns/op

Sorting is a more practical example that shows the performance difference. In this benchmark, we compare sorting a slice of 1,000,000 structs and 1,000,000 interfaces backed by the same struct. Sorting the structs is roughly 2.7 times faster than sorting the interfaces.

BenchmarkSortStruct-8 10 105276994 ns/op
BenchmarkSortIface-8 5 286123558 ns/op

To summarize, avoid JSON if possible. If you need it, generate the marshaling and unmarshaling code. In general, it’s best to avoid code that relies on reflection and interfaces and instead write code that uses concrete types. Unfortunately, this often leads to a lot of duplicated code, so it’s best to abstract this with code generation. Once again, the trade-off manifests.

Memory Management

Go doesn’t actually expose heap or stack allocation directly to the user. In fact, the words “heap” and “stack” do not appear anywhere in the language specification. This means anything pertaining to the stack and heap is technically implementation-dependent. In practice, of course, Go does have a stack per goroutine and a heap. The compiler does escape analysis to determine if an object can live on the stack or needs to be allocated in the heap.
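
You can see those decisions for yourself by building with -gcflags="-m", which prints the compiler’s escape analysis. A contrived example (the names are mine):

package alloc

type payload struct {
    data [16]int64
}

// The value never outlives the call, so the compiler keeps it on the stack.
func onStack() int64 {
    var p payload
    return p.data[0]
}

// A pointer to p escapes the function, so p must be allocated on the heap.
func onHeap() *payload {
    p := payload{}
    return &p
}

Building this with go build -gcflags="-m" reports something to the effect of “moved to heap: p” for the second function and nothing of the sort for the first.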

Unsurprisingly, avoiding heap allocations can be a major area of optimization. By allocating on the stack, we avoid expensive malloc calls, as the benchmark below shows.

BenchmarkAllocateHeap-8 20000000 62.3 ns/op 96 B/op 1 allocs/op
BenchmarkAllocateStack-8 100000000 11.6 ns/op 0 B/op 0 allocs/op

Naturally, passing by reference is faster than passing by value since the former requires copying only a pointer while the latter requires copying values. The difference is negligible with the struct used in these benchmarks, though it largely depends on what has to be copied. Keep in mind there are also likely some compiler optimizations being performed in this synthetic benchmark.

BenchmarkPassByReference-8 1000000000 2.35 ns/op
BenchmarkPassByValue-8 200000000 6.36 ns/op

However, the larger issue with heap allocation is garbage collection. If we’re creating lots of short-lived objects, we’ll cause the GC to thrash. Object pooling becomes quite important in these scenarios. In this benchmark, we compare allocating structs in 10 concurrent goroutines on the heap vs. using a sync.Pool for the same purpose. Pooling yields a 5x improvement.

BenchmarkConcurrentStructAllocate-8 5000000 337 ns/op
BenchmarkConcurrentStructPool-8 20000000 65.5 ns/op
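
The pattern itself is straightforward—a sketch, with an illustrative event type:

package pool

import "sync"

type event struct {
    id      int64
    payload [256]byte
}

// Goroutines take an event from the pool and return it when finished, so the
// allocator and GC see far fewer short-lived objects.
var eventPool = sync.Pool{
    New: func() interface{} { return new(event) },
}

func handle(id int64) {
    e := eventPool.Get().(*event)
    e.id = id
    // ... do work with e ...
    eventPool.Put(e)
}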

It’s important to point out that Go’s sync.Pool is drained during garbage collection. The purpose of sync.Pool is to reuse memory between garbage collections. One can maintain their own free list of objects to hold onto memory across garbage collection cycles, though this arguably subverts the purpose of a garbage collector. Go’s pprof tool is extremely useful for profiling memory usage. Use it before blindly making memory optimizations.

False Sharing

When performance really matters, you have to start thinking at the hardware level. Formula One driver Jackie Stewart is famous for once saying, “You don’t have to be an engineer to be a racing driver, but you do have to have mechanical sympathy.” Having a deep understanding of the inner workings of a car makes you a better driver. Likewise, having an understanding of how a computer actually works makes you a better programmer. For example, how is memory laid out? How do CPU caches work? How do hard disks work?

Memory bandwidth continues to be a limited resource in modern CPU architectures, so caching becomes extremely important to prevent performance bottlenecks. Modern multiprocessor CPUs cache data in small lines, typically 64 bytes in size, to avoid expensive trips to main memory. When one core writes to a piece of memory, the cache line holding it is invalidated in the other cores’ caches to maintain coherency, and a subsequent read on those cores requires fetching the line again. False sharing is what happens when multiple processors access independent data that happen to sit in the same cache line: they pay this invalidation cost even though they never touch the same bytes.

Imagine a struct in Go and how it’s laid out in memory. Let’s use the ring buffer from earlier as an example. Here’s what that struct might normally look like:
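
// A simplified sketch, the same shape as the ring buffer sketched earlier.
type RingBuffer struct {
    queue   uint64 // producer position
    dequeue uint64 // consumer position
    mask    uint64
    items   []interface{}
}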

The queue and dequeue fields are used to determine producer and consumer positions, respectively. These fields, which are both eight bytes in size, are concurrently accessed and modified by multiple threads to add and remove items from the queue. Since these two fields are positioned contiguously in memory and occupy only 16 bytes, it’s likely they will be stored in a single CPU cache line. Therefore, writing to one will result in evicting the other, meaning a subsequent read will stall. In more concrete terms, adding or removing things from the ring buffer will cause subsequent operations to be slower and will result in lots of thrashing of the CPU cache.

We can modify the struct by adding padding between fields. Each padding is the width of a single CPU cache line to guarantee the fields end up in different lines. What we end up with is the following:
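
// The same struct with a full cache line of padding between the hot fields,
// assuming 64-byte cache lines.
type RingBuffer struct {
    _padding0 [8]uint64
    queue     uint64 // producer position
    _padding1 [8]uint64
    dequeue   uint64 // consumer position
    _padding2 [8]uint64
    mask      uint64
    _padding3 [8]uint64
    items     []interface{}
}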

How big a difference does padding out CPU cache lines actually make? As with anything, it depends. It depends on the amount of multiprocessing. It depends on the amount of contention. It depends on memory layout. There are many factors to consider, but we should always use data to back our decisions. We can benchmark operations on the ring buffer with and without padding to see what the difference is in practice.

First, we benchmark a single producer and single consumer, each running in a goroutine. With this test, the improvement between padded and unpadded is fairly small, about 15%.

BenchmarkRingBufferSPSC-8 10000000 156 ns/op
BenchmarkRingBufferPaddedSPSC-8 10000000 132 ns/op

However, when we have multiple producers and multiple consumers, say 100 each, the difference becomes slightly more pronounced. In this case, the padded version is about 36% faster.

BenchmarkRingBufferMPMC-8 100000 27763 ns/op
BenchmarkRingBufferPaddedMPMC-8 100000 17860 ns/op

False sharing is a very real problem. Depending on the amount of concurrency and memory contention, it can be worth introducing padding to help alleviate its effects. These numbers might seem negligible, but they start to add up, particularly in situations where every clock cycle counts.

Lock-Freedom

Lock-free data structures are important for fully utilizing multiple cores. Considering Go is targeted at highly concurrent use cases, it doesn’t offer much in the way of lock-freedom. The encouragement seems to be largely directed towards channels and, to a lesser extent, mutexes.

That said, the standard library does offer the usual low-level memory primitives with the atomic package. Compare-and-swap, atomic pointer access—it’s all there. However, use of the atomic package is heavily discouraged:

We generally don’t want sync/atomic to be used at all…Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations…If we had thought of internal packages when we added the sync/atomic package, perhaps we would have used that. Now we can’t remove the package because of the Go 1 guarantee.

How hard can lock-free really be though? Just rub some CAS on it and call it a day, right? After a sufficient amount of hubris, I’ve come to learn that it’s definitely a double-edged sword. Lock-free code can get complicated in a hurry. The atomic and unsafe packages are not easy to use, at least not at first. The latter gets its name for a reason. Tread lightly—this is dangerous territory. Even more so, writing lock-free algorithms can be tricky and error-prone. Simple lock-free data structures, like the ring buffer, are pretty manageable, but anything more than that starts to get hairy.

The Ctrie, which I wrote about in detail, was my foray into the world of lock-free data structures beyond your standard fare of queues and lists. Though the theory is reasonably understandable, the implementation is thoroughly complex. In fact, the complexity largely stems from the lack of a native double compare-and-swap, which is needed to atomically compare indirection nodes (to detect mutations on the tree) and node generations (to detect snapshots taken of the tree). Since no hardware provides such an operation, it has to be simulated using standard primitives (and it can).

The first Ctrie implementation was actually horribly broken, and not even because I was using Go’s synchronization primitives incorrectly. Instead, I had made an incorrect assumption about the language. Each node in a Ctrie has a generation associated with it. When a snapshot is taken of the tree, its root node is copied to a new generation. As nodes in the tree are accessed, they are lazily copied to the new generation (à la persistent data structures), allowing for constant-time snapshotting. To avoid integer overflow, we use objects allocated on the heap to demarcate generations. In Go, this is done using an empty struct. In Java, two newly constructed Objects are not equivalent when compared since their memory addresses will be different. I made a blind assumption that the same was true in Go, when in fact, it’s not. Literally the last paragraph of the Go language specification reads:

A struct or array type has size zero if it contains no fields (or elements, respectively) that have a size greater than zero. Two distinct zero-size variables may have the same address in memory.

Oops. The result was that two different generations were considered equivalent, so the double compare-and-swap always succeeded. This allowed snapshots to potentially put the tree in an inconsistent state. That was a fun bug to track down. Debugging highly concurrent, lock-free code is hell. If you don’t get it right the first time, you’ll end up sinking a ton of time into fixing it, but only after some really subtle bugs crop up. And it’s unlikely you get it right the first time. You win this time, Ian Lance Taylor.
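
A minimal illustration of the pitfall (the generation type is just a stand-in):

package main

import "fmt"

type generation struct{}

func main() {
    a, b := &generation{}, &generation{}
    // The spec permits distinct zero-size variables to share an address, so
    // this may print true—pointer identity can't be used to tell generations
    // apart.
    fmt.Println(a == b)
}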

But wait! Obviously there’s some payoff with complicated lock-free algorithms or why else would one subject themselves to this? With the Ctrie, lookup performance is comparable to a synchronized map or a concurrent skip list. Inserts are more expensive due to the increased indirection. The real benefit of the Ctrie is its scalability in terms of memory consumption, which, unlike most hash tables, is always a function of the number of keys currently in the tree. The other advantage is its ability to perform constant-time, linearizable snapshots. We can compare performing a “snapshot” on a synchronized map concurrently in 100 different goroutines with the same test using a Ctrie:

BenchmarkConcurrentSnapshotMap-8 1000 9941784 ns/op
BenchmarkConcurrentSnapshotCtrie-8 20000 90412 ns/op

Depending on access patterns, lock-free data structures can offer better performance in multithreaded systems. For example, the NATS message bus uses a synchronized map-based structure to perform subscription matching. When compared with a Ctrie-inspired, lock-free structure, throughput scales a lot better. The blue line is the lock-based data structure, while the red line is the lock-free implementation.

[Figure: subscription-matching throughput, lock-based structure (blue) vs. lock-free structure (red)]

Avoiding locks can be a boon depending on the situation. The advantage was apparent when comparing the ring buffer to the channel. Nonetheless, it’s important to weigh any benefit against the added complexity of the code. In fact, sometimes lock-freedom doesn’t provide any tangible benefit at all!

A Note on Optimization

As we’ve seen throughout this post, performance optimization almost always comes with a cost. Identifying and understanding optimizations themselves is just the first step. What’s more important is understanding when and where to apply them. The famous quote by C. A. R. Hoare, popularized by Donald Knuth, has become a longtime adage of programmers:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

The point of this quote is not to eliminate optimization altogether but to learn how to strike a balance between speeds—speed of an algorithm, speed of delivery, speed of maintenance, speed of a system. It’s a highly subjective topic, and there is no single rule of thumb. Is premature optimization the root of all evil? Should I just make it work, then make it fast? Does it need to be fast at all? These are not binary decisions. For example, sometimes making it work then making it fast is impossible if there is a fundamental problem in the design.

However, I will say focus on optimizing along the critical path and outward from that only as necessary. The further you get from that critical path, the more likely your return on investment is to diminish and the more time you end up wasting. It’s important to identify what adequate performance is. Do not spend time going beyond that point. This is an area where data-driven decisions are key—be empirical, not impulsive. More important, be practical. There’s no use shaving nanoseconds off of an operation if it just doesn’t matter. There is more to going fast than fast code.

Wrapping Up

If you’ve made it this far, congratulations, there might be something wrong with you. We’ve learned that there are really two kinds of fast in software—delivery and performance.  Customers want the first, developers want the second, and CTOs want both. The first is by far the most important, at least when you’re trying to go to market. The second is something you need to plan for and iterate on. Both are usually at odds with each other.

Perhaps more interestingly, we looked at a few ways we can eke out that extra bit of performance in Go and make it viable for low-latency systems. The language is designed to be simple, but that simplicity can sometimes come at a price. Like the trade-off between the two fasts, there is a similar trade-off between code lifecycle and code performance. Speed comes at the cost of simplicity, at the cost of development time, and at the cost of continued maintenance. Choose wisely.

Go Is Unapologetically Flawed, Here’s Why We Use It

Go is decidedly polarizing. While many are touting their transition to Go, it has become equally fashionable to criticize and mock the language. As Bjarne Stroustrup so eloquently put it, “There are only two kinds of programming languages: those people always bitch about and those nobody uses.” This adage couldn’t be more true. I apologize in advance for what appears to be just another in a long line of diatribes. I’m not really sorry, though.

I normally don’t advocate promoting or condemning a particular programming language or pontificate on why it is or isn’t used within an organization. They’re just tools for a job.


Today I’m going to be a hypocrite. The truth is we should care about what language and technologies we use to build and standardize on, but those decisions should be local to an organization. We shouldn’t choose a technology because it worked for someone else. Chances are they had a very different problem, a different set of requirements, and a different engineering culture. There are so many factors that go into “success”—technology is probably the least impactful. Someone else’s success doesn’t translate to your success. It’s not the technology that makes or breaks us, it’s how the technology is appropriated, among many other contributing factors.

Now that I’ve prefaced why you shouldn’t choose a technology because it’s trendy, I’m going to talk about why we use Go where I work—yes, that’s meant to be ironic. However, I’m also going to describe why the language is essentially flawed. As I’ve alluded to, there are countless blog posts and articles which describe the shortcomings of Go. On the one hand, I’m apprehensive this doesn’t contribute anything meaningful to the dialogue. On the other hand, I feel the dialogue is important and, when framed in the right context, constructive.

Simplicity Through Indignity

Go is refreshingly simple. It’s what drew me to the language in the first place, and I suspect others feel the same way. There’s a popular quote from Rob Pike which I think is worth reiterating:

The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt.

Granted, it’s taken out of context, but on the surface this kind of does sound like Go is a disservice to intelligent programmers. However, there is value in pursuing a simple, yet powerful, lingua franca of backend systems. Any engineer, regardless of experience, can dive into virtually any codebase and quickly understand how something works. Unfortunately, the notion of programmers not understanding a “brilliant language” is a philosophy carried throughout Go, and it hinders productivity more than it helps.

We use Go because it’s boring. Previously, we worked almost exclusively with Python, and after a certain point, it becomes a nightmare. You can bend Python to your will. You can hack it, you can monkey patch it, and you can write remarkably expressive, terse code. It’s also remarkably difficult to maintain, and it’s slow. I think this is characteristic of the trade-off between statically and dynamically typed languages in general. Dynamic typing allows you to quickly build and iterate but lacks the static-analysis tooling needed for larger codebases and the performance characteristics required for more real-time systems. In my mind, the curve tends to look something like this:

[Figure: static vs. dynamic typing trade-off curve]

Of course, this isn’t particular to Go or Python. As highlighted above, there are a lot of questions you must ask when considering such a transition. Like I mentioned, languages are tools for a job. One might argue, then, why would a company settle on a single language? Use the right tool for the job! This is true in principle, but the reality is there are other factors to consider, the largest of which is momentum. When you commit to a language, you produce reusable libraries, APIs, tooling, and knowledge. If you “use the right tool for the job,” you end up pulling yourself in different directions and throwing away those things. If you’re Google scale, this is less of an issue. Most organizations aren’t Google scale. It’s a delicate balance when choosing a technology.

Go makes it easy to write code that is understandable. There’s no “magic” like many enterprise Java frameworks and none of the cute tricks you’ll find in most Python or Ruby codebases. The code is verbose but readable, unsophisticated but intelligible, tedious but predictable. But the pendulum swings too far. So far, in fact, that it sacrifices one of software development’s most sacred doctrines, Don’t Repeat Yourself, and it does so unapologetically.

The Untype System

To put it mildly, Go’s type system is impaired. It does not lend itself to writing quality, maintainable code at a large scale, which seems to be in stark contrast to the language’s ambitions. The type system is noble in theory, but in practice it falls apart rather quickly. Without generics, programmers are forced to either copy and paste code for each type, rely on code generation which is often clumsy and laborious, or subvert the type system altogether through reflection. Passing around interface{} harks back to the Java-pre-generics days of doing the same with Object. The code gets downright dopey if you want to write a reusable library.

The argument there, I suppose, is to rely on interfaces to specify the behavior needed in a function. In passing, this sounds reasonable, but again, it quickly falls apart for even the most trivial situations. Further, you can’t add methods to types from a different (or standard library) package. Instead, you must effectively alias or wrap the type with a new type, resulting in more boilerplate and code that generally takes longer to grok. You start to realize that Go isn’t actually all that great at what it sets out to accomplish in terms of fostering maintainable, large-scale codebases—boilerplate and code duplication abound. It’s 2015, why in the world are we still writing code like this:
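
// A stand-in for the original snippet: a trivial helper that has to be
// restated for every numeric type it should support.
func MaxInt64(a, b int64) int64 {
    if a > b {
        return a
    }
    return b
}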

Now repeat for uint32, uint64, int32, etc. In any other modern programming language, this would get you laughed out of a code review. In Go, no one seems to bat an eye, and the alternatives aren’t much better.

Interfaces in Go are interesting because they are implicitly implemented. There are advantages, such as implementing mocks and generally dealing with code you don’t own. They also can cause some subtle problems like accidental implementation. Just because a type matches the signature of an interface doesn’t mean it was intended to implement its contract. Not to mention the confusion caused by storing nil in an interface:
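
// A minimal illustration: a typed nil stored in an error interface makes the
// nil check pass when you least expect it.
package main

import "fmt"

type myError struct{}

func (e *myError) Error() string { return "boom" }

// doWork reports success by returning a nil *myError.
func doWork() *myError {
    return nil
}

func main() {
    var err error = doWork() // the interface now holds a typed nil: (*myError)(nil)
    if err != nil {
        // This branch runs even though doWork "succeeded," because an
        // interface holding a nil pointer is itself non-nil.
        fmt.Println("unexpected error:", err)
    }
}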

This is a common source of confusion. The basic answer is to never store something in an interface if you don’t expect the methods to be called on it. The language may allow it, but that violates the semantics of the interface. To expound, a nil value should usually not be stored in an interface unless it is of a type that has explicitly handled that case in its pointer-valued methods and has no value-receiver methods.

Go is designed to be simple, but that behavior isn’t simple to me. I know it’s tripped up many others. Another lurking danger to newcomers is the behavior around variable declarations and shadowing. It can cause some nasty bugs if you’re not careful.
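
Here’s a contrived example of the shadowing gotcha (the parsing step is illustrative):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    var n int
    var err error
    if s := "42"; s != "" {
        n, err := strconv.Atoi(s) // := declares new n and err, shadowing the outer ones
        if err != nil {
            fmt.Println("parse failed:", err)
        }
        _ = n
    }
    fmt.Println(n, err) // prints "0 <nil>": the outer variables were never assigned
}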

Rules Are Meant to Be Broken, Just Not by You

Python relies on a notion of “we’re all consenting adults here.” This is great and all, but it starts to break down when you have to scale your organization. Go takes a very different approach which aligns itself with large development teams. Great! But it’s taken to the extreme, and the language seems to break many of its own rules, which can be both confusing and frustrating.

Go sort of supports generic functions as evidenced by its built-ins. You just can’t implement your own. Go sort of supports generic types as evidenced by slices, maps, and channels. You just can’t implement your own. Go sort of supports function overloading as evidenced again by its built-ins. You just can’t implement your own. Go sort of supports exceptions as evidenced by panic and recover. You just can’t implement your own. Go sort of supports iterators as evidenced by ranging on slices, maps, and channels. You just can’t implement your own.

There are other peculiar idiosyncrasies. Error handling is generally done by returning error values. This is fine, and I can certainly see the motivation coming from the abomination of C++ exceptions, but there are cases where Go doesn’t follow its own rule. For example, map lookups return two values: the value itself (or zero-value/nil if it doesn’t exist) and a boolean indicating if the key was in the map. Interestingly, we can choose to ignore the boolean value altogether—a convenience reserved for a few blessed operations built into the language. Type assertions and channel receives have equally curious behavior.
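
A quick tour of the comma-ok forms in question:

package main

import "fmt"

func main() {
    m := map[string]int{"a": 1}

    v1 := m["b"]            // missing key: zero value, no panic, no error
    v2, ok := m["b"]        // comma-ok form: ok reports whether the key was present
    fmt.Println(v1, v2, ok) // 0 0 false

    var x interface{} = "hello"
    n, ok := x.(int)   // comma-ok form: no panic even though x holds a string
    fmt.Println(n, ok) // 0 false
    // n := x.(int) would panic at runtime instead.

    ch := make(chan int)
    close(ch)
    r, ok := <-ch      // comma-ok on a closed channel: zero value and false
    fmt.Println(r, ok) // 0 false
}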

Another idiosyncrasy is adding an item to a channel which is closed. Instead of returning an error, or a boolean, or whatever, it panics. Perhaps because it’s considered a programmer error? I’m not sure. Either way, these behaviors seem inconsistent to me. I often find myself asking what the “idiomatic” approach would be when designing an API. Go could really use proper algebraic data types.

One of Go’s philosophies is “Share memory by communicating; don’t communicate by sharing memory.” This is another rule the standard library seems to break often. There are roughly 60 channels created in the standard library, excluding tests. If you look through the code, you’ll see that mutexes tend to be preferred and often perform better—more on this in a moment.

By the same token, Go actively discourages the use of the sync/atomic and unsafe packages. In fact, there have been indications sync/atomic would be removed if it weren’t for backward-compatibility requirements:

We want sync to be clearly documented and used when appropriate. We generally don’t want sync/atomic to be used at all…Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations…If we had thought of internal packages when we added the sync/atomic package, perhaps we would have used that. Now we can’t remove the package because of the Go 1 guarantee.

Frankly, I’m not sure how you write performant data structures and algorithms without those packages. Performance is relative of course, but you need these primitives if you want to write anything which is lock-free. The irony is once you start writing highly concurrent things, which Go is generally considered good at, mutexes and channels tend to fall short performance-wise.

In actuality, to write high-performance Go, you end up throwing away many of the language’s niceties. Defers add overhead, interface indirection is expensive (granted, this is not unique to Go), and channels are, generally speaking, on the slowish side.

For being one of Go’s hallmarks, channels are a bit disappointing. As I already mentioned, the behavior of panicking on puts to a closed channel is problematic. What about cases where we have producers blocked on a put to a channel and another goroutine calls close on it? They panic. Other annoyances include not being able to peek into the channel or get more than one item from it, common operations on most blocking queues. I can live with that, but what’s harder to stomach are the performance implications, which I hinted at earlier. For this, I turn to my colleague and our resident performance nut, Dustin Hiatt:

Rarely do the Golang devs discuss channel performance, although rumblings were heard last time I was at Gophercon about not using defers or channels. You see, when Rob Pike makes the claim that you can use channels instead of locks, he’s not being entirely honest. Behind the scenes, channels are using locks to serialize access and provide threadsafety. So by using channels to synchronize access to memory, you are, in fact, using locks; locks wrapped in a threadsafe queue. So how do Go’s fancy locks compare to just using mutex’s from their standard library “sync” package? The following numbers were obtained by using Go’s builtin benchmarking functionality to serially call Put on a single set of their respective types.

BenchmarkSimpleSet-8 3000000 391 ns/op
BenchmarkSimpleChannelSet-8 1000000 1699 ns/op

This is with a buffered channel, what happens if we use unbuffered?

BenchmarkSimpleChannelSet-8  1000000          2252 ns/op

Yikes, with light or no multithreading, putting using the mutex is quite a bit faster (go version go1.4 linux/amd64). How well does it do in a multithreaded environment. The following numbers were obtained by inserting the same number of items, but doing so in 4 separate Goroutines to test how well channels do under contention.

BenchmarkSimpleSet-8 2000000 645 ns/op
BenchmarkChannelSimpleSet-8 2000000 913 ns/op
BenchmarkChannelSimpleSet-8 2000000 901 ns/op

Better, but the mutex is still almost 30% faster. Clearly, some of the channel magic is costing us here, and that’s without the extra mental overhead to prevent memory leaks. Golang felt the same way, I think, and that’s why in their standard libraries that get benchmarked, like “net/http,” you’ll almost never find channels, always mutexes.

Clearly, channels are not particularly great for workload throughput, and you’re typically better off using a lock-free ring buffer or even a synchronized queue. Channels as a unit of composition tend to fall short as well. Instead, they are better suited as a coordination pattern, a mechanism for signaling and timing-related code. Ultimately, you must use channels judiciously if you are sensitive to performance.

There are a lot of things in Go that sound great in theory and look neat in demos, but then you start writing real systems and go, “oh wait, that doesn’t actually work.” Once again, channels are a good example of this. The range keyword, which allows you to iterate over a data structure, is reserved for built-in types like arrays, slices, strings, maps, and channels. At first glance, it appears channels provide an elegant way to build your own iterators:
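
// A sketch of the idea with a hypothetical Set type.
package set

type Set struct {
    items map[string]struct{}
}

// Iter returns a channel that yields each item in the set. Consumers simply
// range over it:
//
//    for item := range s.Iter() {
//        ...
//    }
func (s *Set) Iter() <-chan string {
    ch := make(chan string)
    go func() {
        for item := range s.items {
            ch <- item // blocks forever if the consumer stops ranging early
        }
        close(ch)
    }()
    return ch
}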

But upon closer inspection, we realize this approach is subtly broken. While it works, if we stop iterating, the loop adding items to the channel will block—the goroutine is leaked. Instead, we must push the onus onto the user to signal the iteration is finished. It’s far less elegant and prone to leaks if not used correctly—so much for channels and goroutines.

Goroutines are nice. They make it incredibly easy to spin off concurrent workers. They also make it incredibly easy to leak things. This shouldn’t be a problem for the intelligent programmer, but for Rob Pike’s beloved Googlers, they can be a double-edged sword.

Dependency Management in Practice

For being a language geared towards Google-sized projects, Go’s approach to managing dependencies is effectively nonexistent. For small projects with little-to-no dependencies, go get works great. But Go is a server language, and we typically have many dependencies which must be pinned to different versions. Go’s package structure and go get do not support this. Reproducible builds and dependency management continue to be a source of frustration for folks trying to build real software with it.

In fairness, dependency management is not an issue with the language per se, but to me, tooling is just as important as the language itself. Go doesn’t actually take an official stance on versioning:

“Go get” does not have any explicit concept of package versions. Versioning is a source of significant complexity, especially in large code bases, and we are unaware of any approach that works well at scale in a large enough variety of situations to be appropriate to force on all Go users. What “go get” and the larger Go toolchain do provide is isolation of packages with different import paths.

Fortunately, the tooling in this area is actively improving. I’m confident this problem can be solved in better ways, but the current state of the art will leave newcomers feeling uneasy.

A Community or a Carousel

Go has an increasingly vibrant community, but it’s profoundly stubborn. My biggest gripe is not with the language itself, but with the community’s seemingly us-versus-them mentality. You’re either with us or against us. It’s almost comical because it seems every criticism of the language, mine included, is prefixed with “I really like Go, but…” to ostensibly defuse the situation. Parts of the community can seem religious, almost cult-like. The mere mention of generics is now met with a hearty dismissal. It’s not the Go way.

The attitude of the decision making around the language is unfortunate, and I think Go could really take a page from Rust’s book with respect to its governance model. I agree entirely with the sentiment of “it is a poor craftsman who blames their tools,” but it is an even poorer craftsman who doesn’t choose the best tools at their disposal. I’m not partial to any of my tools. They’re a means to an end, but we should aim to improve them and make them more effective. Community should not breed complacency. With Go, I fear both are thriving.

Despite your hand wringing over the effrontery of Go’s designers to not include your prerequisite features, interest in Go is sky rocketing. Rather than finding new ways to hate a language for reasons that will not change, why not invest that time and join the growing number of programmers who are using the language to write real software today.

This is dangerous reasoning, and it hinders progress. Yes, programmers are using Go to write real software today. They were also writing real software with Java circa 2004. I write Go every day for a living. I work with smart people who do the same. Most of my open-source projects on GitHub are written in Go. I have invested countless hours into the language, so I feel qualified to point out its shortcomings. They are not irreparable, but let’s not just brush them off as people toying with Go and “finding ways to hate it”—it’s insulting and unproductive.

The Good Parts

Alas, Go is not beyond reproach. But at the same time, the language gets a lot of things right. The advantages of a single, self-contained binary are real, and compilation is fast. Coming from C or C++, the compilation speed is a big deal. Cross-compilation allows you to target other platforms, and it’s getting even better with Go 1.5.

The garbage collector, while currently a pain point for performance-critical systems, is the focus of a lot of ongoing effort. Go 1.5 will bring about an improved garbage collector, and more enhancements—including generational techniques—are planned for the future. Compared to current cutting-edge garbage collectors like those in HotSpot, Go’s is still quite young—lots of room for improvement here.

Over the last couple of months, I dipped my toes back in Java. Along with C#, Java used to be my modus operandi. Going back gave me a newfound appreciation for Go’s composability. In Go, the language and libraries are designed to be composable, à la Unix. In Java, everyone brings their own walled garden of classes.

Java is really a ghastly language in retrospect. Even the simplest of tasks, like reading a file, require a wildly absurd amount of hoop-jumping. This is where Go’s simplicity nails it. Building a web application in Java generally requires an application server, which often puts you in J2EE-land. It’s not a place I recommend you visit. In contrast, building a web server in Go takes a couple lines of code using the standard library—no overhead whatsoever. I just wish Java shared some of its generics Kool-Aid. C# does generics even better, implementing them all the way down to the byte-code level without type erasure.
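
To make the web-server point concrete, here’s a complete HTTP server using nothing but the standard library:

package main

import (
    "fmt"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "hello")
    })
    http.ListenAndServe(":8080", nil)
}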

Beyond go get, Go’s toolchain is actually pretty good. Testing and benchmarking are built in, and the data-race detector is super handy for debugging race conditions in your myriad of goroutines. The gofmt command is brilliant—every language needs something like this—as are vet and godoc. Lastly, Go provides a solid set of profiling tools for analyzing memory, CPU utilization, and other runtime behavior. Sadly, CPU profiling doesn’t work on OSX due to a kernel bug.

Although channels and goroutines are not without their problems, Go is easily the best “concurrent” programming language I’ve used. Admittedly, I haven’t used Erlang, so I suspect that statement made some Erlangers groan. Combined with the select statement, channels allow you to solve some problems which would otherwise be solved in a much more crude manner.

Go fits into your stack as a language for backend services. With the work being done by Docker, CoreOS, HashiCorp, Google, and others, it clearly is becoming the language of Infrastructure as a Service, cloud orchestration, and DevOps as well. Go is not a replacement for C/C++ but a replacement for Java, Python, and the like—that much is clear.

Moving Forward

Ultimately, we use Go because it’s boring. We don’t use it because Google uses it. We don’t use it because it’s trendy. We use it because it’s no-frills and, hey, it usually gets the job done assuming you’ve found the right nail. But Go is still in its infancy and has a lot of room for growth and improvement.

I’m cautiously optimistic about Go’s future. I don’t consider myself a hater, I consider myself a hopeful. As it continues to gain a critical mass, I’m hopeful that the language will continue to improve, but fearful of its relentless dogma. Go needs to let go of this attitude of “you don’t need that” or “it’s too complicated” or “programmers won’t know how to use it.” It’s toxic. It’s not all that different from releasing a product, having your users request features, and then telling them they aren’t smart enough to use them. It’s not on your users, it’s on you to make the UX good.

A language can have considerable depth while still retaining its simplicity. I wish this were the ideal Go embraced, not one of negativity, of pessimism, of “no.” The question is not how can we protect developers from themselves, it’s how can we make them more productive? How can we enable them to solve problems? But just because people are solving problems with Go today does not mean we can’t do better. There is always room for improvement. There is never room for complacency.

My thanks to Dustin Hiatt for reviewing this and his efforts in benchmarking and profiling various parts of the Go runtime. It’s largely Dustin’s work that has helped pave the way for building performance-critical systems in Go.

Fast, Scalable Networking in Go with Mangos

In the past, I’ve looked at nanomsg and why it’s a formidable alternative to the well-regarded ZeroMQ. Like ZeroMQ, nanomsg is a native library which markets itself as a way to build fast and scalable networking layers. I won’t go into detail on how nanomsg accomplishes this since my analysis of it already covers that fairly extensively, but instead I want to talk about a Go implementation of the protocol called Mangos. ((Full disclosure: I am a contributor on the Mangos project, but only because I was a user first!)) If you’re not familiar with nanomsg or Scalability Protocols, I recommend reading my overview of those first.

nanomsg is a shared library written in C. This, combined with its zero-copy API, makes it an extremely low-latency transport layer. While there are a lot of client bindings which allow you to use nanomsg from other languages, dealing with shared libraries can often be a pain—not to mention it complicates deployment.

More and more companies are starting to use Go for backend development because of its speed and concurrency primitives. It’s really good at building server components that scale. Go obviously provides the APIs needed for socket networking, but building a reliable, scalable distributed system with these primitives alone can be somewhat onerous. Solutions like nanomsg’s Scalability Protocols and ZeroMQ attempt to make this much easier by providing useful communication patterns and by taking care of other messaging concerns like queueing.

Naturally, there are Go bindings for nanomsg and ZeroMQ, but like I said, dealing with shared libraries can be fraught with peril. In Go (and often other languages), we tend to avoid loading native libraries if we can. It’s much easier to reason about, debug, and deploy a single binary than multiple. Fortunately, there’s a really nice implementation of nanomsg’s Scalability Protocols in pure Go called Mangos by Garrett D’Amore of illumos fame.

Mangos offers an idiomatic Go implementation and interface which affords us the same messaging patterns that nanomsg provides while maintaining compatibility. Pub/Sub, Pair, Req/Rep, Pipeline, Bus, and Survey are all there. It also supports the same pluggable transport model, allowing additional transports to be added (and extended ((Mangos supports TLS with the TCP transport as an experimental extension.))) on top of the base TCP, IPC, and inproc ones. ((A nanomsg WebSocket transport is currently in the works.)) Mangos has been tested for interoperability with nanomsg using the nanocat command-line interface.

One of the advantages of using a language like C is that it’s not garbage collected. However, if you’re using Go with nanomsg, you’re already paying the cost of GC. Mangos makes use of object pools in order to reduce pressure on the garbage collector. We can’t turn Go’s GC off, but we can make an effort to minimize pauses. This is critical for high-throughput systems, and Mangos tends to perform quite comparably to nanomsg.
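
I won’t reproduce Mangos’s internals here, but the general object-pooling technique looks something like the following sketch using sync.Pool, which reuses buffers on hot paths instead of allocating a fresh slice per message. The names and buffer size are mine, not Mangos’s actual code.

    package main

    import (
        "fmt"
        "sync"
    )

    // bufPool hands out reusable 64KB buffers so hot paths avoid allocating
    // (and later garbage collecting) a fresh slice for every message.
    var bufPool = sync.Pool{
        New: func() interface{} {
            return make([]byte, 64*1024)
        },
    }

    func handleMessage(payload []byte) {
        // Grab a buffer from the pool instead of allocating one.
        buf := bufPool.Get().([]byte)
        defer bufPool.Put(buf) // return it for reuse when we're done

        n := copy(buf, payload)
        fmt.Printf("processed %d bytes\n", n)
    }

    func main() {
        handleMessage([]byte("hello"))
    }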

Mangos (and nanomsg) has a very familiar, socket-like API. To show what this looks like, the code below illustrates a simple example of how the Pub/Sub protocol is used to build a fan-out messaging system.
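
A minimal sketch of that pattern follows. Treat the import paths as placeholders, since the library has moved homes over the years, and the address as arbitrary.

    package main

    import (
        "fmt"
        "time"

        "github.com/go-mangos/mangos"
        "github.com/go-mangos/mangos/protocol/pub"
        "github.com/go-mangos/mangos/protocol/sub"
        "github.com/go-mangos/mangos/transport/tcp"
    )

    const addr = "tcp://127.0.0.1:40899"

    func publisher() {
        sock, err := pub.NewSocket()
        if err != nil {
            panic(err)
        }
        sock.AddTransport(tcp.NewTransport())
        if err := sock.Listen(addr); err != nil {
            panic(err)
        }
        for {
            // Every message is fanned out to all connected subscribers.
            if err := sock.Send([]byte("hello, subscribers")); err != nil {
                panic(err)
            }
            time.Sleep(time.Second)
        }
    }

    func subscriber(name string) {
        sock, err := sub.NewSocket()
        if err != nil {
            panic(err)
        }
        sock.AddTransport(tcp.NewTransport())
        if err := sock.Dial(addr); err != nil {
            panic(err)
        }
        // Subscribe to everything by using an empty topic prefix.
        sock.SetOption(mangos.OptionSubscribe, []byte(""))
        for {
            msg, err := sock.Recv()
            if err != nil {
                panic(err)
            }
            fmt.Printf("%s received: %s\n", name, msg)
        }
    }

    func main() {
        go publisher()
        go subscriber("sub1")
        subscriber("sub2")
    }

Every subscriber dialed into the publisher receives every message, which is exactly the fan-out behavior described above.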

My message queue test framework, Flotilla, uses the Req/Rep protocol to allow clients to send requests to distributed daemon processes, which handle them and respond. While this is a very simple use case where you could just as easily get away with raw TCP sockets, there are more advanced cases where Scalability Protocols make sense. We also get the added advantage of transport abstraction, so we’re not strictly tied to TCP sockets.

I’ve been building a distributed messaging system using Mangos as a means of federated communication. Pub/Sub enables a fan-out, interest-based broadcast and Bus facilitates many-to-many messaging. Both of these are exceptionally useful for connecting disparate systems. Mangos also supports an experimental new protocol called Star. This pattern is like Bus, but when a message is received by an immediate peer, it’s propagated to all other members of the topology.

My favorite Scalability Protocol is Survey. As I discussed in my nanomsg overview, there are a lot of really interesting applications of this. Survey allows a process to query the state of multiple peers in one shot. It’s similar to Pub/Sub in that the surveyor publishes a single message which is received by all the respondents (although there are no topic subscriptions). The respondents then send a message back, and the surveyor collects these responses. We can also enforce a deadline on the respondent replies, which makes Survey particularly useful for service discovery.
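
A rough sketch of the surveyor side looks like the following, with the same caveats as above regarding import paths; respondents use the corresponding respondent protocol to reply.

    package main

    import (
        "fmt"
        "time"

        "github.com/go-mangos/mangos"
        "github.com/go-mangos/mangos/protocol/surveyor"
        "github.com/go-mangos/mangos/transport/tcp"
    )

    func main() {
        sock, err := surveyor.NewSocket()
        if err != nil {
            panic(err)
        }
        sock.AddTransport(tcp.NewTransport())
        if err := sock.Listen("tcp://127.0.0.1:40900"); err != nil {
            panic(err)
        }

        // Respondents have one second to reply before the survey closes.
        sock.SetOption(mangos.OptionSurveyTime, time.Second)

        // Broadcast the survey to every connected respondent.
        if err := sock.Send([]byte("are you alive?")); err != nil {
            panic(err)
        }

        // Collect responses until the deadline; Recv returns an error once
        // the survey window has expired.
        for {
            resp, err := sock.Recv()
            if err != nil {
                break
            }
            fmt.Printf("response: %s\n", resp)
        }
    }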

With my messaging system, I’ve used Survey to implement a heartbeat protocol. When a broker spins up, it begins broadcasting a heartbeat using a Survey socket. New brokers can connect to existing ones, and they reply to the heartbeat which allows brokers to “discover” each other. If a heartbeat isn’t received before the deadline, the peer is removed. Mangos also handles reconnects, so if a broker goes offline and comes back up, peers will automatically reconnect.

To summarize, if you’re building distributed systems in Go, consider taking a look at Mangos. You can certainly roll your own messaging layer with raw sockets, but you’re going to end up writing a lot of logic for a robust system. Mangos, and nanomsg in general, gives you the right abstraction to quickly build systems that scale and are fast.

Iris Decentralized Cloud Messaging

A couple weeks ago, I published a rather extensive analysis of numerous message queues, both brokered and brokerless. Brokerless messaging is really just another name for peer-to-peer communication. As we saw, the difference in message latency and throughput between peer-to-peer systems and brokered ones is several orders of magnitude. ZeroMQ and nanomsg are able to reliably transmit millions of messages per second at the expense of guaranteed delivery.

Peer-to-peer messaging is decentralized, scalable, and fast, but it brings with it an inherent complexity. There is a dichotomy between how brokerless messaging is conceptualized and how distributed systems are actually built. Distributed systems are composed of services like applications, databases, caches, etc. Services are composed of instances or nodes—individually addressable hosts, either physical or virtual. The key observation is that, conceptually, the unit of interaction lies at the service level, not the instance level. We don’t care which database server we interact with, we just want to talk to a database server (or perhaps several). We’re concerned with logical groups of nodes.

While traditional socket-queuing systems like ZeroMQ solve the problem of scaling, they bring about a certain coupling between components. System designers are forced to build applications which communicate with nodes, not services. We can introduce load balancers like HAProxy, but we’re still addressing specific locations while creating potential single points of failure. With lightweight VMs and the pervasiveness of elastic clouds, IP addresses are becoming less and less static—they come and go. The canonical way of dealing with this problem is to use distributed coordination and service discovery via ZooKeeper, et al., but this introduces more configuration, more moving parts, and more headaches.

The reality is that distributed systems are not built with the instance as the smallest unit of composition in mind, they’re built with services in mind. As discussed earlier, a service is simply a logical grouping of nodes. This abstraction is what we attempt to mimic with things like etcd, ZooKeeper and HAProxy. These assemblies are proven, but there are alternative solutions that offer zero configuration, minimal network management, and overall less complexity. One such solution that I want to explore is a distributed messaging framework called Iris.

Decentralized Messaging with Iris

Iris is posited as a decentralized approach to backend messaging middleware. It looks to address several of the fundamental issues with traditional brokerless systems, like tight coupling and security.

In order to avoid the problem of addressing instances, Iris considers clusters to be the smallest logical blocks of which systems are composed. A cluster is a collection of zero or more nodes which are responsible for a certain service sub-task. Clusters are then assembled into services such that they can communicate with each other without any regard as to which instance is servicing their requests or where it’s located. Lastly, services are composed into federations, which allow them to communicate across different clouds.

This form of composition allows Iris to use semantic or logical addressing instead of the standard physical addressing. Nodes specify the name of the cluster they wish to participate in, while Iris handles the intricacies of routing and balancing. For example, you might have three database servers which belong to a single cluster called “databases.” The cluster is reached by its name and requests are distributed across the three nodes. Iris also takes care of service discovery, detecting new clusters as they are created on the same cloud.
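
To give a feel for what semantic addressing looks like from the client’s perspective, here is a rough sketch against the iris-go binding. The import path, relay port, and exact function signatures are my assumptions based on the project’s documentation, so check them against the actual package.

    package main

    import (
        "fmt"
        "time"

        // Assumed import path for the Iris Go binding.
        iris "gopkg.in/project-iris/iris-go.v1"
    )

    func main() {
        // Connect to the local Iris relay (the relay port is assumed).
        conn, err := iris.Connect(55555)
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Address the "databases" cluster by name; Iris load-balances the
        // request across whichever nodes currently make up that cluster.
        // The payload is purely illustrative.
        reply, err := conn.Request("databases", []byte("ping"), time.Second)
        if err != nil {
            panic(err)
        }
        fmt.Printf("reply: %s\n", reply)
    }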

With libraries like ZeroMQ, security tends to be an afterthought. Iris has been built from the ground-up with security in mind, and it provides a security model that is simple and fast.

Iris uses a relaxed security model that provides perfect secrecy whilst at the same time requiring effectively zero configuration. This is achieved through the observation that if a node of a service is compromised, the whole system is considered undermined. Hence, the unit of security is a service – opposed to individual instances – where any successfully authenticated node is trusted by all. This enables full data protection whilst maintaining the loosely coupled nature of the system.

In practice, what this means is that each cluster uses a single private key. This encryption scheme not only makes deployment trivial, it minimizes the effect security has on speed.

Like ZeroMQ and nanomsg, Iris offers a few different messaging patterns. It provides the standard request-reply and publish-subscribe schemes, but it’s important to remember that the smallest addressable unit is the cluster, not the node. As such, requests are targeted at a cluster and subsequently relayed on to a member in a load-balanced fashion. Publish-subscribe, on the other hand, is not targeted at a single cluster. It allows members of any cluster to subscribe and publish to a topic.

Iris also implements two patterns called “broadcast” and “tunnel.” While request-reply forwards a message to one member of a cluster, broadcast forwards it to all members. The caveat is that there is no way to listen for responses to a broadcast.

Tunnel is designed to address the problem of stateful or streaming transactions where a communication between two endpoints may consist of multiple data exchanges which need to occur as an atomic operation. It provides the guarantee of in-order and throttled message delivery by establishing a channel between a client and a node.

Performance Characteristics

According to its author, Iris is still in a “feature phase” and hasn’t been optimized for speed. Since it’s written in Go, I’ve compared its pub/sub benchmark performance with that of other Go messaging libraries, NATS and NSQ. As before, these benchmarks shouldn’t be taken as gospel; the code is available here, and pull requests are welcome.

Out of the box, Iris is comparable to NSQ on the sending side and roughly 4x slower on the receiving side.

Conclusion

Brokerless systems like ZeroMQ and nanomsg offer considerably higher throughput and less latency than classical message-oriented middleware but require greater orchestration of network topologies. They offer high scalability but can lead to tighter coupling between components. Traditional brokered message queues, like those of the AMQP variety, tend to be slower while providing more guarantees and reduced coupling. However, they are also more prone to problems of scale, such as availability and partitioning.

In terms of its qualities, Iris appears to be a reasonable compromise between the decentralized nature of brokerless systems and the minimal configuration and management of brokered ones. Its intrinsic value lies in its ability to hide the complexities of the underlying infrastructure behind distributed systems, letting us build large-scale systems the way we conceptualize and reason about them—with services as the building blocks, not instances.