Building a Distributed Log from Scratch, Part 1: Storage Mechanics

The log is a totally-ordered, append-only data structure. It’s a powerful yet simple abstraction—a sequence of immutable events. It’s something that programmers have been using for a very long time, perhaps without even realizing it because it’s so simple. Whether it’s application logs, system logs, or access logs, logging is something every developer uses on a daily basis. Essentially, it’s a timestamp and an event, a when and a what, and typically appended to the end of a file. But when we generalize that pattern, we end up with something much more useful for a broad range of problems. It becomes more interesting when we look at the log not just as a system of record but a central piece in managing data and distributing it across the enterprise efficiently.

 

There are a number of implementations of this idea: Apache Kafka, Amazon Kinesis, NATS Streaming, Tank, and Apache Pulsar to name a few. We can probably credit Kafka with popularizing the idea.

I think there are at least three key priorities for the effectiveness of one of these types of systems: performance, high availability, and scalability. If it’s not fast enough, the data becomes decreasingly useful. If it’s not highly available, it means we can’t reliably get our data in or out. And if it’s not scalable, it won’t be able to meet the needs of many enterprises.

When we apply the traditional pub/sub semantics to this idea of a log, it becomes a very useful abstraction that applies to a lot of different problems.

In this series, we’re not going to spend much time discussing why the log is useful. Jay Kreps has already done the legwork on that with The Log: What every software engineer should know about real-time data’s unifying abstraction. There’s even a book on it. Instead, we will focus on what it takes to build something like this using Kafka and NATS Streaming as case studies of sorts—Kafka because of its ubiquity, NATS Streaming because it’s something with which I have personal experience. We’ll look at a few core components like leader election, data replication, log persistence, and message delivery. Part one of this series starts with the storage mechanics. Along the way, we will also discuss some lessons learned while building NATS Streaming, which is a streaming data layer on top of the NATS messaging system. The intended outcome of this series is threefold: to learn a bit about the internals of a log abstraction, to learn how it can achieve the three goals described above, and to learn some applied distributed systems theory.

With that in mind, you will probably never need to build something like this yourself (nor should you), but it helps to know how it works. I also find that software engineering is all about pattern matching. Many types of problems look radically different but are surprisingly similar. Some of these ideas may apply to other things you come across. If nothing else, it’s just interesting.

Let’s start by looking at data storage since this is a critical part of the log and dictates some other aspects of it. Before we dive into that, though, let’s highlight some first principles we’ll use as a starting point for driving our design.

As we know, the log is an ordered, immutable sequence of messages. Messages are atomic, meaning they can’t be broken up. A message is either in the log or not, all or nothing. Although we only ever add messages to the log and never remove them (as with a message queue), the log has a notion of message retention based on some policies, which allows us to control how the log is truncated. This is a practical requirement since otherwise the log will grow endlessly. These policies might be based on time, number of messages, number of bytes, etc.

The log can be played back from any arbitrary position. With position, we normally refer to a logical message timestamp rather than a physical wall-clock time, such as an offset into the log. The log is stored on disk, and sequential disk access is actually relatively fast. The graphic below taken from the ACM Queue article The Pathologies of Big Data helps bear this out (this is helpfully pointed out by Kafka’s documentation).

That said, modern OS page caches mean that sequential access often avoids going to disk altogether. This is because the kernel keeps cached pages in otherwise unused portions of RAM. This means both reads and writes go to the in-memory page cache instead of disk. With Kafka, for example, we can verify this quite easily by running a simple test that writes some data and reads it back and looking at disk IO using iostat. After running such a test, you will likely see something resembling the following, which shows the number of blocks read and written is exactly zero.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.53    0.00   11.28    0.00    0.00   75.19

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda      0.00         0.00         0.00          0          0

With the above in mind, our log starts to look an awful lot like an actual logging file, but instead of timestamps and log messages, we have offsets and opaque data messages. We simply add new messages to the end of the file with a monotonically increasing offset.

However, there are some problems with this approach. Namely, the file is going to get very, very large. Recall that we need to support a few different access patterns: looking up messages by offset and also truncating the log using a variety of different retention policies. Since the log is ordered, a lookup is simply a binary search for the offset, but this is expensive with a large log file. Similarly, aging out data by retention policy is harder.

To account for this, we break up the log file into chunks. In Kafka, these are called segments. In NATS Streaming, they are called slices. Each segment is a new file. At a given time, there is a single active segment, which is the segment messages are written to. Once the segment is full (based on some configuration), a new one is created and becomes active.

Segments are defined by their base offset, i.e. the offset of the first message stored in the segment. In Kafka, the files are also named with this offset. This allows us to quickly locate the segment in which a given message is contained by doing a binary search.

Alongside each segment file is an index file that maps message offsets to their respective positions in the log segment. In Kafka, the index uses 4 bytes for storing an offset relative to the base offset and 4 bytes for storing the log position. Using a relative offset is more efficient because it means we can avoid storing the actual offset as an int64. In NATS Streaming, the timestamp is also stored to do time-based lookups.

Ideally, the data written to the log segment is written in protocol format. That is, what gets written to disk is exactly what gets sent over the wire. This allows for zero-copy reads. Let’s take a look at how this otherwise works.

When you read messages from the log, the kernel will attempt to pull the data from the page cache. If it’s not there, it will be read from disk. The data is copied from disk to page cache, which all happens in kernel space. Next, the data is copied into the application (i.e. user space). This all happens with the read system call. Now the application writes the data out to a socket using send, which is going to copy it back into kernel space to a socket buffer before it’s copied one last time to the NIC. All in all, we have four copies (including one from page cache) and two system calls.

However, if the data is already in wire format, we can bypass user space entirely using the sendfile system call, which will copy the data directly from the page cache to the NIC buffer—two copies (including one from page cache) and one system call. This turns out to be an important optimization, especially in garbage-collected languages since we’re bringing less data into application memory. Zero-copy also reduces CPU cycles and memory bandwidth.

NATS Streaming does not currently make use of zero-copy for a number of reasons, some of which we will get into later in the series. In fact, the NATS Streaming storage layer is actually pluggable in that it can be backed by any number of mediums which implement the storage interface. Out of the box it includes the file-backed storage described above, in-memory, and SQL-backed.

There are a few other optimizations to make here such as message batching and compression, but we’ll leave those as an exercise for the reader.

In part two of this series, we will discuss how to make this log fault tolerant by diving into data-replication techniques.

Thrift on Steroids: A Tale of Scale and Abstraction

Apache Thrift is an RPC framework developed at Facebook for building “scalable cross-language services.” It consists of an interface definition language (IDL), communication protocol, API libraries, and a code generator that allows you to build and evolve services independently and in a polyglot fashion across a wide range of languages. This is nothing new and has been around for over a decade now.

There are a number of notable users of Thrift aside from Facebook, including Twitter (mainly by way of Finagle), Foursquare, Pinterest, Uber (via TChannel), and Evernote, among others—and for good reason, Thrift is mature and battle-tested.

The white paper explains the motivation behind Thrift in greater detail, though I think the following paragraph taken from the introduction does a pretty good job of summarizing it:

As Facebook’s traffic and network structure have scaled, the resource demands of many operations on the site (i.e. search, ad selection and delivery, event logging) have presented technical requirements drastically outside the scope of the LAMP framework. In our implementation of these services, various programming languages have been selected to optimize for the right combination of performance, ease and speed of development, availability of existing libraries, etc. By and large, Facebook’s engineering culture has tended towards choosing the best tools and implementations available over standardizing on any one programming language and begrudgingly accepting its inherent limitations.

Basically, as Facebook scaled, they moved more and more away from PHP and the LAMP stack and became increasingly polyglot. I think this same evolution is seen at most startups as they grow into themselves. We saw a similar transition in my time at Workiva, moving from our monolothic Python application on Google App Engine to a polyglot service-oriented architecture in AWS. It was an exciting but awkward time as we went through our adolescence as an engineering culture and teams started to find their identities. Teams learned what it meant to build backward-compatible APIs and loosely coupled services, how to deprecate APIs, how to build resilient and highly available systems, how to properly instrument services and diagnose issues, how to run and manage the underlying infrastructure, and—most importantly—how to collaborate with each other. There was lots of stumbling and mistakes along the way, lots of postmortems, lots of stress, but with that comes the learning and growing. The payoff is big but the process is painful. I don’t think it ever isn’t.

With one or two services written in the same language and relatively few developers, it was easy to just stick with “REST” (in quotes because it’s always a bastardized version of what REST ought to be), sling some JSON around, and call it a day. As the number of tech stacks and integration points increase, it becomes apparent that some standards are important. And once things are highly polyglot with lots of developers and lots of services running with lots of versions, strict service contracts become essential.

Uber has a blog post on building microservices that explains this and why they settled on Thrift to solve this problem.

Since the number of service calls grows rapidly, it is necessary to maintain a well-defined interface for every call. We knew we wanted to use an IDL for managing this interface, and we ultimately decided on Thrift. Thrift forces service owners to publish strict interface definitions, which streamlines the process of integrating with services. Calls that do not abide by the interface are rejected at the Thrift level instead of leaking into a service and failing deeper within the code. This strategy of publicly declaring your interface emphasizes the importance of backwards compatibility, since multiple versions of a service’s Thrift interface could be in use at any given time. The service author must not make breaking changes, and instead must only make non-breaking additions to the interface definition until all consumers are ready for deprecation.

Early on, I was tasked with building a unified messaging solution that would help with our integration challenges. The advantages of a unified solution should be obvious: reusability (before this, everyone was solving the problem in their own way), focus (allow developers to focus on their problem space, not the glue), acceleration (if the tools are already available, there’s less work to do), and shared pain points (it’s a lot easier to prioritize your work when everyone is complaining about the same thing). Also, a longer term benefit is developing the knowledge of this shared solution into an organizational competency which has a sort of “economy of scale” to it. Our job was not just to ship a messaging platform but evangelize it and help other teams to be successful with it. We did this through countless blog posts, training sessions, workshops, talks, and even a podcast.

Before we set out on building a common messaging solution, there were a few key principles we used to guide ourselves. We wanted to provide a core set of tools, libraries, and infrastructure for service integration. We wanted a solution that was rigid yet flexible. We provide only a minimal set of messaging patterns to act as generic building blocks with strict, strongly typed APIs, and promote design best practices and a service-oriented mindset. This meant supporting service evolution and API iteration through versioning and backward compatibility, allowing for resiliency patterns like timeouts, retries, circuit breakers, etc., and generally advocating asynchronous, loosely coupled communication. Lastly, we had to keep in mind that, at the end of the day, developers are just trying to ship stuff, so we had to balance these concerns out with ergonomics and developer experience so they could build, integrate, and ship quickly.

As much as I think RPC is a bad abstraction, it’s what developers want. If you don’t provide them with an RPC solution, they will build their own, so we had to provide first-class support for it. We evaluated solutions in the RPC space. We looked at GRPC extensively, which is the new RPC hotness from Google, but it had a few key drawbacks, namely its “newness” (it was still in early beta at the time and has since been almost entirely rewritten), it’s coupled to HTTP/2 as a transport (which at the time had fairly limited support), and it lacks support for JavaScript (let alone Dart, which is what most of our client applications were being written in). Avro was another we looked at.

Ultimately, we settled on Thrift due to its maturity and wide use in production, its performance, its architecture (it separates out the transports, protocols, and RPC layer with the first two being pluggable), its rich feature set, and its wide range of language support (checking off all the languages we standardized on as a company including Go, Java, Python, JavaScript, and Dart). Thrift is not without its problems though—more on this in a bit.

In addition to RPC, we wanted to promote a more asynchronous, message-passing style of communication with pub/sub. This would allow for greater flexibility in messaging patterns like fan-out and fan-in, interest-based messaging, and reduced coupling and fragility of services. This enables things like the worker pattern where we can distribute work to a pool of workers and scale that pool independently, whereas RPC tends to promote more stateful types of services. In my experience, developers tend to bias towards stateful services since this is how we’ve built things for a long time, but as we’ve entered the cloud-native era, things are running in containers which are autoscaled, more ephemeral, and more distributed. We have to grapple with the complexity imposed by distributed systems. This is why asynchronous messaging is important and why we wanted to support it from the onset.

We selected NATS as a messaging backplane because of its simplicity, performance, scalability, and adoption of the cloud-native mentality. When it comes to service integration, you need an always-on dial tone and NATS provides just that. Because of Thrift’s pluggable transport layer, we could build a NATS RPC transport while also providing HTTP and TCP transports.

Unfortunately, Thrift doesn’t provide any kind of support for pub/sub, and we wanted the same guarantees for it that we had with RPC, like type safety and versioning with code-generated APIs and service contracts. Aside from this, Thrift has a number of other, more glaring problems:

  • Head-of-line blocking: a single, slow request will block any subsequent requests for a client.
  • Out-of-order responses: an out-of-order response puts a Thrift transport in a bad state, requiring it to be torn down and reestablished, e.g. if a slow request times out at the client, the client issues a subsequent request, and a response comes back for the first request, the client blows up.
  • Concurrency: a Thrift client cannot be shared between multiple threads of execution, requiring each thread to have its own client issuing requests sequentially. This, combined with head-of-line blocking, is a major performance killer. This problem is compounded when each transport has its own resources, such as a socket.
  • RPC timeouts: Thrift does not provide good facilities for per-request timeouts, instead opting for a global transport read timeout.
  • Request headers: Thrift does not provide support for request metadata, making it difficult to implement things like authentication/authorization and distributed tracing. Instead, you are required to bake these things into your IDL or in a wrapped transport. The problem with this is it puts the onus on service providers rather than allowing an API gateway or middleware to perform these functions in a centralized way.
  • Middleware: Thrift does not have any support for client or server middleware. This means clients must be wrapped to implement interceptor logic and middleware code must be duplicated within handler functions. This makes it impossible to implement AOP-style logic in a clean, DRY way.

Twitter’s Finagle addresses many of these issues but is solely for the JVM, so we decided to address Thrift’s shortcomings in a cross-platform way without completely reinventing the wheel. That is, we took Thrift and extended it. What we ended up with was Frugal, a superset of Thrift recently open sourced that aims to solve the problems described above while also providing support for asynchronous pub/sub APIs—a sort of Thrift on steroids as I’ve come to call it. Its key features include:

  • Request multiplexing: client requests are fully multiplexed, allowing them to be issued concurrently while simultaneously avoiding the head-of-line blocking and out-of-order response problems. This also lays some groundwork for asynchronous messaging patterns.
  • Thread-safety: clients can be safely shared between multiple threads in which requests can be made in parallel.
  • Pub/sub: IDL and code-generation extensions for defining pub/sub APIs in a type-safe way.
  • Request context: a first-class request context object is added to every operation which allows defining request/response headers and per-request timeouts. By making the context part of the Frugal protocol, headers can be introspected or even injected by external middleware. This context could be used to send OAuth2 tokens and user-context information, avoiding the need to include it everywhere in your IDL and handler logic. Correlation IDs for distributed tracing purposes are also built into the request context.
  • Middleware: client- and server- side middleware is supported for RPC and pub/sub APIs. This allows you to implement interceptor logic around handler functions, e.g. for authentication, logging, or retry policies. One can easily integrate OpenTracing as a middleware, for example.
  • Cross-language: support for Go, Java, Dart, and Python (2.7 and 3.5).

Frugal adds a second kind of transport alongside Thrift’s RPC transport for pub/sub. With this, we provide a NATS transport for both pub/sub and RPC (internally, Workiva also has an at-least-once delivery pub/sub transport built around Amazon SQS for mission-critical data). In addition to this, we built a SDK which developers use to connect to the messaging infrastructure (such as NATS) with minimal ceremony. The messaging SDK played a vital role not just in making it easy for developers to adopt and integrate, but providing us a shim where we could introduce sweeping changes across the organization in one place, such as adding instrumentation, tracing, and authentication. This enabled us to roll critical integration components out to every service by making a change in one place.

To support pub/sub, we extended the Thrift IDL with an additional top-level construct called a scope, which is effectively a pub/sub namespace (basically what a service is to RPC). We wrote the IDL using a parsing expression grammar which allows us to generate a parser. We then implemented a code generator for the various language targets. The Frugal compiler is written in Go and is, at least in my opinion, much more maintainable than Thrift’s C++ codebase. However, the language libraries make use of the existing Thrift APIs, such as protocols, transports, etc. This means we didn’t need to implement any of the low-level mechanics like serialization.

I’ve since left Workiva (and am now actually working on NATS), but as far as I know, Frugal helps power nearly every production service at the company. It was an interesting experience from which I learned a lot. I was happy to see some of that work open sourced so others could use it and learn from it.

Of course, if I were starting over today, things would probably look different. GRPC is much more mature and the notion of a “service mesh” has taken the container world by storm with things like Istio, Linkerd, and Envoy. What we built was Workiva’s service mesh, we just didn’t have a name for it, so we called it a “Messaging SDK.” The corollary to this is you don’t need to adopt bleeding-edge tech to be successful. The concepts are what’s important, and if enough people are working on the same types of problems in parallel, they will likely converge on solutions that look very similar to each other given enough time and enough people working on them.

I think there’s a delicate balance between providing solutions that are “easy” from a developer point of view but may provide longer term drawbacks when it comes to building complex systems the “right” way. I see RPC as an example of this. It’s an “easy” abstraction but it hides a lot of complexity. Service meshes might even be in this category, but they have obvious upsides when it comes to building software in a way that is scalable. Peter Alvaro’s Strange Loop talk “I See What You Mean” does a great job of articulating this dilemma, which I’ve also written about myself. In the end, we decided to optimize for shipping, but we took a principled approach: provide the tools developers need (or want) but help educate them to utilize those tools in a way that allows them to ship products that are reliable and maintainable. Throwing tools or code over the wall is not enough.

Software Is About Storytelling

Software engineering is more a practice in archeology than it is in building. As an industry, we undervalue storytelling and focus too much on artifacts and tools and deliverables. How many times have you been left scratching your head while looking at a piece of code, system, or process? It’s the story, the legacy left behind by that artifact, that is just as important—if not more—than the artifact itself.

And I don’t mean what’s in the version control history—that’s often useless. I mean the real, human story behind something. Artifacts, whether that’s code or tools or something else entirely, are not just snapshots in time. They’re the result of a series of decisions, discussions, mistakes, corrections, problems, constraints, and so on.  They’re the product of the engineering process, but the problem is they usually don’t capture that process in its entirety. They rarely capture it at all. They commonly end up being nothing but a snapshot in time.

It’s often the sign of an inexperienced engineer when someone looks at something and says, “this is stupid” or “why are they using X instead of Y?” They’re ignoring the context, the fact that circumstances may have been different. There is a story that led up to that point, a reason for why things are the way they are. If you’re lucky, the people involved are still around. Unfortunately, this is not typically the case. And so it’s not necessarily the poor engineer’s fault for wondering these things. Their predecessors haven’t done enough to make that story discoverable and share that context.

I worked at a company that built a homegrown container PaaS on ECS. Doing that today would be insane with the plethora of container solutions available now. “Why aren’t you using Kubernetes?” Well, four years ago when we started, Kubernetes didn’t exist. Even Docker was just in its infancy. And it’s not exactly a flick of a switch to move multiple production environments to a new container runtime, not to mention the politicking with leadership to convince them it’s worth it to not ship any new code for the next quarter as we rearchitect our entire platform. Oh, and now the people behind the original solution are no longer with the company. Good luck! And this is on the timescale of about five years. That’s maybe like one generation of engineers at the company at most—nothing compared to the decades or more software usually lives (an interesting observation is that timescale, I think, is proportional to the size of an organization). Don’t underestimate momentum, but also don’t underestimate changing circumstances, even on a small time horizon.

The point is, stop looking at technology in a vacuum. There are many facets to consider. Likewise, decisions are not made in a vacuum. Part of this is just being an empathetic engineer. The corollary to this is you don’t need to adopt every bleeding-edge tech that comes out to be successful, but the bigger point is software is about storytelling. The question you should be asking is how does your organization tell those stories? Are you deliberate or is it left to tribal knowledge and hearsay? Is it something you truly value and prioritize or simply a byproduct?

Documentation is good, but the trouble with documentation is it’s usually haphazard and stagnant. It’s also usually documentation of how and not why. Documenting intent can go a long way, and understanding the why is a good way to develop empathy. Code survives us. There’s a fantastic talk by Bryan Cantrill on oral tradition in software engineering where he talks about this. People care about intent. Specifically, when you write software, people care what you think. As Bryan puts it, future generations of programmers want to understand your intent so they can abide by it, so we need to tell them what our intent was. We need to broadcast it. Good code comments are an example of this. They give you a narrative of not only what’s going on, but why. When we write software, we write it for future generations, and that’s the most underestimated thing in all of software. Documenting intent also allows you to document your values, and that allows the people who come after you to continue to uphold them.

Storytelling in software is important. Without it, software archeology is simply the study of puzzles created by time and neglect. When an organization doesn’t record its history, it’s bound to repeat the same mistakes. A company’s memory is comprised of its people, but the fact is people churn. Knowing how you got here often helps you with getting to where you want to be. Storytelling is how we transcend generational gaps and the inevitable changing of the old guard to the new guard in a maturing engineering organization. The same is true when we expand that to the entire industry. We’re too memoryless—shipping code and not looking back, discovering everything old that is new again, and simply not appreciating our lineage.

FIFO, Exactly-Once, and Other Costs

There’s been a lot of discussion about exactly-once semantics lately, sparked by the recent announcement of support for it in Kafka 0.11. I’ve already written at length about strong guarantees in messaging.

My former coworker Kevin Sookocheff recently made a post about ordered and exactly-once message delivery as it relates to Amazon SQS. It does a good job of illustrating what the trade-offs are, and I want to drive home some points.

In the article, Kevin shows how FIFO delivery is really only meaningful when you have one single-threaded publisher and one single-threaded receiver. Amazon’s FIFO queues allow you to control how restrictive this requirement is by applying ordering on a per-group basis. In other words, we can improve throughput if we can partition work into different ordered groups rather than a single totally ordered group. However, FIFO still effectively limits throughput on a group to a single publisher and single subscriber. If there are multiple publishers, they have to coordinate to ensure ordering is preserved with respect to our application’s semantics. On the subscriber side, things are simpler because SQS will only deliver messages in a group one at a time in order amongst subscribers.

Amazon’s FIFO queues also have an exactly-once processing feature which deduplicates messages within a five-minute window. Note, however, that there are some caveats with this, the obvious one being duplicate delivery outside of the five-minute window. A mission-critical system would have to be designed to account for this possibility. My argument here is if you still have to account for it, what’s the point unless the cost of detecting duplicates is prohibitively expensive? But to be fair, five minutes probably reduces the likelihood enough to the point that it’s useful and in those rare cases where it fails, the duplicate is acceptable.

The more interesting caveat is that FIFO queues do not guarantee exactly-once delivery to consumers (which, as we know, is impossible). Rather, they offer exactly-once processing by guaranteeing that once a message has successfully been acknowledged as processed, it won’t be delivered again. It’s up to applications to ack appropriately. When a message is delivered to a consumer, it remains in the queue until it’s acked. The visibility timeout prevents other consumers from processing it. With FIFO queues, this also means head-of-line blocking for other messages in the same group.

Now, let’s assume a subscriber receives a batch of messages from the queue, processes them—perhaps by storing some results to a database—and then sends an acknowledgement back to SQS which removes them from the queue. It’s entirely possible that during that process step a delay happens—a prolonged GC pause, crash, network delay, whatever. When this happens, the visibility timeout expires and the messages are redelivered and, potentially, reprocessed. What has to happen here is essentially cooperation between the queue and processing step. We might do this by using a database transaction to atomically process and acknowledge the messages. An alternative, yet similar, approach might be to use a write-ahead-log-like strategy whereby the consuming system reads messages from SQS and transactionally stores them in a database for future processing. Once the messages have been committed, the consumer deletes the messages from SQS. In either of these approaches, we’re basically shifting the onus of exactly-once processing onto an ACID-compliant relational database.

Note that this is really how Kafka achieves its exactly-once semantics. It requires end-to-end cooperation for exactly-once to work. State changes in your application need to be committed transactionally with your Kafka offsets.

As Kevin points out, FIFO SQS queues offer exactly-once processing only if 1) publishers never publish duplicate messages wider than five minutes apart and 2) consumers never fail to delete messages they have processed from the queue. Solving either of these problems probably requires some kind of coordination between the application and queue, likely in the form of a database transaction. And if you’re using a database either as the message source, sink, or both, what are exactly-once FIFO queues actually buying you? You’re paying a seemingly large cost in throughput for little perceived value. Your messages are already going through some sort of transactional boundary that provides ordering and uniqueness.

Where I see FIFO and exactly-once semantics being useful is when talking to systems which cannot cooperate with the end-to-end transaction. This might be a legacy service or a system with side effects, such as sending an email. Often in the case of these “distributed workflows”, latency is a lower priority and humans can be involved in various steps. Other use cases might be scheduled integrations with legacy batch processes where throughput is known a priori. These can simply be re-run when errors occur.

When people describe a messaging system with FIFO and exactly-once semantics, they’re usually providing a poor description of a relational database with support for ACID transactions. Providing these semantics in a messaging system likely still involves database transactions, it’s just more complicated. It turns out relational databases are really good at ensuring invariants like exactly-once.

I’ve picked on Kafka a bit in the past, especially with the exactly-once announcement, but my issue is not with Kafka itself. Kafka is a fantastic technology. It’s well-architected, battle-tested, and the team behind it is talented and knows the space well. My issue is more with some of the intangible costs associated with it. The same goes for similar systems (like exactly-once FIFO SQS queues). Beyond just the operational complexity (which Confluent is attempting to tackle with its Kafka-as-a-service), you have to get developer understanding. This is harder than it sounds in any modestly-sized organization. That’s not to say that developers are dumb or incapable of understanding, but the fact is your average developer is simply not thinking about all of the edge cases brought on by operating distributed systems at scale. They see “exactly-once FIFO queues” in SQS or “exactly-once delivery” in Kafka and take it at face value. They don’t read beyond the headline. They don’t look for the caveats. That’s why I took issue with how Kafka claimed to do the impossible with exactly-once delivery when it’s really exactly-once processing or, as I’ve come to call it, “atomic processing.” Henry Robinson put it best when talking about the Kafka announcement:

If I were to rewrite the article, I’d structure it thus: “exactly-once looks like atomic broadcast. Atomic broadcast is impossible. Here’s how exactly-once might fail, and here’s why we think you shouldn’t be worried about it.” That’s a harder argument for users to swallow…

Basically “exactly-once” markets better. It’s something developers can latch onto, but it’s also misleading. I know it’s only a matter of time before people start linking me to the Confluent post saying, “see, exactly-once is easy!” But this is just pain deferral. On the contrary, exactly-once semantics require careful construction of your application, assume a closed, transactional world, and do not support the case where I think people want exactly-once the most: side effects.

Interestingly, one of my chief concerns about Kafka’s implementation was what the difficulty of ensuring end-to-end cooperation would be in practice. Side effects into downstream systems with no support for idempotency or transactions could make it difficult. Jay’s counterpoint to this was that the majority of users are using good old-fashioned relational databases, so all you really need to do is commit your offsets and state changes together. It’s not trivial, but it’s not that much harder than avoiding partial updates on failure if you’re updating multiple tables. This brings us back to two of the original points of contention: why not merely use the database for exactly-once in the first place and what about legacy systems?

That’s not to say exactly-once semantics, as offered in systems like SQS and Kafka, are not useful. I think we just need to be more conscientious of the other costs and encourage developers to more deeply understand the solution space—too much sprinkling on of Kafka or exactly-once or FIFO and not enough thinking about the actual business problem. Too much prescribing of solutions and not enough describing of problems.

My thanks to Kevin Sookocheff and Beau Lyddon for reviewing this post.

Are We There Yet: The Go Generics Debate

At GopherCon a couple weeks ago, Russ Cox gave a talk titled The Future of Go, in which he discussed what the Go community might want to change about the language—particularly for the so-called Go 2.0 milestone—and the process for realizing those changes. Part of that process is identifying real-world use cases through experience reports, which turn an abstract problem into a concrete one and help the core team to understand its significance. Also mentioned in the talk, of course, were generics. Over the weekend, Dave Cheney posted Should Go 2.0 support generics? Allow me to add to the noise.

First, I agree with Dave’s point in that the Go team has two choices: implement templated types and parameterized functions (or some equivalent thereof) or don’t and own that decision. Though I think his example of Haskell programmers owning their differences with imperative programming is a bit of a false equivalence. Haskell is firmly in the functional camp, while Go is firmly in the imperative. While there is overlap between the two, I think most programmers understand the difference. Few people are asking Haskell to be more PHP-like, but I bet a lot more are asking Elm to be more Haskell-like. Likewise, lots of people are asking Go to be more Java-like—if only a little. This is why so many are asking for generics, and Dave says it himself: “Mainstream programmers expect some form of templated types because they’re used to it in the other languages they interact with alongside Go.”

But I digress. The point is that the Go team should not pay any lip service to the generics discussion if they are not going to fully commit to addressing the problem. There is plenty of type theory and prior art. It’s not a technical problem, it’s a philosophical one.

That said, if the Go team decides to say “no” to generics and own that decision, I suspect they will never fully put an end to the discussion. The similarities to other mainstream languages are too close, unlike the PHP/Haskell example, and the case for generics too compelling, unlike the composition-over-inheritance decision. The lack of generics in Go has already become a meme at this point.

Ian Lance Taylor’s recent post shows there is a divide within the Go team regarding generics. He tells the story of how the copy() and append() functions came about, noting that they were added only because of the lack of generics.

The point I want to make here is that because we had no way to write a generic Vector type with an Append method, we wound up adding a special purpose language feature to implement it. A language that supported parameterized types with methods would not have required a special built-in function that only works with slices. An append operation makes sense for other sorts of data structures, such as various kinds of linked lists. The built-in append function can not be used for them.

My biggest complaint on the matter at this point in the language’s life is the doublethink. That is, the mere existence of channels, maps, and slices seems like a contradiction to the argument against generics. The experience reports supporting generics are trivial: every time someone instantiates one of these things. The argument is that having a small number of built-in generic types is substantially different than allowing them to be user-defined. Allowing generic APIs to be strewn about will increase the cognitive load on developers. Yet, at the same time, we’re okay with adding new APIs to the standard library that do not provide compile-time type safety by effectively subverting the type system through the use of interface{}. By using interface{}, we instead push the responsibility of type safety onto users to do it at runtime in precarious fashion for something the compiler could be well-equipped to handle. The argument here is what is the greater cognitive load? More generally, it’s where should the responsibility lie? Considering Go’s lineage and its aim at the “ordinary” programmer, I’d argue safety—in this case—should be the language’s responsibility.

However, we haven’t fully addressed the “complex generic APIs strewn about” argument. Generics are not the source of complexity in API design, poorly designed APIs are. For every bad generic API in Java, I’ll show you a good one. In Go, I would argue more complexity comes from the workarounds to the lack of generics—interface{}, reflection, code duplication, code generation, type assertions—than from introducing them into the language, not to mention the performance cost with some of these workarounds. The heap and list packages are some of my least favorite packages in the language, largely due to their use of interface{}.

Another frustration I have with the argument against generics is the anecdotal evidence—”I’ve never experienced a need for generics, so anyone who has must be wrong.” It’s a weird rationalization. It’s like a C programmer arguing against garbage collection to build a web app because free() works fine. Someone pointed out to me what’s actually going on is the Blub paradox, which Paul Graham coined.

Graham considers the hierarchy of programming languages with the example of “Blub”, a hypothetically average language “right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.” It was used by Graham to illustrate a comparison, beyond Turing completeness, of programming language power, and more specifically to illustrate the difficulty of comparing a programming language one knows to one that one does not.

Graham considers a hypothetical Blub programmer. When the programmer looks down the “power continuum”, he considers the lower languages to be less powerful because they miss some feature that a Blub programmer is used to. But when he looks up, he fails to realise that he is looking up: he merely sees “weird languages” with unnecessary features and assumes they are equivalent in power, but with “other hairy stuff thrown in as well”. When Graham considers the point of view of a programmer using a language higher than Blub, he describes that programmer as looking down on Blub and noting its “missing” features from the point of view of the higher language.

Graham describes this as the “Blub paradox” and concludes that “By induction, the only programmers in a position to see all the differences in power between the various languages are those who understand the most powerful one.”

I sympathize with the Go team’s desire to keep the overall surface area of the language small and the complexity low, but I have a hard time reconciling this with the existing built-in generics and continued use of interface{} in the standard library. At the very least, I think Go would be better off with continued use of built-in generics for certain types instead of adding more APIs using interface{} like sync.Map, sync.Pool, and atomic.Value. Certainly, I think the debate is worth having though, but the hero-worship and genetic-fallacy type arguments do not further the discussion. My gut feeling is that Go will eventually have generics. It’s just a question of when and how.