Thrift on Steroids: A Tale of Scale and Abstraction

Apache Thrift is an RPC framework developed at Facebook for building “scalable cross-language services.” It consists of an interface definition language (IDL), communication protocol, API libraries, and a code generator that allows you to build and evolve services independently and in a polyglot fashion across a wide range of languages. This is nothing new and has been around for over a decade now.

There are a number of notable users of Thrift aside from Facebook, including Twitter (mainly by way of Finagle), Foursquare, Pinterest, Uber (via TChannel), and Evernote, among others—and for good reason, Thrift is mature and battle-tested.

The white paper explains the motivation behind Thrift in greater detail, though I think the following paragraph taken from the introduction does a pretty good job of summarizing it:

As Facebook’s traffic and network structure have scaled, the resource demands of many operations on the site (i.e. search, ad selection and delivery, event logging) have presented technical requirements drastically outside the scope of the LAMP framework. In our implementation of these services, various programming languages have been selected to optimize for the right combination of performance, ease and speed of development, availability of existing libraries, etc. By and large, Facebook’s engineering culture has tended towards choosing the best tools and implementations available over standardizing on any one programming language and begrudgingly accepting its inherent limitations.

Basically, as Facebook scaled, they moved more and more away from PHP and the LAMP stack and became increasingly polyglot. I think this same evolution is seen at most startups as they grow into themselves. We saw a similar transition in my time at Workiva, moving from our monolothic Python application on Google App Engine to a polyglot service-oriented architecture in AWS. It was an exciting but awkward time as we went through our adolescence as an engineering culture and teams started to find their identities. Teams learned what it meant to build backward-compatible APIs and loosely coupled services, how to deprecate APIs, how to build resilient and highly available systems, how to properly instrument services and diagnose issues, how to run and manage the underlying infrastructure, and—most importantly—how to collaborate with each other. There was lots of stumbling and mistakes along the way, lots of postmortems, lots of stress, but with that comes the learning and growing. The payoff is big but the process is painful. I don’t think it ever isn’t.

With one or two services written in the same language and relatively few developers, it was easy to just stick with “REST” (in quotes because it’s always a bastardized version of what REST ought to be), sling some JSON around, and call it a day. As the number of tech stacks and integration points increase, it becomes apparent that some standards are important. And once things are highly polyglot with lots of developers and lots of services running with lots of versions, strict service contracts become essential.

Uber has a blog post on building microservices that explains this and why they settled on Thrift to solve this problem.

Since the number of service calls grows rapidly, it is necessary to maintain a well-defined interface for every call. We knew we wanted to use an IDL for managing this interface, and we ultimately decided on Thrift. Thrift forces service owners to publish strict interface definitions, which streamlines the process of integrating with services. Calls that do not abide by the interface are rejected at the Thrift level instead of leaking into a service and failing deeper within the code. This strategy of publicly declaring your interface emphasizes the importance of backwards compatibility, since multiple versions of a service’s Thrift interface could be in use at any given time. The service author must not make breaking changes, and instead must only make non-breaking additions to the interface definition until all consumers are ready for deprecation.

Early on, I was tasked with building a unified messaging solution that would help with our integration challenges. The advantages of a unified solution should be obvious: reusability (before this, everyone was solving the problem in their own way), focus (allow developers to focus on their problem space, not the glue), acceleration (if the tools are already available, there’s less work to do), and shared pain points (it’s a lot easier to prioritize your work when everyone is complaining about the same thing). Also, a longer term benefit is developing the knowledge of this shared solution into an organizational competency which has a sort of “economy of scale” to it. Our job was not just to ship a messaging platform but evangelize it and help other teams to be successful with it. We did this through countless blog posts, training sessions, workshops, talks, and even a podcast.

Before we set out on building a common messaging solution, there were a few key principles we used to guide ourselves. We wanted to provide a core set of tools, libraries, and infrastructure for service integration. We wanted a solution that was rigid yet flexible. We provide only a minimal set of messaging patterns to act as generic building blocks with strict, strongly typed APIs, and promote design best practices and a service-oriented mindset. This meant supporting service evolution and API iteration through versioning and backward compatibility, allowing for resiliency patterns like timeouts, retries, circuit breakers, etc., and generally advocating asynchronous, loosely coupled communication. Lastly, we had to keep in mind that, at the end of the day, developers are just trying to ship stuff, so we had to balance these concerns out with ergonomics and developer experience so they could build, integrate, and ship quickly.

As much as I think RPC is a bad abstraction, it’s what developers want. If you don’t provide them with an RPC solution, they will build their own, so we had to provide first-class support for it. We evaluated solutions in the RPC space. We looked at GRPC extensively, which is the new RPC hotness from Google, but it had a few key drawbacks, namely its “newness” (it was still in early beta at the time and has since been almost entirely rewritten), it’s coupled to HTTP/2 as a transport (which at the time had fairly limited support), and it lacks support for JavaScript (let alone Dart, which is what most of our client applications were being written in). Avro was another we looked at.

Ultimately, we settled on Thrift due to its maturity and wide use in production, its performance, its architecture (it separates out the transports, protocols, and RPC layer with the first two being pluggable), its rich feature set, and its wide range of language support (checking off all the languages we standardized on as a company including Go, Java, Python, JavaScript, and Dart). Thrift is not without its problems though—more on this in a bit.

In addition to RPC, we wanted to promote a more asynchronous, message-passing style of communication with pub/sub. This would allow for greater flexibility in messaging patterns like fan-out and fan-in, interest-based messaging, and reduced coupling and fragility of services. This enables things like the worker pattern where we can distribute work to a pool of workers and scale that pool independently, whereas RPC tends to promote more stateful types of services. In my experience, developers tend to bias towards stateful services since this is how we’ve built things for a long time, but as we’ve entered the cloud-native era, things are running in containers which are autoscaled, more ephemeral, and more distributed. We have to grapple with the complexity imposed by distributed systems. This is why asynchronous messaging is important and why we wanted to support it from the onset.

We selected NATS as a messaging backplane because of its simplicity, performance, scalability, and adoption of the cloud-native mentality. When it comes to service integration, you need an always-on dial tone and NATS provides just that. Because of Thrift’s pluggable transport layer, we could build a NATS RPC transport while also providing HTTP and TCP transports.

Unfortunately, Thrift doesn’t provide any kind of support for pub/sub, and we wanted the same guarantees for it that we had with RPC, like type safety and versioning with code-generated APIs and service contracts. Aside from this, Thrift has a number of other, more glaring problems:

  • Head-of-line blocking: a single, slow request will block any subsequent requests for a client.
  • Out-of-order responses: an out-of-order response puts a Thrift transport in a bad state, requiring it to be torn down and reestablished, e.g. if a slow request times out at the client, the client issues a subsequent request, and a response comes back for the first request, the client blows up.
  • Concurrency: a Thrift client cannot be shared between multiple threads of execution, requiring each thread to have its own client issuing requests sequentially. This, combined with head-of-line blocking, is a major performance killer. This problem is compounded when each transport has its own resources, such as a socket.
  • RPC timeouts: Thrift does not provide good facilities for per-request timeouts, instead opting for a global transport read timeout.
  • Request headers: Thrift does not provide support for request metadata, making it difficult to implement things like authentication/authorization and distributed tracing. Instead, you are required to bake these things into your IDL or in a wrapped transport. The problem with this is it puts the onus on service providers rather than allowing an API gateway or middleware to perform these functions in a centralized way.
  • Middleware: Thrift does not have any support for client or server middleware. This means clients must be wrapped to implement interceptor logic and middleware code must be duplicated within handler functions. This makes it impossible to implement AOP-style logic in a clean, DRY way.

Twitter’s Finagle addresses many of these issues but is solely for the JVM, so we decided to address Thrift’s shortcomings in a cross-platform way without completely reinventing the wheel. That is, we took Thrift and extended it. What we ended up with was Frugal, a superset of Thrift recently open sourced that aims to solve the problems described above while also providing support for asynchronous pub/sub APIs—a sort of Thrift on steroids as I’ve come to call it. Its key features include:

  • Request multiplexing: client requests are fully multiplexed, allowing them to be issued concurrently while simultaneously avoiding the head-of-line blocking and out-of-order response problems. This also lays some groundwork for asynchronous messaging patterns.
  • Thread-safety: clients can be safely shared between multiple threads in which requests can be made in parallel.
  • Pub/sub: IDL and code-generation extensions for defining pub/sub APIs in a type-safe way.
  • Request context: a first-class request context object is added to every operation which allows defining request/response headers and per-request timeouts. By making the context part of the Frugal protocol, headers can be introspected or even injected by external middleware. This context could be used to send OAuth2 tokens and user-context information, avoiding the need to include it everywhere in your IDL and handler logic. Correlation IDs for distributed tracing purposes are also built into the request context.
  • Middleware: client- and server- side middleware is supported for RPC and pub/sub APIs. This allows you to implement interceptor logic around handler functions, e.g. for authentication, logging, or retry policies. One can easily integrate OpenTracing as a middleware, for example.
  • Cross-language: support for Go, Java, Dart, and Python (2.7 and 3.5).

Frugal adds a second kind of transport alongside Thrift’s RPC transport for pub/sub. With this, we provide a NATS transport for both pub/sub and RPC (internally, Workiva also has an at-least-once delivery pub/sub transport built around Amazon SQS for mission-critical data). In addition to this, we built a SDK which developers use to connect to the messaging infrastructure (such as NATS) with minimal ceremony. The messaging SDK played a vital role not just in making it easy for developers to adopt and integrate, but providing us a shim where we could introduce sweeping changes across the organization in one place, such as adding instrumentation, tracing, and authentication. This enabled us to roll critical integration components out to every service by making a change in one place.

To support pub/sub, we extended the Thrift IDL with an additional top-level construct called a scope, which is effectively a pub/sub namespace (basically what a service is to RPC). We wrote the IDL using a parsing expression grammar which allows us to generate a parser. We then implemented a code generator for the various language targets. The Frugal compiler is written in Go and is, at least in my opinion, much more maintainable than Thrift’s C++ codebase. However, the language libraries make use of the existing Thrift APIs, such as protocols, transports, etc. This means we didn’t need to implement any of the low-level mechanics like serialization.

I’ve since left Workiva (and am now actually working on NATS), but as far as I know, Frugal helps power nearly every production service at the company. It was an interesting experience from which I learned a lot. I was happy to see some of that work open sourced so others could use it and learn from it.

Of course, if I were starting over today, things would probably look different. GRPC is much more mature and the notion of a “service mesh” has taken the container world by storm with things like Istio, Linkerd, and Envoy. What we built was Workiva’s service mesh, we just didn’t have a name for it, so we called it a “Messaging SDK.” The corollary to this is you don’t need to adopt bleeding-edge tech to be successful. The concepts are what’s important, and if enough people are working on the same types of problems in parallel, they will likely converge on solutions that look very similar to each other given enough time and enough people working on them.

I think there’s a delicate balance between providing solutions that are “easy” from a developer point of view but may provide longer term drawbacks when it comes to building complex systems the “right” way. I see RPC as an example of this. It’s an “easy” abstraction but it hides a lot of complexity. Service meshes might even be in this category, but they have obvious upsides when it comes to building software in a way that is scalable. Peter Alvaro’s Strange Loop talk “I See What You Mean” does a great job of articulating this dilemma, which I’ve also written about myself. In the end, we decided to optimize for shipping, but we took a principled approach: provide the tools developers need (or want) but help educate them to utilize those tools in a way that allows them to ship products that are reliable and maintainable. Throwing tools or code over the wall is not enough.

Software Is About Storytelling

Software engineering is more a practice in archeology than it is in building. As an industry, we undervalue storytelling and focus too much on artifacts and tools and deliverables. How many times have you been left scratching your head while looking at a piece of code, system, or process? It’s the story, the legacy left behind by that artifact, that is just as important—if not more—than the artifact itself.

And I don’t mean what’s in the version control history—that’s often useless. I mean the real, human story behind something. Artifacts, whether that’s code or tools or something else entirely, are not just snapshots in time. They’re the result of a series of decisions, discussions, mistakes, corrections, problems, constraints, and so on.  They’re the product of the engineering process, but the problem is they usually don’t capture that process in its entirety. They rarely capture it at all. They commonly end up being nothing but a snapshot in time.

It’s often the sign of an inexperienced engineer when someone looks at something and says, “this is stupid” or “why are they using X instead of Y?” They’re ignoring the context, the fact that circumstances may have been different. There is a story that led up to that point, a reason for why things are the way they are. If you’re lucky, the people involved are still around. Unfortunately, this is not typically the case. And so it’s not necessarily the poor engineer’s fault for wondering these things. Their predecessors haven’t done enough to make that story discoverable and share that context.

I worked at a company that built a homegrown container PaaS on ECS. Doing that today would be insane with the plethora of container solutions available now. “Why aren’t you using Kubernetes?” Well, four years ago when we started, Kubernetes didn’t exist. Even Docker was just in its infancy. And it’s not exactly a flick of a switch to move multiple production environments to a new container runtime, not to mention the politicking with leadership to convince them it’s worth it to not ship any new code for the next quarter as we rearchitect our entire platform. Oh, and now the people behind the original solution are no longer with the company. Good luck! And this is on the timescale of about five years. That’s maybe like one generation of engineers at the company at most—nothing compared to the decades or more software usually lives (an interesting observation is that timescale, I think, is proportional to the size of an organization). Don’t underestimate momentum, but also don’t underestimate changing circumstances, even on a small time horizon.

The point is, stop looking at technology in a vacuum. There are many facets to consider. Likewise, decisions are not made in a vacuum. Part of this is just being an empathetic engineer. The corollary to this is you don’t need to adopt every bleeding-edge tech that comes out to be successful, but the bigger point is software is about storytelling. The question you should be asking is how does your organization tell those stories? Are you deliberate or is it left to tribal knowledge and hearsay? Is it something you truly value and prioritize or simply a byproduct?

Documentation is good, but the trouble with documentation is it’s usually haphazard and stagnant. It’s also usually documentation of how and not why. Documenting intent can go a long way, and understanding the why is a good way to develop empathy. Code survives us. There’s a fantastic talk by Bryan Cantrill on oral tradition in software engineering where he talks about this. People care about intent. Specifically, when you write software, people care what you think. As Bryan puts it, future generations of programmers want to understand your intent so they can abide by it, so we need to tell them what our intent was. We need to broadcast it. Good code comments are an example of this. They give you a narrative of not only what’s going on, but why. When we write software, we write it for future generations, and that’s the most underestimated thing in all of software. Documenting intent also allows you to document your values, and that allows the people who come after you to continue to uphold them.

Storytelling in software is important. Without it, software archeology is simply the study of puzzles created by time and neglect. When an organization doesn’t record its history, it’s bound to repeat the same mistakes. A company’s memory is comprised of its people, but the fact is people churn. Knowing how you got here often helps you with getting to where you want to be. Storytelling is how we transcend generational gaps and the inevitable changing of the old guard to the new guard in a maturing engineering organization. The same is true when we expand that to the entire industry. We’re too memoryless—shipping code and not looking back, discovering everything old that is new again, and simply not appreciating our lineage.

FIFO, Exactly-Once, and Other Costs

There’s been a lot of discussion about exactly-once semantics lately, sparked by the recent announcement of support for it in Kafka 0.11. I’ve already written at length about strong guarantees in messaging.

My former coworker Kevin Sookocheff recently made a post about ordered and exactly-once message delivery as it relates to Amazon SQS. It does a good job of illustrating what the trade-offs are, and I want to drive home some points.

In the article, Kevin shows how FIFO delivery is really only meaningful when you have one single-threaded publisher and one single-threaded receiver. Amazon’s FIFO queues allow you to control how restrictive this requirement is by applying ordering on a per-group basis. In other words, we can improve throughput if we can partition work into different ordered groups rather than a single totally ordered group. However, FIFO still effectively limits throughput on a group to a single publisher and single subscriber. If there are multiple publishers, they have to coordinate to ensure ordering is preserved with respect to our application’s semantics. On the subscriber side, things are simpler because SQS will only deliver messages in a group one at a time in order amongst subscribers.

Amazon’s FIFO queues also have an exactly-once processing feature which deduplicates messages within a five-minute window. Note, however, that there are some caveats with this, the obvious one being duplicate delivery outside of the five-minute window. A mission-critical system would have to be designed to account for this possibility. My argument here is if you still have to account for it, what’s the point unless the cost of detecting duplicates is prohibitively expensive? But to be fair, five minutes probably reduces the likelihood enough to the point that it’s useful and in those rare cases where it fails, the duplicate is acceptable.

The more interesting caveat is that FIFO queues do not guarantee exactly-once delivery to consumers (which, as we know, is impossible). Rather, they offer exactly-once processing by guaranteeing that once a message has successfully been acknowledged as processed, it won’t be delivered again. It’s up to applications to ack appropriately. When a message is delivered to a consumer, it remains in the queue until it’s acked. The visibility timeout prevents other consumers from processing it. With FIFO queues, this also means head-of-line blocking for other messages in the same group.

Now, let’s assume a subscriber receives a batch of messages from the queue, processes them—perhaps by storing some results to a database—and then sends an acknowledgement back to SQS which removes them from the queue. It’s entirely possible that during that process step a delay happens—a prolonged GC pause, crash, network delay, whatever. When this happens, the visibility timeout expires and the messages are redelivered and, potentially, reprocessed. What has to happen here is essentially cooperation between the queue and processing step. We might do this by using a database transaction to atomically process and acknowledge the messages. An alternative, yet similar, approach might be to use a write-ahead-log-like strategy whereby the consuming system reads messages from SQS and transactionally stores them in a database for future processing. Once the messages have been committed, the consumer deletes the messages from SQS. In either of these approaches, we’re basically shifting the onus of exactly-once processing onto an ACID-compliant relational database.

Note that this is really how Kafka achieves its exactly-once semantics. It requires end-to-end cooperation for exactly-once to work. State changes in your application need to be committed transactionally with your Kafka offsets.

As Kevin points out, FIFO SQS queues offer exactly-once processing only if 1) publishers never publish duplicate messages wider than five minutes apart and 2) consumers never fail to delete messages they have processed from the queue. Solving either of these problems probably requires some kind of coordination between the application and queue, likely in the form of a database transaction. And if you’re using a database either as the message source, sink, or both, what are exactly-once FIFO queues actually buying you? You’re paying a seemingly large cost in throughput for little perceived value. Your messages are already going through some sort of transactional boundary that provides ordering and uniqueness.

Where I see FIFO and exactly-once semantics being useful is when talking to systems which cannot cooperate with the end-to-end transaction. This might be a legacy service or a system with side effects, such as sending an email. Often in the case of these “distributed workflows”, latency is a lower priority and humans can be involved in various steps. Other use cases might be scheduled integrations with legacy batch processes where throughput is known a priori. These can simply be re-run when errors occur.

When people describe a messaging system with FIFO and exactly-once semantics, they’re usually providing a poor description of a relational database with support for ACID transactions. Providing these semantics in a messaging system likely still involves database transactions, it’s just more complicated. It turns out relational databases are really good at ensuring invariants like exactly-once.

I’ve picked on Kafka a bit in the past, especially with the exactly-once announcement, but my issue is not with Kafka itself. Kafka is a fantastic technology. It’s well-architected, battle-tested, and the team behind it is talented and knows the space well. My issue is more with some of the intangible costs associated with it. The same goes for similar systems (like exactly-once FIFO SQS queues). Beyond just the operational complexity (which Confluent is attempting to tackle with its Kafka-as-a-service), you have to get developer understanding. This is harder than it sounds in any modestly-sized organization. That’s not to say that developers are dumb or incapable of understanding, but the fact is your average developer is simply not thinking about all of the edge cases brought on by operating distributed systems at scale. They see “exactly-once FIFO queues” in SQS or “exactly-once delivery” in Kafka and take it at face value. They don’t read beyond the headline. They don’t look for the caveats. That’s why I took issue with how Kafka claimed to do the impossible with exactly-once delivery when it’s really exactly-once processing or, as I’ve come to call it, “atomic processing.” Henry Robinson put it best when talking about the Kafka announcement:

If I were to rewrite the article, I’d structure it thus: “exactly-once looks like atomic broadcast. Atomic broadcast is impossible. Here’s how exactly-once might fail, and here’s why we think you shouldn’t be worried about it.” That’s a harder argument for users to swallow…

Basically “exactly-once” markets better. It’s something developers can latch onto, but it’s also misleading. I know it’s only a matter of time before people start linking me to the Confluent post saying, “see, exactly-once is easy!” But this is just pain deferral. On the contrary, exactly-once semantics require careful construction of your application, assume a closed, transactional world, and do not support the case where I think people want exactly-once the most: side effects.

Interestingly, one of my chief concerns about Kafka’s implementation was what the difficulty of ensuring end-to-end cooperation would be in practice. Side effects into downstream systems with no support for idempotency or transactions could make it difficult. Jay’s counterpoint to this was that the majority of users are using good old-fashioned relational databases, so all you really need to do is commit your offsets and state changes together. It’s not trivial, but it’s not that much harder than avoiding partial updates on failure if you’re updating multiple tables. This brings us back to two of the original points of contention: why not merely use the database for exactly-once in the first place and what about legacy systems?

That’s not to say exactly-once semantics, as offered in systems like SQS and Kafka, are not useful. I think we just need to be more conscientious of the other costs and encourage developers to more deeply understand the solution space—too much sprinkling on of Kafka or exactly-once or FIFO and not enough thinking about the actual business problem. Too much prescribing of solutions and not enough describing of problems.

My thanks to Kevin Sookocheff and Beau Lyddon for reviewing this post.

Are We There Yet: The Go Generics Debate

At GopherCon a couple weeks ago, Russ Cox gave a talk titled The Future of Go, in which he discussed what the Go community might want to change about the language—particularly for the so-called Go 2.0 milestone—and the process for realizing those changes. Part of that process is identifying real-world use cases through experience reports, which turn an abstract problem into a concrete one and help the core team to understand its significance. Also mentioned in the talk, of course, were generics. Over the weekend, Dave Cheney posted Should Go 2.0 support generics? Allow me to add to the noise.

First, I agree with Dave’s point in that the Go team has two choices: implement templated types and parameterized functions (or some equivalent thereof) or don’t and own that decision. Though I think his example of Haskell programmers owning their differences with imperative programming is a bit of a false equivalence. Haskell is firmly in the functional camp, while Go is firmly in the imperative. While there is overlap between the two, I think most programmers understand the difference. Few people are asking Haskell to be more PHP-like, but I bet a lot more are asking Elm to be more Haskell-like. Likewise, lots of people are asking Go to be more Java-like—if only a little. This is why so many are asking for generics, and Dave says it himself: “Mainstream programmers expect some form of templated types because they’re used to it in the other languages they interact with alongside Go.”

But I digress. The point is that the Go team should not pay any lip service to the generics discussion if they are not going to fully commit to addressing the problem. There is plenty of type theory and prior art. It’s not a technical problem, it’s a philosophical one.

That said, if the Go team decides to say “no” to generics and own that decision, I suspect they will never fully put an end to the discussion. The similarities to other mainstream languages are too close, unlike the PHP/Haskell example, and the case for generics too compelling, unlike the composition-over-inheritance decision. The lack of generics in Go has already become a meme at this point.

Ian Lance Taylor’s recent post shows there is a divide within the Go team regarding generics. He tells the story of how the copy() and append() functions came about, noting that they were added only because of the lack of generics.

The point I want to make here is that because we had no way to write a generic Vector type with an Append method, we wound up adding a special purpose language feature to implement it. A language that supported parameterized types with methods would not have required a special built-in function that only works with slices. An append operation makes sense for other sorts of data structures, such as various kinds of linked lists. The built-in append function can not be used for them.

My biggest complaint on the matter at this point in the language’s life is the doublethink. That is, the mere existence of channels, maps, and slices seems like a contradiction to the argument against generics. The experience reports supporting generics are trivial: every time someone instantiates one of these things. The argument is that having a small number of built-in generic types is substantially different than allowing them to be user-defined. Allowing generic APIs to be strewn about will increase the cognitive load on developers. Yet, at the same time, we’re okay with adding new APIs to the standard library that do not provide compile-time type safety by effectively subverting the type system through the use of interface{}. By using interface{}, we instead push the responsibility of type safety onto users to do it at runtime in precarious fashion for something the compiler could be well-equipped to handle. The argument here is what is the greater cognitive load? More generally, it’s where should the responsibility lie? Considering Go’s lineage and its aim at the “ordinary” programmer, I’d argue safety—in this case—should be the language’s responsibility.

However, we haven’t fully addressed the “complex generic APIs strewn about” argument. Generics are not the source of complexity in API design, poorly designed APIs are. For every bad generic API in Java, I’ll show you a good one. In Go, I would argue more complexity comes from the workarounds to the lack of generics—interface{}, reflection, code duplication, code generation, type assertions—than from introducing them into the language, not to mention the performance cost with some of these workarounds. The heap and list packages are some of my least favorite packages in the language, largely due to their use of interface{}.

Another frustration I have with the argument against generics is the anecdotal evidence—”I’ve never experienced a need for generics, so anyone who has must be wrong.” It’s a weird rationalization. It’s like a C programmer arguing against garbage collection to build a web app because free() works fine. Someone pointed out to me what’s actually going on is the Blub paradox, which Paul Graham coined.

Graham considers the hierarchy of programming languages with the example of “Blub”, a hypothetically average language “right in the middle of the abstractness continuum. It is not the most powerful language, but it is more powerful than Cobol or machine language.” It was used by Graham to illustrate a comparison, beyond Turing completeness, of programming language power, and more specifically to illustrate the difficulty of comparing a programming language one knows to one that one does not.

Graham considers a hypothetical Blub programmer. When the programmer looks down the “power continuum”, he considers the lower languages to be less powerful because they miss some feature that a Blub programmer is used to. But when he looks up, he fails to realise that he is looking up: he merely sees “weird languages” with unnecessary features and assumes they are equivalent in power, but with “other hairy stuff thrown in as well”. When Graham considers the point of view of a programmer using a language higher than Blub, he describes that programmer as looking down on Blub and noting its “missing” features from the point of view of the higher language.

Graham describes this as the “Blub paradox” and concludes that “By induction, the only programmers in a position to see all the differences in power between the various languages are those who understand the most powerful one.”

I sympathize with the Go team’s desire to keep the overall surface area of the language small and the complexity low, but I have a hard time reconciling this with the existing built-in generics and continued use of interface{} in the standard library. At the very least, I think Go would be better off with continued use of built-in generics for certain types instead of adding more APIs using interface{} like sync.Map, sync.Pool, and atomic.Value. Certainly, I think the debate is worth having though, but the hero-worship and genetic-fallacy type arguments do not further the discussion. My gut feeling is that Go will eventually have generics. It’s just a question of when and how.

You Cannot Have Exactly-Once Delivery Redux

A couple years ago I wrote You Cannot Have Exactly-Once Delivery. It stirred up quite a bit of discussion and was even referenced in a book, which I found rather surprising considering I’m not exactly an academic. Recently, the topic of exactly-once delivery has again become a popular point of discussion, particularly with the release of Kafka 0.11, which introduces support for idempotent producers, transactional writes across multiple partitions, and—wait for it—exactly-once semantics.

Naturally, when this hit Hacker News, I received a lot of messages from people asking me, “what gives?” There’s literally a TechCrunch headline titled, Confluent achieves holy grail of “exactly once” delivery on Kafka messaging service (Jay assures me, they don’t write the headlines). The myth has been disproved!

First, let me say what Confluent has accomplished with Kafka is an impressive achievement and one worth celebrating. They made a monumental effort to implement these semantics, and it paid off. The intention of this post is not to minimize any of that work but to try to clarify a few key points and hopefully cut down on some of the misinformation and noise.

“Exactly-once delivery” is a poor term. The word “delivery” is overloaded. Frankly, I think it’s a marketing word. The better term is “exactly-once processing.” Some call the distinction pedantic, but I think it’s important and there is some nuance. Kafka did not solve the Two Generals Problem. Exactly-once delivery, at the transport level, is impossible. It doesn’t exist in any meaningful way and isn’t all that interesting to talk about. “We have a word for infinite packet delay—outage,” as Jay puts it. That’s why TCP exists, but TCP doesn’t care about your application semantics. And in the end, that’s what’s interesting—application semantics. My problem with “exactly-once delivery” is it assumes too much, which causes a lot of folks to make bad assumptions. “Delivery” is a transport semantic. “Processing” is an application semantic.

All is not lost, however. We can still get correct results by having our application cooperate with the processing pipeline. This is essentially what Kafka does, exactly-once processing, and Confluent makes note of that in the blog post towards the end. What does this mean?

To achieve exactly-once processing semantics, we must have a closed system with end-to-end support for modeling input, output, and processor state as a single, atomic operation. Kafka supports this by providing a new transaction API and idempotent producers. Any state changes in your application need to be made atomically in conjunction with Kafka. You must commit your state changes and offsets together. It requires architecting your application in a specific way. State changes in external systems must be part of the Kafka transaction. Confluent’s goal is to make this as easy as possible by providing the platform around Kafka with its streams and connector APIs. The point here is it’s not just a switch you flip and, magically, messages are delivered exactly once. It requires careful construction, application logic coordination, isolating state change and non-determinism, and maintaining a closed system around Kafka. Applications that use the consumer API still have to do this themselves. As Neha puts it in the post, it’s not “magical pixie dust.” This is the most important part of the post and, if it were up to me, would be at the very top.

Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.

Side effects into downstream systems with no support for idempotency or distributed transactions make this really difficult in practice I suspect. The argument is that most people are using relational databases that support transactions, but I think there’s still a reasonably large, non-obvious assumption here. Making your event processing atomic might not be easy in all cases. Moreover, every part in your system needs to participate to ensure end-to-end, exactly-once semantics.

Several other messaging systems like TIBCO EMS and Azure Service Bus have provided similar transactional processing guarantees. Kafka, as I understand it, attempts to make it easier and with less performance overhead. That’s a great accomplishment.

What’s really worth drawing attention to is the effort made by Confluent to deliver a correct solution. Achieving exactly-once processing, in and of itself, is relatively “easy” (I use that word loosely). What’s hard is dealing with the range of failures. The announcement shows they’ve done extensive testing, likely much more than most other systems, and have shown that it works and with minimal performance impact.

Kafka provides exactly-once processing semantics because it’s a closed system. There is still a lot of difficulty in ensuring those semantics are maintained across external services, but Confluent attempts to ameliorate this through APIs and tooling. But that’s just it: it’s not exactly-once semantics in a building block that’s the hard thing, it’s building loosely coupled systems that agree on the state of the world. Nevertheless, there is no holy grail here, just some good ole’ fashioned hard work.

Special thanks to Jay Kreps and Sean T. Allen for their feedback on an early draft of this post. Any inaccuracies or opinions are mine alone.

Smart Endpoints, Dumb Pipes

I read an interesting article recently called How do you cut a monolith in half? There are a lot of thoughts in the article that resonate with me and some that I disagree with, prompting this response.

The overall message of the article is don’t use a message broker to break apart a monolith because it’s like a cross between a load balancer and a database, with the disadvantages of both and the advantages of neither. The author argues that message brokers are a popular way to pull apart components over a network because they have low setup cost and provide easy service discovery, but they come at a high operational cost. My response to that is the same advice the author puts forward: it depends.

I think it’s important not to conflate “message broker” and “message queue.” The article uses them interchangeably, but it’s really talking about the latter, which I see as a subset of the former. Queues provide, well, queuing semantics. They try to ensure delivery of a message or, more generally, distribution of work. As the author puts it: “In practice, a message broker is a service that transforms network errors and machine failures into filled disks.” Replace “broker” with “queue” and I agree with this statement. This is really describing systems like RabbitMQ, Amazon SQS, TIBCO EMS, IronMQ, and maybe even Kafka fits into that category.

People are easily seduced by “fat” middleware—systems with more features, more capabilities, more responsibilities—because they think it makes their lives easier, and it might at first. Pushing off more responsibility onto the infrastructure makes the application simpler, but it also makes the infrastructure more complex, more fragile, and more slow. Take exactly-once message delivery, for example. Lots of people want it, but the pursuit of it introduces a host of complexity, overhead (in terms of development, operations, and performance), and risk. The end result is something that, in addition to these things, requires all downstream systems to not introduce duplicates and be mindful about their side effects. That is, everything in the processing pipeline must be exactly-once or nothing is. So typically what you end up with is an approximation of exactly-once delivery. You make big investments to lower the likelihood of duplicates, but you still have to deal with the problem. This might make sense if the cost of having duplicates is high, but that doesn’t seem like the common case. My advice is to always opt for the simple solution. We tend to think of engineering challenges as technical problems when, in reality, they’re often just mindset problems. Usually the technical problems have already been solved if we can just adjust our mindset.

There are a couple things to keep in mind here. The first thing to consider is simply capability lock-in. As you push more and more logic off onto more and more specialized middleware, you make it harder to move off it or change things. The second is what we already hinted at. Even with smart middleware, problems still leak out and you have to handle them at the edge—you’re now being taxed twice. This is essentially the end-to-end argument. Push responsibility to the edges, smart endpoints, dumb pipes, etc. It’s the idea that if you need business-level guarantees, build them into the business layer because the infrastructure doesn’t care about them.

The article suggests for short-lived tasks, use a load balancer because with a queue, you’ll end up building a load balancer along with an ad-hoc RPC system, with extra latency. For long-lived tasks, use a database because with a queue, you’ll be building a lock manager, a database, and a scheduler.

A persistent message queue is not bad in itself, but relying on it for recovery, and by extension, correct behaviour, is fraught with peril.

So why the distinction between message brokers and message queues? The point is not all message brokers need to be large, complicated pieces of infrastructure like most message queues tend to be. This was the reason I gravitated towards NATS while architecting Workiva’s messaging platform and why last month I joined Apcera to work on NATS full time.

When Derek Collison originally wrote NATS it was largely for the reasons stated in the article and for the reasonstalk about frequently. It was out of frustration with the current state of the art. In my opinion, NATS was the first system in the space that really turned the way we did messaging on its head (outside of maybe ZeroMQ). It didn’t provide any strong delivery guarantees, transactions, message persistence, or other responsibilities usually assumed by message brokers (there is a layer that provides some of these things, but it’s not baked into the core technology). Instead, NATS prioritized availability, simplicity, and performance over everything else. A simple technology in a vast sea of complexity (my marketing game is strong).

NATS is no-frills pub/sub. It solves the problem of service discovery and work assignment, assumes no other responsibilities, and gets out of your way. It’s designed to be easy to use, easy to operate, and add minimal latency even at scale so that, unlike many other brokers, it is a good way to integrate your microservices. What makes NATS interesting is what it doesn’t do and what it gains by not doing them. Simplicity is a feature—the ultimate sophistication, according to da Vinci. I call it looking at the negative space.

The article reads:

A protocol is the rules and expectations of participants in a system, and how they are beholden to each other. A protocol defines who takes responsibility for failure.

The problem with message brokers, and queues, is that no-one does.

NATS plays to the strengths of the end-to-end principle. It’s a dumb pipe. Handle failures and retries at the client and NATS will do everything it can to remain available and fast. Don’t rely on fragile guarantees or semantics. Instead, face complexity head-on. The author states what you really want is request/reply, which is one point I disagree on. RPC is a bad abstraction for building distributed systems. Use simple, versatile primitives and embrace asynchrony and messaging.

So yes, be careful about relying on message brokers. How smart should the pipes really be? More to the point, be careful about relying on strong semantics because experience shows few things are guaranteed when working with distributed systems at scale. Err to the side of simple. Make few assumptions of your middleware. Push work out of your infrastructure and to the edges if you care about performance and scalability because nothing is harder to scale (or operate) than slow infrastructure that tries to do too much.

Engineering Empathy

This was a talk I gave at an internal R&D conference my last week at Workiva. I got a lot of positive feedback on the talk, so I figured I would share it with a wider audience. Be warned: it’s long. Feel free to read each section separately, though they largely tie together.

Why do you work where you work? For many in tech, the answer is probably culture. When you tell a friend about your job, the culture is probably the first thing you describe. It’s culture that can be a company’s biggest asset—and its biggest downfall. But what is it?

Culture isn’t a list of values or a mission statement. It’s not a casual dress code or a beer fridge. Culture is what you reward and what you don’t. More importantly, it’s what you reward and what you punish. That’s an important distinction to make because when you don’t punish behavior that’s inconsistent with your culture, you send a message: you don’t care about it.

So culture is what you live day in and day out. It’s not what you say, it’s what you do. Put yourself in the shoes of a new hire at your company. A new hire doesn’t walk in automatically knowing your culture. They walk in—filled with anxiety—hoping for success, fearing failure, and they look around. They observe their environment. They see who is succeeding, and they try to emulate that behavior. They see who is failing, and they try to avoid that behavior. They ask the question: what makes someone successful here?

Jim Rohn was credited with saying we are the average of the five people we spend the most time with, and I think there’s a lot of truth to this idea. The people around you shape who you are. They shape your behavior, your habits, your thoughts, your opinions, your worldview. Culture is really a feedback loop, and we all contribute to it. It’s not mandated or dictated down (or up through grassroots programs), it’s emergent, but we have to be deliberate about our behaviors because that’s what shapes our culture (note that this doesn’t mean culture doesn’t start with leadership—it most certainly does).

In my opinion, a strong engineering culture is comprised of three parts: the right people, the right processes, and the right priorities. The right people means people that align with and protect your values. Processes are how you execute—how you communicate, develop, deliver, etc. And priorities are your values—the skills or behaviors you value in fellow employees. It’s also your vision. These help you to make decisions—what gets done today and what goes to the bottom of the list. For example, many companies claim a customer-first principle, but how many of them actually use it to drive their day-to-day decisions? This is the difference between a list of values and a culture.

What about technology? Experience? Customer relationships? These are all important competitive advantages, but I think they largely emerge from having the right people, processes, and priorities. The three are deeply intertwined. A culture is the unique combination of processes and values within an organization, and it’s those processes and values that enable you to replicate your success. I’m a big fan of Reed Hastings’ Netflix culture slide deck, but there are some things with which I fundamentally disagree. Hastings says, “The more talent density you have the less process you need. The more process you create the less talent you retain.” This is wrong on a number of levels, which I will talk about later.

Empathy is the common thread throughout each of these three areas—the people, the processes, and the priorities—and we’ll see how it applies to each.

The Ultimate Complex System: People

We largely think about software development as a purely technical feat, one which requires skill and creativity and ingenuity. I think for many, it’s why we became engineers in the first place. We like solving problems. But when all is said and done, computers do what you tell them to do. Computers don’t have opinions or biases or agendas or egos. The technical challenges are really a small part of any sufficiently complicated piece of software. Having been an individual contributor, tech lead, and manager, I’ve come to the realization that it’s people that is the ultimate complex system.

There’s a quote from The Five Dysfunctions of a Team which I’ve referenced before on this blog that I think captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” Teamwork is powerful is intuitive, but teamwork is rare is more profound. Teamwork is a competitive advantage because it’s rare—that’s a pretty strong statement. Let’s unpack it a bit.

It’s the difference between technical architecture and social architecture. We tend to focus on the former while neglecting the latter, but software engineering is more about collaboration than code. It wasn’t until I became a manager that I realized good managers are force multipliers, but social architecture is everyone’s responsibility. Remember, culture is a feedback loop which we all contribute to, so everyone is a social architect.

A key part of social architecture is communication empathy. Back in the 90’s, an evolutionary psychologist by the name of Robin Dunbar proposed the idea that humans can only maintain about 150 stable social relationships. The limit is referred to as Dunbar’s number. He drew a correlation between primate brain size and the average size of cohesive social groups. Informally, Dunbar describes this as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” This number includes all of the relationships in your life, both personal and professional, past and present.

What’s interesting about Dunbar’s number is how it applies to our jobs as software engineers and the interplay with cognitive biases we have as humans. When you don’t understand what someone else does, you’re automatically biased against them. In your head, you’re king. You understand exactly what you do, what your job is, the value that you add, but outside your head—outside your Dunbar’s number—those people are all a mystery. And humans have this funny tendency to mock what they don’t understand. We have lots of these cognitive biases.

Building those types of stable relationships is really hard when you can’t just walk over to someone’s desk and talk to them face-to-face. This was something we struggled a lot with at Workiva, being a company of a few hundred engineers spread out across 11 or so offices. Not only is split-brain an inevitability, but just a general lack of rapport. That stage is super hard to go through for a lot of companies—going from dozens of engineers to hundreds in a few short years. The cruel irony is no matter how agile a company claims to be, the culture—of which structure and processes are a part of—is usually the slowest to adjust. No longer is decision making done by standing up and telling a roomful of people, “I’m going to do this. Please tell me now if that’s a bad idea.” And no longer are decisions often made unanimously and rapidly.

Building stable relationships is much harder without the random hallway error correction. The right people aren’t always bumping into each other at the right time, whereas before a lot of decisions could be made just-in-time. Instead, communication has to be more deliberate and no longer organic. Decisions are made by the Jeff Bezos philosophy of disagree and commit. But nevertheless, building empathy is hard without face-to-face communication, and you miss out on a lot of the nuance of communication.

Again, communication is highly nuanced, and nuance is hard to convey over HipChat. The role of emotions plays a big part. Imagine the situation where you’re walking down a sidewalk or a hallway and someone accidentally walks in front of you. You might sidestep or do a little dance to get around each other, smile or nod, and get on with your day. This minor inconvenience becomes almost this tiny, pleasant interaction between two people. Now, take the same scenario but between two drivers, and you probably have some kind of road rage type situation. The only real difference is the steel cage surrounding the drivers blocking out the verbal and non-verbal communication. How many times has a HipChat conversation gone completely off the rails only to be resolved by a quick Google Hangout or in-person conversation? It’s the exact same thing.

One last note with respect to cognitive biases: once again, humans have a funny tendency where, in a vacuum of information, people will create their own. We will manufacture information just to fill the void, and often it’s not just creating information but taking information we’ve heard somewhere else and applying it to our own—and often the wrong—context. The extreme example of this is “We heard microservices worked well for Netflix, so we should use them at our growing startup” or “Google does monorepo, so we should too.” You’re ignoring all of the context—the path those companies followed to reach those points, the trade-offs made, the organizational differences and competencies, the pain points. When the cognitive biases and opinions we have as humans are added in, the problems amplify and compound. You get a frustrating game of telephone.

You can’t kill The Grapevine anymore than you can change human nature, so you have to address it head-on where and when you can. Don’t allow the vacuum of information to form in the first place, and be cognizant of when you’re applying information you’ve heard from someone else. Is the problem the same? Do you have the same amount of information? Are your constraints the same? What is the full context?

I tend to look at communication through two different lenses: push and pull. In order to be an empathetic communicator, I think it’s important to look at these while thinking about the cognitive biases discussed earlier, starting with push.

I’ve been in meetings where someone called out someone else as a “blocker”, and there was visible wincing in the room. I think for some, the word probably triggers some sort of PTSD. When you’re depending on something that’s outside of your direct control to get something else done, it’s hard not to drop the occasional B-word. It happens out of frustration. It happens because you’re just trying to ship. It also happens because you don’t want to look bad. And it might seem innocuous in the moment, but it has impact. By invoking it, you are tempting fate.

There’s a really good book that was written way back in 1944 called The Unwritten Laws of Engineering. It was published by the American Society of Mechanical Engineers and the language in the book is quite dated (parts of it read like a crotchety old guy yelling at kids to get off his lawn), but the ideas in the book still apply very much today and even to software engineering. It’s really about people, and fundamentally, people don’t change. One of the ideas in the book is this notion which I call communication impact. I’ve taken a quote from the book which I think highlights this idea (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

I see this a lot. Not just in emails (née “memos”) but in meetings or reviews, someone will—inadvertently or not—throw someone else under the proverbial bus, i.e. “broadcasting memorandum containing damaging or embarrassing statements”, “stepping too heavily upon someone’s toes”, or “revealing a serious shortcoming on somebody’s part.” The problem with this is, by doing it, you immediately put the other party on the defensive and also create a cognitive bias for everyone else in the room. You create a negative predisposition, which may or may not be warranted, toward them. Similarly, I liken “if it concerns manufacturing or customer difficulties” to production postmortems. This is why they need to be blameless. Why is it that retros on production issues are blameless while, at the same time, the development process is full of blame-assigning? It might seem innocuous, but your communication has impact. Push with respect and under the assumption the other person is probably doing the right thing. Don’t be willing to throw anyone under that bus. Likewise, be quick to take responsibility but slow to assign it. Don’t be willing to practice Cover Your Ass Engineering.

I’ve been in meetings where someone would get called out as a blocker, literally articulated in that way, and the person wasn’t even aware they were blocking anything. I’ve seen people create JIRA tickets on another team’s board and then immediately call them blockers. It’s important to call out dependencies ahead of time, and when someone is “blocking” your progress, speak to them about it individually and before it reaches a critical point. No one should be getting caught off guard by these things. Be careful about how and where you articulate these types of problems.

On the same topic of communication impact, I’ve seen engineers develop detailed and extravagant plans like “We’re going to move the entire company to a monorepo while simultaneously switching from Git to Mercurial” or “We’re going to build our own stream-processing framework from the ground up”, and then distribute them widely to the organization (“wide distribution” as referenced in The Unwritten Laws of Engineering passage above). The proposals are usually well-intentioned and maybe even compelling sometimes, but it’s the way in which they are communicated that is problematic. Recall The Grapevine: people see it, assume it’s reality, and then spread misinformation. “Did you know the company is switching to Mercurial?”

An effective way to build rapport between teams is genuinely celebrating the successes of other teams, even the small ones. I think for many organizations, it’s common to celebrate victories within a team—happy hour for shipping a new feature or a team outing for signing a major account—but celebrating another team’s win is more rare, especially when a company grows in size. The operative word is “genuine” though. Don’t just do it for the sake of doing it, be genuine about it. This is a compelling way to build the stable relationships needed to unlock the rarity of teamwork described earlier.

Equally important to understanding communication impact is understanding decision impact. I’ve already written about this, so I’ll keep it brief: your decisions impact others. How does adopting X affect Operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? Doing something might save you time, but does it create work or slow others down?

Teams operate in a way that minimizes the amount of pain they feel. It’s a natural instinct and a phenomenon I call pain displacement. Pain-driven development is following the path of least resistance. By doing this, we end up moving the pain somewhere else or deferring it until later (i.e. tech debt). Where the problems start to happen is when multiple teams or functions are involved. This is when the political and other organizational issues start to seep in. Patrick Lencioni, author of The Five Dysfunctions of a Team, has a book that touches on this subject called Silos, Politics, and Turf Wars. 

I believe the solution is multifaceted. First, teams need to think holistically, widening their vision beyond the deliverable immediately in front of them. They need to have a sense of organizational awareness. Second, teams—and especially leaders—need to be able to take off their job’s “hat” periodically in order to solve a shared problem. Lencioni observes that much of what causes organizational dysfunction is siloing, and this typically stems from strong intra-team loyalties. For example, within an engineering organization you might have development, operations, QA, security, and other functional teams. Empathy is being able to look at something through someone else’s perspective, and this requires removing your functional hat from time to time. Lastly, teams need to be able to rally around a common cause. This is a shared, compelling vision that motivates and mobilizes people and helps break down the silos. A shared vision aligns teams and enables them to work more autonomously. This is how decisions get made.

Pull communication is pretty much just how to ask questions without making people hate you, a skill that is very important to be an effective and empathetic communicator.

The single most common communication issue I see in engineering organizations is The XY Problem. It’s when someone focuses on a particular solution to their problem instead of describing the problem itself.

  • User wants to do X.
  • User doesn’t know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
  • User doesn’t know how to do Y either.
  • User asks for help with Y.
  • Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
  • After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn’t even a suitable solution for X.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full. The solution to The XY Problem is simple: always provide the full context of what you’re trying to do. Describe the problem, don’t just prescribe the solution.

Part of being an effective communicator is being able to extract information from people and getting help without being a mental and emotional drain. This is especially true when it comes to debugging. I often see this “murder-mystery debugging” where someone basically tries to push off the blame for something that’s wrong with their code onto someone or something else. This flies in the face of the principle discussed earlier—be quick to take responsibility and slow to assign it. The first step when it comes to debugging anything is assume it’s your fault by default. When you run some code you’re writing and the compiler complains, you don’t blame the compiler, you assume you screwed up. It’s just taking this same mindset and applying it to everything else that we do.

And when you do need to seek help from others—just like with The XY Problem—provide as much context as possible. So much of what I see is this sort of information trickle, where the person seeking help drips information to the people trying to provide it. Don’t make it an interrogation. Lastly, provide a minimal working example that reproduces the problem. Don’t make people build a massive project with 20 dependencies just to reproduce your bug. It’s such a common problem for Stack Overflow that they actually have a name for it: MCVE—Minimal, Complete, Verifiable Example. Do your due diligence before taking time out of someone else’s day because the only thing worse than a bug report is a poorly described, hastily written accusation.

Another thing I see often is swoop-and-poop engineering. This is when someone comes to you with something they need help with—maybe a bug in a library you own, a feature request, something along these lines (this is especially true in open source). They have a sense of urgency; they say it either explicitly or just give off that vibe. You offer to setup a meeting to get more information or work through the problem with them only to find they aren’t available or willing to set aside some time with you. They’re heads down on “something more important,” yet their manager is ready to bite your head off weeks later, often without any documentation or warning. They’ve effectively dumped this on you, said the world’s on fire, and left as quickly as they came. You’re left confused and disoriented. You scratch your head and forget about it, then days or weeks later, they return, horrified that the world is still burning. I call these drive-by questions.

First, it’s important to have an appropriate sense of urgency. If you’re not willing to hop on a Hangout to work through a problem or provide additional information, it’s probably not that important, especially if you can’t even take the time to follow up. With few exceptions, it’s not fair to expect a team to drop everything they’re doing to help you at a moment’s notice, but if they do, you need to meet them halfway. It’s essential to realize that if you’re piling onto a team, others probably are too. If you submit a ticket with another team and then turn around and immediately call it a blocker, that just means you failed to plan accordingly. Having empathy is being cognizant that every team has its own set of priorities, commitments, and work that it’s juggling. By creating that ticket and calling it a blocker, you’re basically saying none of that stuff matters as much. Empathy is understanding that shit rolls downhill. For those who find themselves facing drive-by questions: document everything and be proactive about communicating.

There’s a really good essay by Eric Raymond called How To Ask Questions The Smart Way. It’s something I think every engineer should read and take to heart. My number one pet peeve is Help Vampires. These are people who refuse to take the time to ask coherent, specific questions and really aren’t interested in having their questions answered so much as getting someone else to do their work. They ask the same, tired questions over and over again without really retaining information or thinking critically. It’s question, answer, question, answer, question, answer, ad infinitum.

This is often a hard-earned lesson for junior engineers, but it’s an important one: when you ask a question, you’re not entitled to an answer, you earn the answer. Hasty sounding questions get hasty answers. As engineers, we should not operate like a tech support hotline that Grandma calls when her internet stops working. We need to put in a higher level of effort. We need to apply our technical and problem-solving aptitude as engineers. This is the only way you can scale this kind of support structure within an engineering organization. If teams are just constantly bombarding each other with low-effort questions, nothing will get done and people will get burnt out.

Avoid being a Help Vampire. Before asking a question, do your due diligence. Think carefully about where to ask your question. If it’s on HipChat, what is the appropriate room in which to ask? Also be mindful of doing things like @all or @here in a large room. Doing that is like walking into a crowded room, throwing your hands up in the air, and shouting at everyone to look at you. Be precise and informative about your problem, but also keep in mind that volume is not precision. Just dumping a bunch of log messages is noise. Don’t rush to claim that you’ve found a bug. As a first step, take responsibility. And just like with The XY Problem, describe the goal, not the step you took—describe vs. prescribe. Lastly, follow up on the solution. Everyone has been in this situation: you’ve found someone that asked the exact same question as you only to find they never followed up with how they fixed it. Even if it’s just in HipChat or Slack, drop a note indicating the issue was resolved and what the fix was so others can find it. This also helps close the loop when you’ve asked a question to a team and they are actively investigating it. Don’t leave them hanging.

In many ways, being an empathetic communicator just comes down to having self-awareness.

Codifying Values and Priorities: Processes

“Process” has a lot of negative connotations associated with it because it usually becomes this thing done on ceremony. But “process” should be a means of documenting and codifying your values. This is why I disagree with the Reed Hastings quote about process from earlier. Process is about repeatability and error correction. Camille Fournier’s new book on engineering management, The Manager’s Path, has a great section on “bootstrapping culture.” I particularly like the way she frames organizational structure and process:

When talking about structure and process with skeptics, I try to reframe the discussion. Instead of talking about structure, I talk about learning. Instead of talking about process, I talk about transparency. We don’t set up systems because structure and process have inherent value. We do it because we want to learn from our successes and our mistakes, and to share those successes and encode the lessons we learn from failures in a transparent way. This learning and sharing is how organizations become more stable and more scalable over time.

When a process “feels” wrong, it’s probably because it doesn’t reflect your organization’s values. For example, if a process feels heavy, it’s because you value velocity. If a process feels rigid, it’s because you value agility. If a process feels risky, it’s because you value safety. We have a hard time articulating this so instead it becomes “process is bad.”

Somewhere along the line, someone decides to document how stuff gets done. Things get standardized. Tools get made. Processes get established. But process becomes dogma when it’s interpreted as documentation of how rather than an explanation of why. Processes should tell the story of an organization: here’s what we value, here’s why we value it, and here’s how we protect and scale those values. The story is constantly evolving, so processes should be flexible. They shouldn’t be set in stone.

Michael Lopp’s book Managing Humans also provides a useful perspective on culture:

It’s entirely possible that too much process or the wrong process is developed during this build-out, but when this inevitable debate occurs, it should not be about the process. It’s a debate about values. The first question isn’t, “Is this a good, bad, or efficient process?” The first question is, “How does this process reflect our values?”

Processes should be traceable back to values. Each process should have a value or set of values associated with it. Understanding the why helps to develop empathy. It’s the difference between “here’s how we do things” and “here’s why we do things.” It’s much harder to develop a sense of empathy with just the how.

What We Value: Priorities

As engineers, we need to be curious. We need to have a “let’s go see!” attitude. When someone comes to you with a question—and hopefully it’s a well-formulated question based on the earlier discussion—your first reaction should be, “let’s go see!” Use it as an opportunity for both of you to learn. Even if you know the answer, sometimes it’s better to show, not tell, and as the person asking the question, you should be eager to learn. This is the reason I love Julia Evans’ blog so much. It’s oozing with wonder, curiosity, and intrigue. Being an engineer should mean having an innate curiosity. It’s not throwing up your hands at the first sign of an API boundary and saying, “not my problem!” It’s a willingness to roll up your sleeves and dig in to a problem but also a capacity for knowing how and when to involve others. Figure out what you don’t know and push beyond it.

Be humble. There’s a book that was written in the 70’s called The Psychology of Computer Programming, and it’s interesting because it focuses on the human elements of software development rather than the purely technical ones that we normally think about. In the book, it presents The 10 Commandments of Egoless Programming, which I think contain a powerful set of guiding principles for software engineers:

  1. Understand and accept that you will make mistakes.
  2. You are not your code.
  3. No matter how much “karate” you know, someone else will always know more.
  4. Don’t rewrite code without consultation.
  5. Treat people who know less than you with respect, deference, and patience.
  6. The only constant in the world is change. Be open to it and accept it with a smile.
  7. The only true authority stems from knowledge, not from position.
  8. Fight for what you believe, but gracefully accept defeat.
  9. Don’t be “the coder in the corner.”
  10. Critique code instead of people—be kind to the coder, not to the code.

Part of being a humble engineer is giving away all the credit. This is especially true for leaders or managers. A manager I once had put it this way: “As a manager, you should never say ‘I’ during a review unless shit went wrong and you’re in the process of taking responsibility for it.” A good leader gives away all the credit and takes all of the blame.

Be engaged. Coding is actually a very small part of our job as software engineers. Our job is to be engaged with the organization. Engage with stakeholder meetings and reviews. Engage with cross-trainings and workshops. Engage with your company’s engineering blog. Engage with other teams. Engage with recruiting and company outreach through conferences or meetups. People dramatically underestimate the value of developing their network, both to their employer and to themselves. You don’t have to do all of these things, but my point is engineers get overly fixated on coding and deliverables. Code is just the byproduct. We’re not paid to write code, we’re paid to add value to the business, and a big part of that is being engaged with the organization.

And of course I’d be remiss not to talk about empathy. Empathy is having a deep understanding of what problems someone is trying to solve. John Allspaw puts it best:

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration into your own work.

Changing your perspective is a powerful way to deepen your relationships. Once again, it comes back to Dunbar’s number: we have a limited number of stable relationships, but developing and maintaining those relationships is the key to figuring out the rarity of teamwork.

A former coworker of mine passed away late last year. In going through some of his old files, we came across some notes he had on leadership. There was one quote in particular that I thought really captured the essence of this post nicely:

All music is made from the same 12 notes. All culture is made from the same five components: behaviors, relationships, attitudes, values, and environment. It’s the way those notes or components are put together that makes things sing.

This is what it takes to build a strong engineering culture and really just a healthy culture in general. The technology and everything else is secondary. It really starts with the people.

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.

Decision Impact

I think a critical part of building an empathetic engineering culture is understanding decision impact. This is a blindspot that I see happening a lot: a deliberate effort to understand the effects caused by a decision. How does adopting X affect operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? A particular decision might save you time, but does it create work or slow others down? Are we just displacing pain somewhere else?

What’s needed is a broad understanding of the net effects. Pain displacement is an indication that we’re not thinking beyond the path of least resistance. The problem is if we lack a certain empathy, we aren’t aware the pain displacement is occurring in the first place. It’s important we widen our vision beyond the deliverable in front of us. We have to think holistically—like a systems person—and think deeply about the interactions between decisions. Part of this is having an organizational awareness.

Tech debt is the one exception to this because it’s pain displacement we feel ourselves—it’s pain deferral. This is usually a decision we can make ourselves, but when we’re dealing with pain displacement involving multiple teams, that’s when problems start happening. And that’s where empathy becomes critical because software engineering is more about collaboration than code and shit has this natural tendency to roll downhill.

The first sentence of The Five Dysfunctions of a Team captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” The powerful part is obvious, but the bit about rarity is interesting when we think about teams holistically. The cause, I think, is deeply rooted in the silos or fiefdoms that naturally form around teams. The difficulty comes as an organization scales. What I see happening frequently are goals that diverge or conflict. The fix is rallying teams around a shared cause—a single, compelling vision. Likewise, it’s thinking holistically and having empathy. Understanding decision impact and pain displacement is one step to developing that empathy. This is what unlocks the rarity part of teamwork.