You Are Not Paid to Write Code

“Taco Bell Programming” is the idea that we can solve many of the problems we face as software engineers with clever reconfigurations of the same basic Unix tools. The name comes from the fact that every item on the menu at Taco Bell, a company which generates almost $2 billion in revenue annually, is simply a different configuration of roughly eight ingredients.

Many people grumble or reject the notion of using proven tools or techniques. It’s boring. It requires investing time to learn at the expense of shipping code.  It doesn’t do this one thing that we need it to do. It won’t work for us. For some reason—and I continue to be completely baffled by this—everyone sees their situation as a unique snowflake despite the fact that a million other people have probably done the same thing. It’s a weird form of tunnel vision, and I see it at every level in the organization. I catch myself doing it on occasion too. I think it’s just human nature.

I was able to come to terms with this once I internalized something a colleague once said: you are not paid to write code. You have never been paid to write code. In fact, code is a nasty byproduct of being a software engineer.

Every time you write code or introduce third-party services, you are introducing the possibility of failure into your system.

I think the idea of Taco Bell Programming can be generalized further and has broader implications based on what I see in industry. There are a lot of parallels to be drawn from The Systems Bible by John Gall, which provides valuable commentary on general systems theory. Gall’s Fundamental Theorem of Systems is that new systems mean new problems. I think the same can safely be said of code—more code, more problems. Do it without a new system if you can.

Systems are seductive and engineers in particular seem to have a predisposition for them. They promise to do a job faster, better, and more easily than you could do it by yourself or with a less specialized system. But when you introduce a new system, you introduce new variables, new failure points, and new problems.

But if you set up a system, you are likely to find your time and effort now being consumed in the care and feeding of the system itself. New problems are created by its very presence. Once set up, it won’t go away, it grows and encroaches. It begins to do strange and wonderful things. Breaks down in ways you never thought possible. It kicks back, gets in the way, and opposes its own proper function. Your own perspective becomes distorted by being in the system. You become anxious and push on it to make it work. Eventually you come to believe that the misbegotten product it so grudgingly delivers is what you really wanted all the time. At that point encroachment has become complete. You have become absorbed. You are now a systems person.

The last systems principle we look at is one I find particularly poignant: almost anything is easier to get into than out of. When we introduce new systems, new tools, new lines of code, we’re with them for the long haul. It’s like a baby that doesn’t grow up.

We’re not paid to write code, we’re paid to add value (or reduce cost) to the business. Yet I often see people measuring their worth in code, in systems, in tools—all of the output that’s easy to measure. I see it come at the expense of attending meetings. I see it at the expense of supporting other teams. I see it at the expense of cross-training and personal/professional development. It’s like full-bore coding has become the norm and we’ve given up everything else.

Another area I see this manifest is with the siloing of responsibilities. Product, Platform, Infrastructure, Operations, DevOps, QA—whatever the silos, it’s created a sort of responsibility lethargy. “I’m paid to write software, not tests” or “I’m paid to write features, not deploy and monitor them.” Things of that nature.

I think this is only addressed by stewarding a strong engineering culture and instilling the right values and expectations. For example, engineers should understand that they are not defined by their tools but rather the problems they solve and ultimately the value they add. But it’s important to spell out that this goes beyond things like commits, PRs, and other vanity metrics. We should embrace the principles of systems theory and Taco Bell Programming. New systems or more code should be the last resort, not the first step. Further, we should embody what it really means to be an engineer rather than measuring raw output. You are not paid to write code.

Infrastructure Engineering in the 21st Century

Infrastructure engineering is an inherently treacherous problem space because it’s core to so many things. Systems today are increasingly distributed and increasingly complex but are built on unreliable components and will continue to be. This includes unreliable networks and faulty hardware. The 21st century engineer understands failure is routine.

Naturally, application developers would rather not have to think about low-level failure modes so they can focus on solving the problem at hand. Infrastructure engineers are then tasked with competing goals: provide enough abstraction to make application development tractable and provide enough reliability to make subsystems useful. The second goal often comes with an additional proviso in that there must be sufficient reliability without sacrificing performance to the point of no longer being useful. Anyone who has worked on enterprise messaging systems can tell you that these goals are often contradictory. The result is a wall of sand intended to keep the developer’s feet dry from the incoming tide. The 21st century engineer understands that in order to play in the sand, we all need to be comfortable getting our feet a little wet from time to time.

With the deluge of technology becoming available today, it’s tempting to introduce it all into your stack. Likewise, engineers are never happy. Left unchecked, we will hyper optimize and iterate into oblivion. It’s a problem I call “innovating to a fault.” Relying on “it’s done when it’s done” is a great way to ship vaporware. Have tangible objectives, make them high-level, and realize things change and evolve over time. Frame the concrete things you’re doing today within the context of those objectives. There’s a difference between Agile micromanagey roadmaps and having a clear vision. Determine when to innovate and when not to. Not Invented Here syndrome can be a deadly disease. Take inventory of what’s being built, make sure it ties back to your objectives, and avoid falling prey to tech pop culture. Optimize for the right problems. The 21st century engineer understands that you are not defined by your tools, you are defined by what you produce at the end of the day.

The prevalence of microservice architecture has made production tooling and instrumentation more important than ever. Teams should take ownership of their systems. If you’re not willing to stand by your work, don’t ship it. However, just because something falls outside of your system’s boundaries doesn’t mean it’s not your problem. If you rely on it, own it. Don’t be afraid to roll up your sleeves and dive into someone else’s code. The 21st century engineer understands that they live and die by the code they have in production, and if they don’t run anything in production, they aren’t really an engineer at all.

The way in which we design systems today is different from the way we designed them in the 20th century and the way we will design them in the future. There is a vast amount of research that has gone into computer science and related fields dating back to the invention of the modern computer. Research from the 50’s, 60’s, all the way up to today shows that system design always is an evolving process. Compiling this body of knowledge together provides an invaluable foundation from which we can build. The 21st century engineer understands that without a deeper understanding of that foundation or with a blind trust, we are only as good as our sand castle.

It’s our responsibility as software engineers, as system designers, as programmers to use this knowledge. Our job is not to build systems or write code, our job is to solve problems, of which code is often a byproduct. No one cares about the code you write, they care about the problems you solve. More specifically, they care about the business problems you solve. The 21st century engineer understands that if we’re not thinking about our solutions end to end, we’re not really doing our job.

Engage to Assuage

Abstraction is important. It’s how humans deal with complexity. You shouldn’t have to understand every little intricate detail behind how your system works. It would take years to do so. But abstraction comes at a cost. You agree to the abstraction’s interface, you place your trust in it, and then you remove it from your mind. That is, until it fails—and abstractions of sufficient complexity will fail. After all, we are building atop unreliable components. Also, a layer of abstraction doesn’t provide any guarantees in higher levels above it, which often results in some false assumptions.

We cannot understand how everything will work, but we should have enough understanding of how it will not work. More plainly, we should understand the cost of the abstractions we use so that we can pay for them with confidence. This doesn’t mean giving up on abstraction but engaging with the complexity that it manages.

I’ve written before about how distributed systems are a UX problem. They’re also a design problem. And a development problem. And an ops problem. And a business problem. The point is they are everyone’s problem because they are complex, and things that are sufficiently complex eventually leak. There is no airtight abstraction in this world. Without understanding limitations and trade-offs, without using the knowledge and research that has come before us, without thinking end to end, we set ourselves up for failure. If we’re going to call ourselves engineers, let’s start acting like it. Nothing is a black box to the 21st century engineer.

Not Invented Here

Engineers love engineering things. The reason is self-evident (and maybe self-fulfilling—why else would you be an engineer?). We like to think we’re pretty good at solving problems. Unfortunately, this mindset can, on occasion, yield undesirable consequences which might not be immediately apparent but all the while damaging.

Developers are all in tune with the idea of “don’t reinvent the wheel,” but it seems to be eschewed sometimes, deliberately or otherwise. People don’t generally write their own merge sort, so why would they write their own consensus protocol? Anecdotally speaking, they do.

Not-Invented-Here Syndrome is a very real thing. In many cases, consciously or not, it’s a cultural problem. In others, it’s an engineering one. Camille Fournier’s blog post on ZooKeeper helps to illustrate this point and provide some context. In it, she describes why some distributed systems choose to rely on external services, such as ZooKeeper, for distributed coordination, while others build in their own coordination logic.

We draw a parallel between distributed systems and traditional RDBMSs, which typically implement their own file system and other low-level facilities. Why? Because it’s their competitive advantage. SQL databases sell because they offer finely tuned performance, and in order to do that, they need to control these things that the OS otherwise provides. Distributed databases like Riak sell because they own the coordination logic, which helps promote their competitive advantage. This follows what Joel Spolsky says about NIH Syndrome in that “if it’s a core business function—do it yourself, no matter what.”

If you’re developing a computer game where the plot is your competitive advantage, it’s OK to use a third party 3D library. But if cool 3D effects are going to be your distinguishing feature, you had better roll your own.

This makes a lot of sense. My sorting algorithm is unlikely to provide me with a competitive edge, but something else might, even if it’s not particularly novel.

So in some situations, homegrown is justifiable, but that’s not always the case. Redis’ competitive advantage is its predictably low latencies and data structures. Does it make sense for it to implement its own clustering and leader election protocols? Maybe, but this is where NIH can bite you. If what you’re doing is important and there’s precedent, lean on existing research and solutions. Most would argue write safety is important, and there is certainly precedent for leader election. Why not leverage that work? Things like Raft, Paxos, and Zab provide solutions which are proven using formal methods and are peer reviewed. That doesn’t mean new solutions can’t be developed, but they generally require model checking and further scrutiny to ensure correctness. Otherwise, you’ll inevitably run into problems. Implementing our own solutions can provide valuable insight, but leave them at home if they’re not rigorously approached. Rolling your own and calling it “good enough” is dishonest to your users if it’s not properly communicated.

Elasticsearch is another interesting case to look at. You might say Elasticsearch’s competitive advantage is its full-text search engine, but it’s not. Like Solr, it’s built on Lucene. Elasticsearch was designed from the ground-up to be distributed. This is what gives it a leg up over Solr and other similar search servers where horizontal scaling and fault tolerance were essentially tacked on. In a way, this resembles what happened with Redis, where failover and clustering were introduced as an afterthought. However, unlike Redis, which chose to implement its own failover coordination and cluster-membership protocol, Solr opted to use ZooKeeper as an external coordinator.

We see that Elasticsearch’s core advantage is its distributed nature. Following that notion, it makes sense for it to own that coordination, which is why its designers chose to implement their own internal cluster membership, ZenDisco. But it turns out writing cluster-membership protocols is really fucking hard, and unless you’ve written proofs for it, you probably shouldn’t do it at all. The analogy here would be writing your own encryption algorithm—there’s tons of institutional knowledge which has laid the groundwork for solutions which are well-researched and well-understood. That knowledge should be embraced in situations like this.

I don’t mean to pick on Redis and Elasticsearch. They’re both excellent systems, but they serve as good examples for this discussion. The problem is that users of these systems tend to overlook the issues exposed by this mentality. Frankly, few people would know problems exist unless they are clearly documented by vendors (and not sales people) and even then, how many people actually read the docs cover-to-cover? It’s essential we know a system’s shortcomings and edge cases so we can recognize which situations to apply it and, more important, which we should not.

You don’t have to rely on an existing third-party library or service. Believe it or not, this isn’t a sales pitch for ZooKeeper. If it’s a core business function, it probably makes sense to build it yourself as Joel describes. What doesn’t make sense, however, is to build out whatever that is without being cognizant of conventional wisdom. I’m amazed at how often people are willing to throw away institutional knowledge, either because they don’t seek it out or they think they can do better (without formal verification). If I have seen further, it is by standing on the shoulders of giants.