A Look at Nanomsg and Scalability Protocols (Why ZeroMQ Shouldn’t Be Your First Choice)

Earlier this month, I explored ZeroMQ and how it proves to be a promising solution for building fast, high-throughput, and scalable distributed systems. Despite lending itself quite well to these types of problems, ZeroMQ is not without its flaws. Its creators have attempted to rectify many of these shortcomings through spiritual successors Crossroads I/O and nanomsg.

The now-defunct Crossroads I/O is a proper fork of ZeroMQ with the true intention being to build a viable commercial ecosystem around it. Nanomsg, however, is a reimagining of ZeroMQ—a complete rewrite in C1. It builds upon ZeroMQ’s rock-solid performance characteristics while providing several vital improvements, both internal and external. It also attempts to address many of the strange behaviors that ZeroMQ can often exhibit. Today, I’ll take a look at what differentiates nanomsg from its predecessor and implement a use case for it in the form of service discovery.

Nanomsg vs. ZeroMQ

A common gripe people have with ZeroMQ is that it doesn’t provide an API for new transport protocols, which essentially limits you to TCP, PGM, IPC, and ITC. Nanomsg addresses this problem by providing a pluggable interface for transports and messaging protocols. This means support for new transports (e.g. WebSockets) and new messaging patterns beyond the standard set of PUB/SUB, REQ/REP, etc.

Nanomsg is also fully POSIX-compliant, giving it a cleaner API and better compatibility. No longer are sockets represented as void pointers and tied to a context—simply initialize a new socket and begin using it in one step. With ZeroMQ, the context internally acts as a storage mechanism for global state and, to the user, as a pool of I/O threads. This concept has been completely removed from nanomsg.

In addition to POSIX compliance, nanomsg is hoping to be interoperable at the API and protocol levels, which would allow it to be a drop-in replacement for, or otherwise interoperate with, ZeroMQ and other libraries which implement ZMTP/1.0 and ZMTP/2.0. It has yet to reach full parity, however.

ZeroMQ has a fundamental flaw in its architecture. Its sockets are not thread-safe. In and of itself, this is not problematic and, in fact, is beneficial in some cases. By isolating each object in its own thread, the need for semaphores and mutexes is removed. Threads don’t touch each other and, instead, concurrency is achieved with message passing. This pattern works well for objects managed by worker threads but breaks down when objects are managed in user threads. If the thread is executing another task, the object is blocked. Nanomsg does away with the one-to-one relationship between objects and threads. Rather than relying on message passing, interactions are modeled as sets of state machines. Consequently, nanomsg sockets are thread-safe.

Nanomsg has a number of other internal optimizations aimed at improving memory and CPU efficiency. ZeroMQ uses a simple trie structure to store and match PUB/SUB subscriptions, which performs nicely for sub-10,000 subscriptions but quickly becomes unreasonable for anything beyond that number. Nanomsg uses a space-optimized trie called a radix tree to store subscriptions. Unlike its predecessor, the library also offers a true zero-copy API which greatly improves performance by allowing memory to be copied from machine to machine while completely bypassing the CPU.

ZeroMQ implements load balancing using a round-robin algorithm. While it provides equal distribution of work, it has its limitations. Suppose you have two datacenters, one in New York and one in London, and each site hosts instances of “foo” services. Ideally, a request made for foo from New York shouldn’t get routed to the London datacenter and vice versa. With ZeroMQ’s round-robin balancing, this is entirely possible unfortunately. One of the new user-facing features that nanomsg offers is priority routing for outbound traffic. We avoid this latency problem by assigning priority one to foo services hosted in New York for applications also hosted there. Priority two is then assigned to foo services hosted in London, giving us a failover in the event that foos in New York are unavailable.

Additionally, nanomsg offers a command-line tool for interfacing with the system called nanocat. This tool lets you send and receive data via nanomsg sockets, which is useful for debugging and health checks.

Scalability Protocols

Perhaps most interesting is nanomsg’s philosophical departure from ZeroMQ. Instead of acting as a generic networking library, nanomsg intends to provide the “Lego bricks” for building scalable and performant distributed systems by implementing what it refers to as “scalability protocols.” These scalability protocols are communication patterns which are an abstraction on top of the network stack’s transport layer. The protocols are fully separated from each other such that each can embody a well-defined distributed algorithm. The intention, as stated by nanomsg’s author Martin Sustrik, is to have the protocol specifications standardized through the IETF.

Nanomsg currently defines six different scalability protocols: PAIR, REQREP, PIPELINE, BUS, PUBSUB, and SURVEY.

PAIR (Bidirectional Communication)

PAIR implements simple one-to-one, bidirectional communication between two endpoints. Two nodes can send messages back and forth to each other.

REQREP (Client Requests, Server Replies)

The REQREP protocol defines a pattern for building stateless services to process user requests. A client sends a request, the server receives the request, does some processing, and returns a response.

PIPELINE (One-Way Dataflow)

PIPELINE provides unidirectional dataflow which is useful for creating load-balanced processing pipelines. A producer node submits work that is distributed among consumer nodes.

BUS (Many-to-Many Communication)

BUS allows messages sent from each peer to be delivered to every other peer in the group.

PUBSUB (Topic Broadcasting)

PUBSUB allows publishers to multicast messages to zero or more subscribers. Subscribers, which can connect to multiple publishers, can subscribe to specific topics, allowing them to receive only messages that are relevant to them.

SURVEY (Ask Group a Question)

The last scalability protocol, and the one in which I will further examine by implementing a use case with, is SURVEY. The SURVEY pattern is similar to PUBSUB in that a message from one node is broadcasted to the entire group, but where it differs is that each node in the group responds to the message. This opens up a wide variety of applications because it allows you to quickly and easily query the state of a large number of systems in one go. The survey respondents must respond within a time window configured by the surveyor.

Implementing Service Discovery

As I pointed out, the SURVEY protocol has a lot of interesting applications. For example:

  • What data do you have for this record?
  • What price will you offer for this item?
  • Who can handle this request?

To continue exploring it, I will implement a basic service-discovery pattern. Service discovery is a pretty simple question that’s well-suited for SURVEY: what services are out there? Our solution will work by periodically submitting the question. As services spin up, they will connect with our service discovery system so they can identify themselves. We can tweak parameters like how often we survey the group to ensure we have an accurate list of services and how long services have to respond.

This is great because 1) the discovery system doesn’t need to be aware of what services there are—it just blindly submits the survey—and 2) when a service spins up, it will be discovered and if it dies, it will be “undiscovered.”

Here is the ServiceDiscovery class:

The discover method submits the survey and then collects the responses. Notice we construct a SURVEYOR socket and set the SURVEYOR_DEADLINE option on it. This deadline is the number of milliseconds from when a survey is submitted to when a response must be received—adjust it accordingly based on your network topology. Once the survey deadline has been reached, a NanoMsgAPIError is raised and we break the loop. The resolve method will take the name of a service and randomly select an available provider from our discovered services.

We can then wrap ServiceDiscovery with a daemon that will periodically run discover.

The discovery parameters are configured through environment variables which I inject into a Docker container.

Services must connect to the discovery system when they start up. When they receive a survey, they should respond by identifying what service they provide and where the service is located. One such service might look like the following:

Once again, we configure parameters through environment variables set on a container. Note that we connect to the discovery system with a RESPONDENT socket which then responds to service queries with the service name and address. The service itself uses a REP socket that simply responds to any requests with “The answer is 42,” but it could take any number of forms such as HTTP, raw socket, etc.

The full code for this example, including Dockerfiles, can be found on GitHub.

Nanomsg or ZeroMQ?

Based on all the improvements that nanomsg makes on top of ZeroMQ, you might be wondering why you would use the latter at all. Nanomsg is still relatively young. Although it has numerous language bindings, it hasn’t reached the maturity of ZeroMQ which has a thriving development community. ZeroMQ has extensive documentation and other resources to help developers make use of the library, while nanomsg has very little. Doing a quick Google search will give you an idea of the difference (about 500,000 results for ZeroMQ to nanomsg’s 13,500).

That said, nanomsg’s improvements and, in particular, its scalability protocols make it very appealing. A lot of the strange behaviors that ZeroMQ exposes have been resolved completely or at least mitigated. It’s actively being developed and is quickly gaining more and more traction. Technically, nanomsg has been in beta since March, but it’s starting to look production-ready if it’s not there already.

  1. The author explains why he should have originally written ZeroMQ in C instead of C++. []

Distributed Messaging with ZeroMQ

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” -Leslie Lamport

With the increased prevalence and accessibility of cloud computing, distributed systems architecture has largely supplanted more monolithic constructs. The implication of using a service-oriented architecture, of course, is that you now have to deal with a myriad of difficulties that previously never existed, such as fault tolerance, availability, and horizontal scaling. Another interesting layer of complexity is providing consistency across nodes, which itself is a problem surrounded with endless research. Algorithms like Paxos and Raft attempt to provide solutions for managing replicated data, while other solutions offer eventual consistency.

Building scalable, distributed systems is not a trivial feat, but it pales in comparison to building real-time systems of a similar nature. Distributed architecture is a well-understood problem and the fact is, most applications have a high tolerance for latency. Few systems have a demonstrable need for real-time communication, but the few that do present an interesting challenge for developers. In this article, I explore the use of ZeroMQ to approach the problem of distributed, real-time messaging in a scalable manner while also considering the notion of eventual consistency.

The Intelligent Transport Layer

ZeroMQ is a high-performance asynchronous messaging library written in C++. It’s not a dedicated message broker but rather an embeddable concurrency framework with support for direct and fan-out endpoint connections over a variety of transports. ZeroMQ implements a number of different communication patterns like request-reply, pub-sub, and push-pull through TCP, PGM (multicast), in-process, and inter-process channels. The glaring lack of UDP support is, more or less, by design because ZeroMQ was conceived to provide guaranteed-ish delivery of atomic messages. The library makes no actual guarantee of delivery, but it does make a best effort. What ZeroMQ does guarantee, however, is that you will never receive a partial message, and messages will be received in order. This is important because UDP’s performance gains really only manifest themselves in lossy or congested environments.

The comprehensive list of messaging patterns and transports alone make ZeroMQ an appealing choice for building distributed applications, but it particularly excels due to its reliability, scalability and high throughput. ZeroMQ and related technologies are popular within high-frequency trading, where packet loss of financial data is often unacceptable1. In 2011, CERN actually performed a study comparing CORBA, Ice, Thrift, ZeroMQ, and several other protocols for use in its particle accelerators and ranked ZeroMQ the highest.


ZeroMQ uses some tricks that allow it to actually outperform TCP sockets in terms of throughput such as intelligent message batching, minimizing network-stack traversals, and disabling Nagle’s algorithm. By default (and when possible), messages are queued on the subscriber, which attempts to avoid the problem of slow subscribers. However, when this isn’t sufficient, ZeroMQ employs a pattern called the “Suicidal Snail.” When a subscriber is running slow and is unable to keep up with incoming messages, ZeroMQ convinces the subscriber to kill itself. “Slow” is determined by a configurable high-water mark. The idea here is that it’s better to fail fast and allow the issue to be resolved quickly than to potentially allow stale data to flow downstream. Again, think about the high-frequency trading use case.

A Distributed, Scalable, and Fast Messaging Architecture

ZeroMQ makes a convincing case for use as a transport layer. Let’s explore a little deeper to see how it could be used to build a messaging framework for use in a real-time system. ZeroMQ is fairly intuitive to use and offers a plethora of bindings for various languages, so we’ll focus more on the architecture and messaging paradigms than the actual code.

About a year ago, while I first started investigating ZeroMQ, I built a framework to perform real-time messaging and document syncing called Zinc. A “document,” in this sense, is any well-structured and mutable piece of data—think text document, spreadsheet, canvas, etc. While purely academic, the goal was to provide developers with a framework for building rich, collaborative experiences in a distributed manner.

The framework actually had two implementations, one backed by the native ZeroMQ, and one backed by the pure Java implementation, JeroMQ2. It was really designed to allow any transport layer to be used though.

Zinc is structured around just a few core concepts: Endpoints, ChannelListeners, MessageHandlers, and Messages. An Endpoint represents a single node in an application cluster and provides functionality for sending and receiving messages to and from other Endpoints. It has outbound and inbound channels for transmitting messages to peers and receiving them, respectively.


ChannelListeners essentially act as daemons listening for incoming messages when the inbound channel is open on an Endpoint. When a message is received, it’s passed to a thread pool to be processed by a MessageHandler. Therefore, Messages are processed asynchronously in the order they are received, and as mentioned earlier, ZeroMQ guarantees in-order message delivery. As an aside, this is before I began learning Go, which would make for an ideal replacement for Java here as it’s quite well-suited to the problem :)

Messages are simply the data being exchanged between Endpoints, from which we can build upon with Documents and DocumentFragments. A Document is the structured data defined by an application, while DocumentFragment represents a partial Document, or delta, which can be as fine- or coarse- grained as needed.

Zinc is built around the publish-subscribe and push-pull messaging patterns. One Endpoint will act as the host of a cluster, while the others act as clients. With this architecture, the host acts as a publisher and the clients as subscribers. Thus, when a host fires off a Message, it’s delivered to every subscribing client in a multicast-like fashion. Conversely, clients also act as “push” Endpoints with the host being a “pull” Endpoint. Clients can then push Messages into the host’s Message queue from which the host is pulling from in a first-in-first-out manner.

This architecture allows Messages to be propagated across the entire cluster—a client makes a change which is sent to the host, who propagates this delta to all clients. This means that the client who initiated the change will receive an “echo” delta, but it will be discarded by checking the Message origin, a UUID which uniquely identifies an Endpoint. Clients are then responsible for preserving data consistency if necessary, perhaps through operational transformation or by maintaining a single source of truth from which clients can reconcile.


One of the advantages of this architecture is that it scales reasonably well due to its composability. Specifically, we can construct our cluster as a tree of clients with arbitrary breadth and depth. Obviously, the more we scale horizontally or vertically, the more latency we introduce between edge nodes. Coupled with eventual consistency, this can cause problems for some applications but might be acceptable to others.


The downside is this inherently introduces a single point of failure characterized by the client-server model. One solution might be to promote another node when the host fails and balance the tree.

Once again, this framework was mostly academic and acted as a way for me to test-drive ZeroMQ, although there are some other interesting applications of it. Since the framework supports multicast message delivery via push-pull or publish-subscribe mechanisms, one such use case is autonomous load balancing.

Paired with something like ZooKeeper, etcd, or some other service-discovery protocol, clients would be capable of discovering hosts, who act as load balancers. Once a client has discovered a host, it can request to become a part of that host’s cluster. If the host accepts the request, the client can begin to send messages to the host (and, as a result, to the rest of the cluster) and, likewise, receive messages from the host (and the rest of the cluster). This enables clients and hosts to submit work to the cluster such that it’s processed in an evenly distributed way, and workers can determine whether to pass work on further down the tree or process it themselves. Clients can choose to participate in load-balancing clusters at their own will and when they become available, making them mostly autonomous. Clients could then be quickly spun-up and spun-down using, for example, Docker containers.

ZeroMQ is great for achieving reliable, fast, and scalable distributed messaging, but it’s equally useful for performing parallel computation on a single machine or several locally networked ones by facilitating in- and inter- process communication using the same patterns. It also scales in the sense that it can effortlessly leverage multiple cores on each machine. ZeroMQ is not a replacement for a message broker, but it can work in unison with traditional message-oriented middleware. Combined with Protocol Buffers and other serialization methods, ZeroMQ makes it easy to build extremely high-throughput messaging frameworks.

  1. ZeroMQ’s founder, iMatix, was responsible for moving JPMorgan Chase and the Dow Jones Industrial Average trading platforms to OpenAMQ []
  2. In systems where near real-time is sufficient, JeroMQ is adequate and benefits by not requiring any native linking. []

No More Ninjas

How does a software company attract talent? Compensation? That’s how they attract people. Free lunches and foosball tables? Keep guessing.

The most effective way for a company to bag top-tier engineers is simple: let developers be developers.

Engineers Drive Innovation

Software is a symbiotic creature, and there are two ways a product is built: from the top down and the bottom up. Top-down development means that requirements are curated by and flow from the business, product owners, and managers, downward to the engineers. With the latter, developers explore new ideas and use technology that may be outside the organization’s standard gamut. Ideas for new products or features are born and pushed up to product owners where they can be cultivated.

Both of these methods must strike a balance for a company and its products to be successful. Without a focused vision, a product will fail. Without embracing new ideas and technology, a company will become irrelevant.

Open Source is King

How an organization perceives OSS is paramount to developers. Companies that contribute and maintain open source projects tend to attract talented software engineers. Just look at some of the top repositories on GitHub. Unfortunately, there can sometimes be an eternal struggle between engineers and lawyers.

It’s equally important that a company supports the use of OSS in its own products. I’ve worked on projects where the use of open source code was discouraged in favor of homegrown solutions. This is dangerously senseless considering most mature, open source projects are battle tested and have large community support.

Developers want to work at companies that embrace open source, which is a push-pull dynamic.

Let Me Work, Dammit

I do all of my development on a MacBook Pro running OSX. Some of my coworkers use similar hardware configurations with various flavors of Linux. Back when I worked with .NET, I used Windows. The CLI is now my home, however, and Vim is Old Reliable sitting in the driveway. Some prefer IntelliJ, some PyCharm, and yet others Sublime Text—to each his own.

What I’m getting at is everyone has their preferred setup for building software. If your product does not have technical constraints resulting from the stack you’re using, don’t force me out of my element. Let me use the environment I’m comfortable with because it’s the environment I’m productive with.

No More Ninjas

Tech recruiters: no more code ninjas, no more Jedis, no more rock stars, no more gurus. These words are an unfortunate by-product of the startup culture. You may have the best of intentions, but to me, these words are red flags and marginally patronizing.

Perhaps I’m overly cynical, but I tend to translate job postings of this nature as “Seeking to exploit naive, young programmer to build a startup by working unsustainable hours. You’ll get equity!” It’s the hard sell.

We get it, you’re looking for skilled programmers, but no self-respecting programmer will call him or herself a rock star. And if they do, not only does it blow expectations out of proportion, it makes them look like a dick. Let your company, its products, and its culture speak for themselves. If you aren’t finding the “rock stars” you were looking for, it wasn’t meant to be.

Let Developers Be Developers

Yes, I’m biased. Developers are opinionated—I certainly am. Nonetheless, I would hazard to guess many top tech companies tend to address several of the points I’ve touched on. Call it headstrong, obstinate, whatever, but these are things I like to get a feel for when interviewing with a company. Others may disagree, but I suspect you wouldn’t be too hard-pressed to find many who take the craft seriously share similar views.

The Art of Software

I stared at my coffee as a friend asked a mildly profound question: “what is your greatest passion in life?” Strangely, I thought of an exchange that occurred just a few months earlier while I was at the bank, opening an account for my software consulting company.

“What is the nature of your company’s business?” the banker asked.

“Building software,” my partner and I managed to respond with. Now suddenly fascinated, she looked away from her computer and at us, as if we had abruptly sprouted wings.

“Like Facebook?”

I did my best to suppress a laugh and simply nodded. I could hardly blame her. The 30-something-year-old banker’s knowledge of the domain likely consisted of the horrendous software they used at the bank, Angry Birds, and whatever insight she gleaned from watching The Social Network. About equivalent to my knowledge of banking.

I looked up from my coffee. Naturally, my answer would immediately gravitate towards software, but how to frame it in a way that made sense?

Building something that impacts people—making their lives easier, more enjoyable, or something to that effect—that’s really the goal of any engineer, but it’s an amazing ability that software developers possess. The modern-day alchemist, quite literally turning electrons into social media and flight simulators and bank transactions—things that delight or captivate; things that help us exist. We’re the janitors of data, or is it plumbers? But we’re also artists.

Software is a craft, an art, and not one that everyone is capable of doing (but certainly capable of learning). But what is art? Something that exhibits beauty in aesthetics, perhaps visually, perhaps audibly. We think of Michelangelo’s painting of the Sistine Chapel or Beethoven’s Fifth Symphony. We probably don’t think of Twitter or Facebook and certainly not the ATM you withdrew money from.

Web designers may argue that websites can be beautiful, and they’re right. Software can be both a visual and aural stimulus, but these are really just superficial observations. The same way I say “Huh, that’s pretty cool” after seeing the Sistine Chapel. Others may appreciate it more, or less.

Many would argue that the purpose of art is to evoke emotion from its audience. There’s probably a lot of truth to that, but to me, art exists on several different levels. There’s the immediate aesthetic beauty—what the eyes see and what the ears hear. There’s the emotions that it evokes from its audience—this makes me happy, this makes me sad, this makes me intrigued. And then there’s the appreciation of the craft: painters admiring a painting, not because of its obvious beauty, but because of the technical prowess it demonstrates. Software is no different.

I read something once that said software developers are never impressed. I don’t think that’s entirely true, or at least the entirety of the truth. Beyond the superficial veneer of beauty, there are the inner workings of incredibly complex systems, riddled with algorithms, design patterns, and all the things that make programmers swoon. And they run like well-oiled machines. That is art, but art that only others who share the trade can enjoy. There is elegance in code like there is elegance in the prose of a novel. The difference is in those who are able to marvel at it.

The great thing about engineering, software or otherwise, is the problem solving involved. There is no end to the amount of exceptionally difficult problems that need solving, and the way in which those problems are solved is an art in itself.

I doubt many people associate the words “art” and “software,” but there is as much creativity and craftsmanship in building software as there is engineering. It’s this drive, to build great products that affect people, that I consider my passion and my art.

Discipline in Prototyping

Writing software doesn’t require discipline, but writing good software does. I would argue that the vast majority of tech debt in projects results from PoCs/prototypes/spikes. The code from these typically aren’t intended to make it into production, but they almost invariably do in some capacity.

“I won’t bother writing unit tests for this code, it’s purely exploratory.”

The code grows…

“It’s just a rough proof-of-concept.”

…and grows…

“It won’t make it to production!”

…and grows, until pressure from above or other factors bulldoze it into a release.

This problem is not difficult to avoid, it’s entirely disciplinary. Spikes should always be timeboxed, but truthfully, this problem rarely occurs because of spikes—it happens at the onset of projects. Nearly every project gets its start as a proof-of-concept, which becomes a prototype, which becomes a product.

It’s during this process, as a project reaches its infancy, where tech debt has a tendency to accumulate like a cancerous growth. Less seasoned developers might skip out on writing unit tests or forgo code reviews because, well, it’s a prototype. This is especially tempting when you’re working in a codebase by yourself. I myself have been caught in this rut before.

I’m not saying you need to use TDD for every line of code you write (or, for that matter, at all), but, for the love of god, write unit tests around your code. If you’re writing code that’s in source control, it’s set in stone for everyone to see and, more important, maintain (and if it’s not in source control, it doesn’t exist). Write code like it will end up in production (because frankly, it probably will).

“Minimum viable product” is a popular buzzword these days but holds merit when used in the right context. What’s important to note is that an MVP is not a proof-of-concept nor a prototype. A PoC is a great way to sketch out an early draft, but an MVP is not a draft. Building something under the guise of an MVP and renouncing standard development procedures is a wildly mistaken approach.

It takes discipline. It might even take getting yourself caught in a tech-debt-riddled project before realizing just how crucial it is, but it will make you a better software developer and certainly make your projects more sustainable.

Productivity Over Process

It seems like every software company you talk to will boast about how they use the latest development process du jour—Agile, Lean, XP, Kanban—pick your poison. What’s interesting is that the people evangelizing their chosen methodology are typically managers, not developers, almost emphasizing the process more than the product. Startups and other young tech companies seem to be particularly guilty of this (after all, every time someone utters the words “lean startup”, an angel investor gets his wings).

I’ve worked on a number of different teams, mostly Agile, ranging in size from 4 to 20 developers. I’ve used everything from TFS and JIRA to Pivotal Tracker and Trello to manage stories and track bugs. A process, the way in which you produce something, is often seen as necessary by project managers and a necessary evil by developers.

Are software developers just cowboys who want to build something, guns blazing? To an extent, probably, but the conclusion that I’ve drawn is that there is no silver bullet when it comes to crafting software. It’s not one size fits all. Agile is not the be-all, end-all solution, nor is anything else.

I’ve seen teams that said they were Agile when, in fact, they weren’t in any sense of the word. To some, it’s just a buzzword used to attract talent. I’ve also worked for companies where Agile was used across all teams, no exceptions. This led to problems with the way some groups operated. It worked great for feature teams who were engaged in completing user stories, but for some of the component teams, it just didn’t make sense for the type of work they were doing. My team, very much a backend architecture group of about six developers, was merely going through the Agile motions—a bad sign indeed. We switched to a less structured Kanban process as a result because it worked for us.

Too much emphasis is placed on the process and not the productivity of a team. Am I saying that Agile is bad? Absolutely not. In most cases, an Agile team is a productive team. When an Agile team hits its stride and really grooves, the results are impeccable, but a process needs to be peripheral. It should get out of your way as fast as possible so you can get work done. Don’t just go through the motions.

How is Software Valued?

I was talking to a friend a few weeks ago who was putting together a business presentation for potential investors. He was developing a plan for a campground kiosk system that would rely on GIS data to allow guests to view and check in to camp sites. The plan was reasonable enough and mostly feasible. He carefully considered all the costs—licensing for a third-party GIS, kiosk hardware, line trenching—and then there was software.

He allocated a mere $8,000 for the kiosk software, a low-ball figure by any definition of the word, and he estimated it to take about four weeks to complete from scratch.

“Where did you get that figure?” I asked him. The answer basically boiled down to “thin air.”

I didn’t have any kind of sudden realization, but this exchange did reinforce something many others have already observed: software is remarkably undervalued.

All too often clients say something along the lines of “You want me to pay you $X per hour to sit and type on your computer?!” What’s not obvious to many is that software engineers create extraordinary value for businesses. It’s almost ironic considering just about everything these days is driven by software, to the extent that it’s almost taken for granted, and it doesn’t somehow materialize out of thin air.

So why is this the case? Is it because software isn’t a physical good? Maybe. However, I think the issue is largely attributed to the disparate levels of productivity between software engineers and other areas of industry. A developer might write an accounting system that leaves a large number of accountants redundant or automate a process that otherwise takes a dozen employees to complete. Is it fair that they are compensated accordingly? Again, it’s about creating value, but the fact is, most developers aren’t paid in proportion to the value they create or their productivity. Consultant John Cook explains why this is the case:

A salesman who sells 10x as much as his peers will be noticed, and compensated accordingly. Sales are easy to measure, and some salesmen make orders of magnitude more money than others. If a bricklayer were 10x more productive than his peers this would be obvious too, but it doesn’t happen: the best bricklayers cannot lay 10x as much brick as average bricklayers. Software output cannot be measured as easily as dollars or bricks. The best programmers do not write 10x as many lines of code and they certainly do not work 10x longer hours.

It may also be due, at least in part, to software being endlessly enigmatic to non-software people. Is this auto mechanic ripping us off on our car? Is this developer ripping us off on our point-of-sale system? It’s easy for people to see what it takes to build a bridge—designing it, performing simulated load tests, pouring the concrete, assembling the steel, laying the superstructure—these are all tangible overheads.

What does it take to build software? It’s just some bit-twiddling, right? There’s no inventory that needs to be accounted for; there’s no manufacturing labor. As developers, we know it’s a lot more involved than that. The problem with software is that it’s a living thing. After you build a home, you don’t decide to move the bathroom to the other side of the house. The same cannot be said of software.

Product owners are fickle creatures. They don’t know what they want, except that Feature X needs to be changed to Feature Y and still ship in time. I’ve been on projects where this had become so problematic that developers started leaving Feature X implemented. That way, when NPD ultimately decided X was correct in the first place, we would be on schedule, but that’s tangential to this conversation.

What I’m getting at is that there’s a lot more to building software than what may be perceived. There’s still planning, and designing, and prototyping, and implementing, and testing. But unlike the bridge or the house, the process doesn’t stop when the software ships.

No self-respecting (or sane) software engineer would agree to build a complete system in four weeks’ time for $8,000. It’s almost insulting. But to someone which software is completely foreign to—and it is to most—it might not sound so outlandish. The problem is finding the appropriate level of value. It’s easy if you’re an independent consultant, but if you’re one of several hundred developers at a company, how is your value measured? As Cook explains, output, in terms of lines of code, is not a reliable metric. In fact, one could argue it’s inversely proportional to a developer’s ability.

The romantic image of an über-programmer is someone who fires up Emacs, types like a machine gun, and delivers a flawless final product from scratch. A more accurate image would be someone who stares quietly into space for a few minutes and then says “Hmm. I think I’ve seen something like this before.”

It’s for this reason, combined with the fact that programmer salaries don’t really vary dramatically, that many developers do consulting as a profession. They know exactly what their time is worth and the value they add to a business. Coming back to the problem described earlier, the downside of consulting is that many customers don’t recognize that value. As a consultant, it’s also your job to establish what it is and why.

I took on a contract last month to build some mobile software for a small engineering firm. They needed an Android application but didn’t have the resources in-house to do it. They met with a few software shops in the area but none of them specialized in mobile development. I build Android apps. This raised my value, and I already had a pretty good idea what the app would do for their business. At that point, it’s just letting economics work itself out.