Engineers love engineering things. The reason is self-evident (and maybe self-fulfilling: why else would you be an engineer?). We like to think we’re pretty good at solving problems. Unfortunately, this mindset can occasionally yield undesirable consequences, ones that might not be immediately apparent but are damaging all the same.
Developers are all in tune with the idea of “don’t reinvent the wheel,” but the principle is sometimes set aside, deliberately or otherwise. People don’t generally write their own merge sort, so why would they write their own consensus protocol? Anecdotally speaking, they do.
Not-Invented-Here Syndrome is a very real thing. In many cases, consciously or not, it’s a cultural problem. In others, it’s an engineering one. Camille Fournier’s blog post on ZooKeeper helps to illustrate this point and provide some context. In it, she describes why some distributed systems choose to rely on external services, such as ZooKeeper, for distributed coordination, while others build in their own coordination logic.
We can draw a parallel between distributed systems and traditional RDBMSs, which typically implement their own file system and other low-level facilities. Why? Because it’s their competitive advantage. SQL databases sell because they offer finely tuned performance, and to do that, they need to control the things the OS would otherwise provide. Distributed databases like Riak sell because of how they handle coordination and fault tolerance, which makes owning that coordination logic their competitive advantage. This follows what Joel Spolsky says about NIH Syndrome in that “if it’s a core business function—do it yourself, no matter what.”
“If you’re developing a computer game where the plot is your competitive advantage, it’s OK to use a third party 3D library. But if cool 3D effects are going to be your distinguishing feature, you had better roll your own.”
This makes a lot of sense. My sorting algorithm is unlikely to provide me with a competitive edge, but something else might, even if it’s not particularly novel.
So in some situations, homegrown is justifiable, but that’s not always the case. Redis’ competitive advantage is its predictably low latencies and data structures. Does it make sense for it to implement its own clustering and leader election protocols? Maybe, but this is where NIH can bite you. If what you’re doing is important and there’s precedent, lean on existing research and solutions. Most would argue write safety is important, and there is certainly precedent for leader election. Why not leverage that work? Things like Raft, Paxos, and Zab provide solutions which are proven using formal methods and are peer reviewed. That doesn’t mean new solutions can’t be developed, but they generally require model checking and further scrutiny to ensure correctness. Otherwise, you’ll inevitably run into problems. Implementing our own solutions can provide valuable insight, but leave them at home if they’re not rigorously approached. Rolling your own and calling it “good enough” is dishonest to your users if it’s not properly communicated.
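To make that concrete, here’s a minimal sketch of leaning on that prior work instead of rolling your own election: ZooKeeper already runs Zab underneath, and the kazoo Python client exposes a leader-election recipe on top of it. The connection string, paths, and node identifier below are hypothetical placeholders.

    from kazoo.client import KazooClient

    # Hypothetical connection string and paths; the election rides on
    # ZooKeeper's Zab protocol rather than anything we wrote ourselves.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def lead():
        # Runs only while this process actually holds leadership.
        print("elected leader; safe to coordinate writes here")

    # Blocks until this contender wins, runs lead(), and relinquishes
    # leadership when lead() returns or the ZooKeeper session is lost.
    election = zk.Election("/myapp/election", identifier="node-1")
    election.run(lead)

The point isn’t that ZooKeeper is the only answer; it’s that the hard part, the consensus protocol itself, has already been proven and peer reviewed, so you inherit that work instead of redoing it badly.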
Elasticsearch is another interesting case to look at. You might say Elasticsearch’s competitive advantage is its full-text search engine, but it’s not. Like Solr, it’s built on Lucene. Elasticsearch was designed from the ground up to be distributed. This is what gives it a leg up over Solr and similar search servers, where horizontal scaling and fault tolerance were essentially tacked on. In a way, this resembles what happened with Redis, where failover and clustering were introduced as an afterthought. However, unlike Redis, which chose to implement its own failover coordination and cluster-membership protocol, Solr opted to use ZooKeeper as an external coordinator.
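For a sense of what “using ZooKeeper as an external coordinator” looks like in practice, here’s a minimal sketch of cluster membership with the kazoo client. The paths and naming are hypothetical, not Solr’s actual layout; the idea is that liveness is tracked by ephemeral znodes the coordinator owns rather than by a membership protocol each system reinvents.

    import socket
    from kazoo.client import KazooClient

    # Hypothetical connection string and paths.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    members_path = "/myapp/members"
    zk.ensure_path(members_path)

    # An ephemeral znode disappears automatically when this process's session
    # dies, so ZooKeeper decides who is "in" the cluster; no homegrown gossip.
    zk.create(members_path + "/" + socket.gethostname(), ephemeral=True)

    @zk.ChildrenWatch(members_path)
    def on_membership_change(children):
        print("current members:", sorted(children))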
We see that Elasticsearch’s core advantage is its distributed nature. Following that notion, it makes sense for it to own that coordination, which is why its designers chose to implement their own cluster-membership protocol, Zen Discovery. But it turns out writing cluster-membership protocols is really fucking hard, and unless you’ve written proofs for it, you probably shouldn’t do it at all. The analogy here is writing your own encryption algorithm: there is a wealth of institutional knowledge that has laid the groundwork for well-researched, well-understood solutions. That knowledge should be embraced in situations like this.
I don’t mean to pick on Redis and Elasticsearch. They’re both excellent systems, but they serve as good examples for this discussion. The problem is that users of these systems tend to overlook the issues exposed by this mentality. Frankly, few people would even know the problems exist unless they are clearly documented by vendors (and not by salespeople), and even then, how many people actually read the docs cover to cover? It’s essential we know a system’s shortcomings and edge cases so we can recognize the situations in which to apply it and, more importantly, those in which we should not.
That’s not to say you have to rely on an existing third-party library or service. Believe it or not, this isn’t a sales pitch for ZooKeeper. If it’s a core business function, it probably makes sense to build it yourself, as Joel describes. What doesn’t make sense, however, is to build it without being cognizant of conventional wisdom. I’m amazed at how often people are willing to throw away institutional knowledge, either because they don’t seek it out or because they think they can do better (without formal verification). If I have seen further, it is by standing on the shoulders of giants.
This is a huge problem in developer communities. Having written code myself and managed coders, I can relate to how these decisions get made. I have always been a big fan of reusable code. I love the idea of taking classes, functions, procedures, snippets, etc. and reusing them. When I did that, I often found myself modifying the “reusable code” because it wasn’t up to my standards or wouldn’t produce exactly the results we were looking for.
This has been going on between operating systems and programs for years. Among other things, the operating system is supposed to provide the basic, underlying interfaces to the hardware.

Game programmers in particular have bypassed the OS’s display functions and features in favor of rolling their own, because they claim they get better performance and can do things the OS can’t. This can take a lot of time and is rarely tested on enough different systems to shake it out.

Then, when the game finally hits the market, it works fine on many systems, but on some it causes real problems for a variety of reasons, such as hardware differences or even driver conflicts with other applications.
Coders are too close to the project to be making many of these decisions. I have found that the logic they use to decide how to approach a coding project is often very opinionated and convoluted. Interface issues are my personal pet peeve, and I feel this is where coders are at their worst.
Another problematic area is using relational data in an application. Very few coders know how to work properly with relational data, and consequently they use brute-force programming to make the data do what they want. Little do they know that they are breaking all sorts of database rules, which can result in a whole host of problems, including data loss, inaccuracies in the information derived from the database(s), and severe performance degradation.
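As a small illustration of the kind of rules being skipped, here is a minimal sketch (the schema is entirely hypothetical) of letting the database enforce integrity through constraints and a transaction, rather than through brute-force checks scattered across application code:

    import sqlite3

    # Hypothetical schema: the database, not the application, enforces the rules.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

    conn.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total       REAL NOT NULL CHECK (total >= 0)
        );
    """)

    with conn:  # one transaction: both inserts commit together or not at all
        conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Acme"))
        conn.execute("INSERT INTO orders (customer_id, total) VALUES (?, ?)", (1, 42.0))

    try:
        # An orphaned order is rejected by the database itself.
        conn.execute("INSERT INTO orders (customer_id, total) VALUES (?, ?)", (999, 10.0))
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)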
The bottom line is that there needs to be a layer of professionals directly involved in the development of any commercial, and most internally developed, applications. These professionals need to be well versed in the aspect of the application they are responsible for. For instance, the UI/UX pro needs to be an expert in the design and implementation of the interface aspects of the software. If a relational database is used, a database pro is needed to ensure the database(s) are properly integrated into the application. A pro who is well versed in the business rules that need to be implemented and followed should also be included, to make sure the final product meets the requirements originally spec’d.
As you read my paragraph on the use of pros above, I’m sure you thought, “Well, in an ideal world, that would be nice, but…” and you’re right to think that. In reality, depending on the size and complexity of the project, you may have a lead for a group of coders. There may be one or several such leads reporting to a manager. These leads are typically coders who have moved up through the ranks and may be the best coders in the organization. The manager gets nearly all of his or her information from the leads and is interested in metrics that support the business requirements: when will the product be done and reliable enough to foist on the consumer (and fix/update post-release), how much is this costing us in time and money, and so on.
This model leaves very little room for the time and resources needed to properly develop the team and the system to do things right! It’s sad, too. When I was in the Air National Guard in the ’80s, I learned that pilots preferred recapped tires on their jets. I asked why in the world you would want recaps on your jet (a KC-135 tanker). After all, recaps for cars usually suck.
I was told that the carcass (main body) of the aircraft’s tire had proven to be a reliable component. When aircraft tires are recapped, they go through a very thorough and rigorous testing and “rebuilding” process that leaves no doubt the tire is the best that can be on an aircraft.
Reusable code could be dealt with in the same way. As a matter of fact, we, as a society, should have modules of code that are standard, very well tested, and agreed upon as the minimum foundation that every organization should use, much in the same way that the many standards-setting organizations have done with protocols, drugs, measurements, engineering standards, and so on.