Liftbridge 1.0

Liftbridge has evolved a lot since making the first commit in October 2017, but the vision has remained the same: provide a message-streaming solution with a focus on simplicity and usability. This is demonstrated through many of the design and implementation decisions. A few examples include the use of NATS as the messaging backbone, avoiding heavy dependencies on runtimes like the JVM and external coordination systems like ZooKeeper, compiling down to a small, single static binary, opting for a gRPC-based API, and relying on plain YAML configuration. Liftbridge is written in Go, and the code is structured with the hopes that it’s relatively easy for someone to hop in and contribute to the project.

The goal of Liftbridge is to bridge the gap between sophisticated but complex log-based messaging systems like Apache Kafka and Apache Pulsar and simpler, cloud-native solutions. If you’re not familiar with the project, the introduction post sheds some light. It’s been nearly two years since I open-sourced Liftbridge, and I’m pleased to announce the project has now reached a 1.0 release. In practical terms, what this means is that the API has reached a point of stability suitable for production use and will provide a backward-compatibility commitment going forward. Liftbridge will continue to follow a semantic versioning scheme.

A lot of great features have landed since the project was first conceived in 2016 and started in 2017—replication, log compaction and retention rules, stream partitioning, activity events, and stream pausing to name a few. An official Java client has been implemented and is quickly evolving. Python will follow shortly after. There’s also a lot of exciting stuff on the roadmap ahead including auto-pausing of sparsely used partitions, durable and fault-tolerant consumer groups, a better stream re-partitioning story, and broader client support.

If you’re already using Liftbridge today or are thinking about using it, I’d love to hear from you. Be sure to follow Liftbridge on Twitter and join the community Slack channel to stay up-to-date on the latest developments.

Introducing Liftbridge: Lightweight, Fault-Tolerant Message Streams

Last week I open sourced Liftbridge, my latest project and contribution to the Cloud Native Computing Foundation ecosystem. Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.

I’ve been working on Liftbridge for the past couple of months, but it’s something I’ve been thinking about for over a year. I sketched out the design for it last year and wrote about it in January. It was largely inspired while I was working on NATS Streaming, which I’m currently still the second top contributor to. My primary involvement with NATS Streaming was building out the early data replication and clustering solution for high availability, which has continued to evolve since I left the project. In many ways, Liftbridge is about applying a lot of the things I learned while working on NATS Streaming as well as my observations from being closely involved with the NATS community for some time. It’s also the product of scratching an itch I’ve had since these are the kinds of problems I enjoy working on, and I needed something to code.

At its core, Liftbridge is a server that implements a durable, replicated message log for the NATS messaging system. Clients create a named stream which is attached to a NATS subject. The stream then records messages on that subject to a replicated write-ahead log. Multiple consumers can read back from the same stream, and multiple streams can be attached to the same subject.

The goal is to bridge the gap between sophisticated log-based messaging systems like Apache Kafka and Apache Pulsar and simpler, cloud-native systems. This meant not relying on external coordination services like ZooKeeper, not using the JVM, keeping the API as simple and small as possible, and keeping client libraries thin. The system is written in Go, making it a single static binary with a small footprint (~16MB). It relies on the Raft consensus algorithm to do coordination. It has a very minimal API (just three endpoints at the moment). And the API uses gRPC, so client libraries can be generated for most popular programming languages (there is a Go client which provides some additional wrapper logic, but it’s pretty thin). The goal is to keep Liftbridge very lightweight—in terms of runtime, operations, and complexity.

However, the bigger goal of Liftbridge is to extend NATS with a durable, at-least-once delivery mechanism that upholds the NATS tenets of simplicity, performance, and scalability. Unlike NATS Streaming, it uses the core NATS protocol with optional extensions. This means it can be added to an existing NATS deployment to provide message durability with no code changes.

NATS Streaming provides a similar log-based messaging solution. However, it is an entirely separate protocol built on top of NATS. NATS is an implementation detail—the transport—for NATS Streaming. This means the two systems have separate messaging namespaces—messages published to NATS are not accessible from NATS Streaming and vice versa. Of course, it’s a bit more nuanced than this because, in reality, NATS Streaming is using NATS subjects underneath; technically messages can be accessed, but they are serialized protobufs. These nuances often get confounded by firsttime users as it’s not always clear that NATS and NATS Streaming are completely separate systems. NATS Streaming also does not support wildcard subscriptions, which sometimes surprises users since it’s a major feature of NATS.

As a result, Liftbridge was built to augment NATS with durability rather than providing a completely separate system. To be clear, it’s still a separate server, but it merely acts as a write-ahead log for NATS subjects. NATS Streaming provides a broader set of features such as durable subscriptions, queue groups, pluggable storage backends, and multiple fault-tolerance modes. Liftbridge aims to have a relatively small API surface area.

The key features that differentiate Liftbridge are the shared message namespace, wildcards, log compaction, and horizontal scalability. NATS Streaming replicates channels to the entire cluster through a single Raft group, so adding servers does not help with scalability and actually creates a head-of-line bottleneck since everything is replicated through a single consensus group (n.b. NATS Streaming does have a partitioning mechanism, but it cannot be used in conjunction with clustering). Liftbridge allows replicating to a subset of the cluster, and each stream is replicated independently in parallel. This allows the cluster to scale horizontally and partition workloads more easily within a single, multi-tenant cluster.

Some of the key features of Liftbridge include:

  • Log-based API for NATS
  • Replicated for fault-tolerance
  • Horizontally scalable
  • Wildcard subscription support
  • At-least-once delivery support
  • Message key-value support
  • Log compaction by key (WIP)
  • Single static binary (~16MB)
  • Designed to be high-throughput (more on this to come)
  • Supremely simple

Initially, Liftbridge is designed to point to an existing NATS deployment. In the future, there will be support for a “standalone” mode where it can run with an embedded NATS server, allowing for a single deployable process. And in support of the “cloud-native” model, there is work to be done to make Liftbridge play nice with Kubernetes and generally productionalize the system, such as implementing an Operator and providing better instrumentation—perhaps with Prometheus support.

Over the coming weeks and months, I will be going into more detail on Liftbridge, including the internals of it—such as its replication protocol—and providing benchmarks for the system. Of course, there’s also a lot of work yet to be done on it, so I’ll be continuing to work on that. There are many interesting problems that still need solved, so consider this my appeal to contributors. :)

On Hireability and Recruiting

Developers deal with recruiter emails on a daily basis. It’s frustrating because it’s almost always a shotgun approach, but occasionally you get something that strays from the path and, delightfully, it stands out. It’s refreshing to have someone who has actually taken the time to determine your skill set, technology background, and how you spend your free time with respect to software. You feel like they at least have an inkling of who you are. On the other hand, it’s irritating to be bombarded by contract-to-hire offers which aren’t even remotely tailored to you. What’s worse is when it’s ten different emails from ten different people working for the same recruiting agency.

There are basically two types of software jobs: the ones which just need warm bodies—commodity developers who mostly need to code the spec handed down from the Powers That Be—and the ones which need individual contributors—people which are hired to solve hard problems. ((These would be what Google refers to as smart creatives.)) I seem to paint the first in a negative light, which isn’t actually my intention. I know plenty of developers who are perfectly content with these types of jobs. There’s nothing wrong with that, but sometimes the wires seemingly get crossed.

The problem, as I see it, is that the communications meant for one type of person are getting received by people who don’t identify. It’s a numbers game, so it boils down to lazy recruiting. Throw enough paper airplanes and eventually one will land in the lap of someone who cares—the rest will poke everyone else’s eyes out. This isn’t a humblebrag about how I’m such an Awesome Developer™ or a suggestion that we should paint everyone with a broad brush, it’s merely what I’ve observed. People are welcome to disagree and do so openly.

Many companies I would love to work at have pinged me after coming across my GitHub profile or reading my blog. Recruiters for jobs I would never work at have emailed me after seeing my LinkedIn. The split isn’t completely black and white—there are places I’d work who’ve contacted me through LinkedIn, and vice versa—but there’s definitely a strong correlation. ((Granted, the sample size of recruiters emailing me through LinkedIn is probably an order-of-magnitude larger.)) The difference is that the former almost always receive an enthusiastic, well-thought-out response, while the latter goes straight to the trash. Doing otherwise would simply be too demoralizing. I’ve talked to other developers about this, and I know I’m not the only one. Bad recruiting is just causing people to be more jaded and cynical. I don’t know about other industries, but in software it seems to be particularly magnified.

The upshot is how to minimize the “bad communications” while maximizing the “good.” You can’t avoid bad recruiters. I suspect there are lists of names with email addresses and phone numbers which get circulated around staffing agencies. ((I wouldn’t be terribly surprised if some universities sold contact information for new graduates either.)) You can remove all contact information from your online persona, but then you make it a lot harder for the “good communications” to get through.

The best way to make yourself hireable isn’t to have a good resume and interviewing skills, it’s to do cool stuff in the open and talk about it. It can be scary because sometimes you crash and burn, but that’s just as valuable. Some find it terrifying to expose their work to the world. Half the reason I blog and share it publicly is to open my work and myself up to criticism. It stings sometimes, but you get over it. I wouldn’t be where I am today without people kicking me in the ass to do better. Other times, the community can actually learn and gain insight from your viewpoint and mindset. People are too modest.

Good developers aren’t doing this to get hired either, they’re doing it because they think it’s interesting in its own right. That’s the other half of why I blog (as well as to keep a journal for myself). The self-marketing is a byproduct. My side projects and blog posts have led to far more interesting conversations and opportunities than my resume or LinkedIn profile.

People should be more open to the idea of sharing their knowledge and work with the community. The benefits go both ways.