Microservice Observability, Part 2: Evolutionary Patterns for Solving Observability Problems

In part one of this series, I described the difference between monitoring and observability and why the latter starts to become more important when dealing with microservices. Next, we’ll discuss some strategies and patterns for implementing better observability. Specifically, we’ll look at the idea of an observability pipeline and how we can start to iteratively improve observability in our systems.

To recap, observability can be described simply as the ability to ask questions of your systems without knowing those questions in advance. This requires capturing a variety of signals such as logs, metrics, and traces as well as tools for interpreting those signals like log analysis, SIEM, data warehouses, and time-series databases. A number of challenges surface as a result of this. Clint Sharp does a great job discussing the key problems, which I’ll summarize below along with some of my own observations.

Problem 1: Agent Fatigue

A typical microservice-based system requires a lot of different operational tooling—log and metric collectors, uptime monitoring, analytics aggregators, security scanners, APM runtime instrumentation, and so on. Most of these involve agents that run on every node in the cluster (or, in some cases, every pod in Kubernetes). Since vendors optimize for day-one experience and differentiating capabilities, they are incentivized to provide agents unique to their products rather than attempting to unify or standardize on tooling. This causes problems for ops teams who are concerned with the day-two costs of running and managing all of these different agents. Resource consumption alone can be significant, especially if you add in a service mesh like Istio into the mix. Additionally, since each agent is unique, the way they are configured and managed is different. Finally, from a security perspective, every agent added to a system introduces additional attack surface to hosts in the cluster. Each agent brings not just the vendor’s code into production but also all of its dependencies.

Problem 2: Capacity Anxiety

With the elastic microservice architectures I described in part one, capacity planning for things like logs and metrics starts to become a challenge. This point is particularly salient if, for example, you’ve ever been responsible for managing Splunk licensing. With microservices, a new deployment can now cause a spike in log volumes forcing back pressure on your log ingestion across all of your services. I’ve seen Splunk ingestion get backed up for days’ worth of logs, making it nearly impossible to debug production issues when logs are needed most. I’ve seen Datadog metric ingestion grind to a halt after someone added a high-cardinality dimension to classify a metric by user. And I’ve seen security teams turn on cloud audit log exporting to their SIEM only to get flooded with low-level minutiae and noise. Most tools prioritize gross data ingestion over fine-grained control like sampling, filtering, deduplicating, and aggregating. Using collectors such as Fluentd can help with this problem but add to the first problem. Elastic microservice architectures tend to require more control over data ingestion to avoid capacity issues.

Problem 3: Foresight Required

Unlike monitoring, observability is about asking questions that we hadn’t planned to ask in advance, but we can’t ask those questions if the necessary data was never collected in the first place! The capacity problem described above might cause us to under-instrument our systems, especially when the value of logs is effectively zero—until it’s not. Between monitoring, debugging, security forensics, and other activities, effective operations requires a lot of foresight. Unfortunately, this foresight tends to come from hindsight, which might be too late depending on the situation. Most dashboards are operational scar tissue, after all. Adding or reconfiguring instrumentation after the fact can have significant lag time, which can be the difference between prolonged downtime or a speedy remediation. Elastic microservice architectures benefit greatly from the ability to selectively and dynamically dial up the granularity of operational data when it’s needed and dial it back down when it’s not.

Problem 4: Tooling and Data Accessibility

Because of the problems discussed earlier, it’s not uncommon for organizations to settle on a limited set of operations tools like logging and analytics systems. This can pose its own set of challenges, however, as valuable operational data becomes locked up within certain systems in production environments. Vendor lock-in and high switching costs can make it difficult to use the right tool for the job.

There’s a wide range of data sources that provide high-value signals such as VMs, containers, load balancers, service meshes, audit logs, VPC flow logs, and firewall logs. And there’s a wide range of sinks and downstream consumers that can benefit from these different signals. The problem is that tool and data needs vary from team to team. Different tools or products are needed for different data and different use cases. The data that operations teams care about is different from the data that business analysts, security, or product managers care about. But if the data is siloed based on form or function or the right tools aren’t available, it becomes harder for these different groups to be effective. There’s an ever-changing landscape of tools, products, and services—particularly in the operations space—so the question is: how big of a lift is it for your organization to add or change tools? How easy is it to experiment with new ones? In addition to the data siloing, the “agent fatigue” problem described above can make this challenging when re-rolling host agents at scale.

Solution: The Observability Pipeline

Solving these problems requires a solution that offers the following characteristics:

  1. Allows capturing arbitrarily wide events
  2. Consolidates data collection and instrumentation
  3. Decouples data sources from data sinks
  4. Supports input-to-output schema normalization
  5. Provides a mechanism to encode routing, filtering, and transformation logic

When we implement these different concepts, we get an observability pipeline—a way to unify the collection of operational data, shape it, enrich it, eliminate noise, and route it to any tool in the organization that can benefit from it. With input-to-output schema normalization, we can perform schema-agnostic processing to enrich, filter, aggregate, sample, or drop fields from any shape and adapt data for different destinations. This helps to support a wider range of data collectors and agents. And by decoupling sources and sinks, we can easily introduce or change tools and reroute data without impacting production systems.

We’re starting to see the commercialization of this idea with products like Cribl, but there are ways to solve some of these problems yourself, incrementally, and without the use of commercial software. The remainder of this post will discuss patterns and strategies for building your own observability pipeline. While the details here will be fairly high level, part three of this series will share some implementation details and tactics through examples.

Pattern 1: Structured Data

A key part of improving system observability is being more purposeful in how we structure our data. Specifically, structured logging is critical to supporting production systems and aiding debuggability. The last thing you want to be doing when debugging a production issue is frantically grepping log files trying to pull out needles from a haystack. In the past, logs were primarily consumed by human operators. Today, they are primarily consumed by tools. That requires some adjustments at design time. For example, if we were designing a login system, historically, we might have a logging statement that resembles the following:

log.error(“User '{}' login failed”.format(user))

This would result in a log message like:

ERROR 2019-12-30 09:28.31 User ‘tylertreat' login failed

When debugging login problems, we’d probably use a combination of grep and regular expressions to track down the users experiencing issues. This might be okay for the time being, but as we introduce additional metadata, it becomes more and more kludgy. It also means our logs are extremely fragile. People begin to rely on the format of logs in ways that might even be unknown to the developers responsible for them. Unstructured logs become an implicit, undocumented API.

With structured logs, we make that contract more explicit. Our logging statement might change to something more like:

log.error(“User login failed”,
event=LOGIN_ERROR,
user=“tylertreat”,
email=“tyler.treat@realkinetic.com”,
error=error)

The actual format we use isn’t hugely important. I typically recommend JSON because it’s ubiquitous and easy to write and parse. With JSON, our log looks something like the following:

{
“timestamp”: “2019-12-30 09:28.31”,
“level”: “ERROR”,
“event”: “user_login_error”,
“user”: “tylertreat”,
“email”: “tyler.treat@realkinetic.com”,
“error”: “Invalid username or password”,
“message”: “User login failed”
}

With this, we can parse the structure, index it, query it, even transform or redact it, and we can add new pieces of metadata without breaking consumers. Our logs start to look more like events. Remember, observability is about being able to ask arbitrary questions of our systems. Events are like logs with context, and shifting towards this model helps with being able to ask questions of our systems.

Pattern 2: Request Context and Tracing

With elastic microservice architectures, correlating events and metadata between services becomes essential. Distributed tracing is one component of this. Another is tying our structured logs together and passing shared context between services as a request traverses the system. A pattern that I recommend to teams adopting microservices is to pass a context object to everything. This is actually a pattern that originated in Go for passing request-scoped values, cancelation signals, and deadlines across API boundaries. It turns out, this is also a useful pattern for observability when extended to service boundaries. While it’s contentious to explicitly pass context objects due to the obtrusiveness to APIs, I find it better than relying on implicit, request-local storage.

In its most basic form, a context object is simply a key-value bag that lets us track metadata as a request passes through a service and is persisted through the entire execution path. OpenTracing refers to this as baggage. You can include this context as part of your structured logs. Some suggest having a single event/structured-log-with-context emitted per hop, but I think this is more aspirational. For most, it’s probably easier to get started by adding a context object to your existing logging. Our login system’s logging from above would look something like this:

def login(ctx, username, email, password):
ctx.set(user=username, email=email)
...
log.error(“User login failed”,
event=LOGIN_ERROR,
context=ctx,
error=error)
...

This adds rich metadata to our logs—great for debugging—as they start evolving towards events. The context is also a convenient way to propagate tracing information, such as a span ID, between services.

{
“timestamp”: “2019-12-30 09:28.31”,
“level”: “ERROR”,
“event”: “user_login_error”,
“context”: {
“id”: “accfbb8315c44a52ad893ca6772e1caf”,
“http_method”: “POST”,
“http_path”: “/login”,
“user”: “tylertreat”,
“email”: “tyler.treat@realkinetic.com”,
“span_id”: “34fe6cbf9556424092fb230eab6f4ea6”,
},
“error”: “Invalid username or password”,
“message”: “User login failed”
}

You might be wondering what to put on the context versus just putting on our structured logs. It’s a good question and, like most things, the answer is “it depends.” A good rule of thumb is what can you get for “free” and what do you need to pass along? These should typically be things specific to a particular request. For instance, CPU utilization and memory usage can be pulled from the environment, but a user or correlation ID are request-specific and must be propagated. This decision starts to become more obvious the deeper your microservice architectures get. Just be careful not to leak sensitive data into your logs! While we can introduce tooling into our observability pipeline to help with this risk, I believe code reviews are the best line of defense here.

Pattern 3: Data Schema

With our structured data and context, we can take it a step further and introduce schemas for each data type we collect, such as logs, metrics, and traces. Schemas provide a standard shape to the data and allow consumers to rely on certain fields and types. They might validate data types and enforce required fields like a user ID, license, or trace ID. These schemas basically take the explicit contract described above and codify it into a specification. This is definitely the most organization-dependent pattern, so it’s hard to provide specific advice. The key thing is having structured data that can be easily evolved and relied on for debugging or exploratory purposes.

These schemas also need libraries which implement the specifications and make it easy for developers to actually instrument their systems. There is a plethora of existing libraries available for structured logging. For tracing and metrics, OpenTelemetry has emerged as a vendor-neutral API and forthcoming data specification.

Pattern 4: Data Collector

So far, we’ve talked mostly about development practices that improve observability. While they don’t directly address the problems described above, later, we’ll see how they also help support other parts of the observability pipeline. Now we’re going to look at some actual infrastructure patterns for building out a pipeline.

Recall that two of the characteristics we desire in our observability solution are the ability to consolidate data collection and instrumentation and decouple data sources from data sinks. One of the ways we can reduce agent fatigue is by using a data collector to unify the collection of key pieces of observability data—namely logs (or events), metrics, and traces. This component collects the data, optionally performs some transformations or filtering on it, and writes it to a data pipeline. This commonly runs as an agent on the host. In Kubernetes, this might be a DaemonSet with an instance running on each node. From the application or container side, data is written to stdout/stderr or a Unix domain socket which the collector reads. From here, the data gets written to the pipeline, which we’ll look at next.

Moving data collection out of process can be important if your application emits a significant amount of logs or you’re doing anything at a large enough scale. I’ve seen cases where applications were spending more time writing logs than performing actual business logic. Writing logs to disk can easily take down a database or other I/O-intensive workload just by sharing a filesystem with its logging. Rather than sacrificing observability by reducing the volume and granularity of logs, offload it and move it out of the critical execution path. Logging can absolutely affect the performance and reliability of your application.

For this piece, I generally recommend using either Fluentd or Logstash along with the Beats ecosystem. I usually avoid putting too much logic into the data collector due to the way it runs distributed and at scale. If you put a lot of processing logic here, it can become difficult to manage and evolve. I find it works better to have the collector act as a dumb pipe for getting data into the system where it can be processed offline.

Pattern 5: Data Pipeline

Now that we have an agent running on each host collecting our structured data, we need a scalable, fault-tolerant data stream to handle it all. Even at modestly sized organizations, I’ve seen upwards of about 1TB of logs indexed daily with elastic microservice architectures. This volume can be much greater for larger organizations, and it can burst dramatically with the introduction of new services. As a result, decoupling sources and sinks becomes important for reducing capacity anxiety. This data pipeline is often something that can be partitioned for horizontal scalability. In doing this, we might just end up shifting the capacity anxiety from one system to another, but depending on the solution, this can be an easier problem to solve or might not be a problem at all if using a managed cloud service. Finally, a key reason for decoupling is that it also allows us to introduce or change sinks without impacting our production cluster. A benefit of this is that we can also evaluate and compare tools side-by-side. This helps reduce switching costs.

There are quite a few available solutions for this component, both open source and managed. On the open source side, examples include Apache Kafka, Apache Pulsar, and Liftbridge. On the cloud-managed services side, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs come to mind. I tend to prefer managed solutions since they allow me to focus on things that directly deliver business value rather than surrounding operational concerns.

Note that there are some important nuances depending on the pipeline implementation you use or which might determine the implementation you choose. For example, questions like how long do you need to retain observability data, do you need the ability to replay data streams, and do you need strict, in-order delivery of messages? Replaying operational data can be useful for retraining ML models or testing monitoring changes, for instance. For systems that are explicitly sharded, there’s also the question of how to partition the data. Random partitioning is usually easiest from a scaling and operations perspective, but it largely depends on how you intend to consume it.

Pattern 6: Data Router

The last pattern and component of our observability pipeline is the data router. With our operational data being written to a pipeline such as Kafka, we need something that can consume it, perform processing, and write it to various backend systems. This is also a great place to perform dynamic sampling, filtering, deduplication, aggregation, or data enrichment. The schema mentioned earlier becomes important here since the shape of the data determines how it gets handled. If you’re dealing with data from multiple sources, you’ll likely need to normalize to some common schema, either at ingestion time or processing time, in order to execute shared logic and perform schema-agnostic processing. Data may also need to be reshaped before writing to destination systems.

This piece can be as sophisticated or naive as you’d like, depending on your needs or your organization’s observability and operations maturity. A simple example is merely looking at the record type and sending logs to Splunk and Amazon Glacier cold storage, sending traces to Stackdriver, sending metrics to Datadog, and sending high-cardinality events to Honeycomb. More advanced use cases might involve dynamic sampling to dial up or down the granularity on demand, dropping values to reduce storage consumption or eliminate noise, masking values to implement data loss prevention, or joining data sources to create richer analytics.

Ultimately, this is a glue component that’s reading data in, parsing the shape of it, and writing it out to assorted APIs or other topics/streams for further downstream processing. Depending on the statefulness of your router logic, this can be a good fit for serverless solutions like AWS Lambda, Google Cloud Functions, Google Cloud Run, Azure Functions, or OpenFaaS. If using Kafka, Kafka Streams might be a good fit.

The Journey to Better Observability

Observability with elastic microservice architectures introduces some unique challenges like agent fatigue, capacity anxiety, required foresight, and tooling and data accessibility. Solving these problems requires a solution that can capture arbitrarily wide events, consolidate data collection and instrumentation, decouple data sources and sinks, support input-to-output schema normalization, and encode routing, filtering, and transformation logic. When we implement this, we get an observability pipeline, which is really just a fancy name for a collection of observability patterns and best practices.

An observability pipeline should be an evolutionary or iterative process. You shouldn’t waste time building out a sophisticated pipeline early on; you should be focused on delivering value to your customers. Instead, start small with items that add immediate value to the observability of your systems.

Something you can begin doing today that adds a ton of value with minimal lift is structured logging. Another high-leverage pattern is passing a context object throughout your service calls to propagate request metadata which can be logged and correlated. Use distributed tracing to understand and identify issues with performance. Next, move log collection out of process using Fluentd or Logstash. If you’re not already, use a centralized logging system—Splunk, Elasticsearch, Sumo Logic, Graylog—there are a bunch of options here, both open source and commercial, SaaS or self-managed. With the out-of-process collector, you can then introduce a data pipeline to decouple log producers from consumers. Again, there are managed options like Amazon Kinesis or Google Cloud Pub/Sub and self-managed ones like Apache Kafka. With this, you can now add, change, or compare consumers and log sinks without impacting production systems. Evaluate a product like Honeycomb for storing high-cardinality events. At this point, you can start to unify the collection of other instrumentation such as metrics and traces and evolve your logs towards context-rich events.

Each of these things will incrementally improve the observability of your systems and can largely be done in a stepwise fashion. Whether you’re just beginning your transition to microservices or have fully adopted them, the journey to better observability doesn’t have to require a herculean effort. Rather, it’s done one step at a time.

In part three of this series, I’ll demonstrate a few implementation details through examples to show some of these observability patterns in practice.

Microservice Observability, Part 1: Disambiguating Observability and Monitoring

“Pets versus cattle” has become something of a standard vernacular for describing the shift in how we build systems. It alludes to the elastic and dynamic nature of these (typically, but not necessarily) container-based systems with on-demand scaling and more transparent fault-tolerance. I’ve talked before about this transition before and specifically how it relates to monitoring. In particular, with these more dynamic, microservice-based systems, the conversation starts to shift away from traditional monitoring toward observability. In this series, I’ll describe that distinction, explain why it matters, and share some concrete tactical items for implementing observability in a microservice environment.

In the past, I’ve used the term “cloud-native” to describe these types of systems, but this buzzword has conflated so many different concepts that it’s been relegated to the likes of “DevOps”—entirely arbitrary and context-dependent. Depending on who you ask, cloud-native means containers, microservices, Kubernetes, elasticity, serverless, automation, or any number of other ideas. The truth, however, is that you can do many of these things on-prem just as much as in the cloud, the difference being largely CapEx versus OpEx. I think the spirit of “cloud-native” really just means architecting systems to take advantage of cloud capabilities, namely higher-level managed services (which may not even have on-prem equivalents), improved elasticity and fault-tolerance (which may or may not mean containers), and reduced operations investment (in part by leveraging managed services).

Because there are so many confounding and interrelated-yet-different ideas, I’m going to focus this discussion on elastic microservice architectures. Elastic meaning services that automatically scale up and down as needed (in contrast to static infrastructures), and microservice simply meaning applications comprised of many different—usually smaller—services (in contrast to monoliths or systems comprising just a few coarse-grained services).

Static Monolithic Architectures

With static monolithic architectures, monitoring is a reasonably well-understood problem. With a monolith, the system is typically in one of two states, up or down, and we can conceivably correlate this to customer impact. Bugs aside, when the monolith is down, we likely have a good idea of how this behavior manifests itself to the user. We can set up Nagios checks and get some meaningful signals out of it. Uptime is mostly a single data point.

With a monolith, it’s not unreasonable for ops teams to manage the day-to-day operations of the system and do so effectively. These teams tend to quickly develop a good intuition and “muscle memory” for the application when it’s the only thing they are responsible for, especially when it’s a single deployable unit. Logs can be grepped from a single log file, and if something is wrong with the application, operators might simply SSH into the box to poke at it. Runbooks and standard operating procedures are also common here.

With a monolith, we likely have a single runtime such as the JVM, which makes it easier to collect rich telemetry in a centralized way, all the way down to the code level. Tools like Dynatrace and AppDynamics can instrument the JVM itself to collect information on busy and idle threads, garbage collection stats, and request metrics. And because we have just a single deployed artifact running on a handful of static servers, this data can actually be useful and correlated back to customer impact and business metrics.

Elastic Microservice Architectures

With elastic microservice architectures, things start to change dramatically. Applications consist of dozens of different microservices. The system is no longer in one of two states but more like one of n-factorial states. In reality, it’s much more because in production you might have different versions of the same service running at the same time as you introduce more sophisticated deployment strategies and rollbacks. Integration testing can’t possibly account for all of these combinations. We can no longer easily correlate system behavior to actual customer impact because system behavior is much more emergent. It can be difficult to pinpoint how the behavior of a given service affects the user’s experience as the system operates in varying states of partial failure and services interact in unique ways. If it’s slow, which part is slow? The frontend service? An upstream service? The database? Some combination of these? Uptime is no longer a single data point but rather a composite of many different data points, but more importantly, what does “up” even mean in the context of a complex microservice architecture?

With microservices, it becomes intractable for a single ops team to manage dozens of heterogeneous services beyond anything but in a first-responder, incident-router capacity. There is too much context and specific knowledge needed since microservices are literally the embodiment of the specialization of teams.

With microservices, it’s no longer practical or even feasible to grep log files or SSH into the box to debug a problem. There might not even be a box to SSH into if it’s a container that has since been descheduled or a managed serverless runtime. With heterogeneous services, we might have half a dozen languages and runtimes to support, each with differing types of runtime instrumentation. Moreover, because we now have dozens or even hundreds of nodes running many different instances of our services, the value of this low-level, summarized data starts to diminish. It makes for pretty dashboards and can help in answering very specific, predefined questions, but that’s about it. It’s no use for proactive monitoring because it’s too much noise, and it’s no use for reactive debugging because it’s pre-aggregated. There’s not much you can do when all you have are rolled-up time-series metrics, and it’s just as difficult to correlate this data back to customer impact.

Monitoring and Observability

With a complex system, relying on this type of data along with logs can often lead to a deadend when tracking down a particularly insidious bug. And this is where observability comes into play. It picks up where monitoring leaves off.

While monitoring and observability have been getting conflated a lot lately, there’s actually an important distinction to make. Monitoring tends to focus on the overall health of system and business metrics—questions we know in advance. Observability is about providing more granular insights into the behavior of systems and richer context. It’s the difference between “post hoc” versus “ad hoc.”

In the top-right corner, we have known knowns. These are things of which we have a high degree of understanding and a large amount of data on, i.e. the things we are aware of and understand. For example, “the system has a 1GB memory limit.” As the designers of this system, this is something that we’re acutely aware of and understand. We know that we know how much memory the system can use before it moves outside of its operating boundaries and bad things happen.

In the bottom-right corner, we have known unknowns. These are things we are generally aware of but don’t necessarily understand. For example, “the system exceeded its memory limit and crashed, causing an outage.” As system designers, memory usage is something we know is important and affects system behavior. We can monitor it in production in order to gather lots of data on it, but just having that data often doesn’t help us to understand why memory is being consumed or even how that data manifests itself as system behavior.

In the top-left corner, we have unknown knowns, which are things we understand but are not completely aware of. This sounds like a strange, almost oxymoron-like categorization, but it’s basically the things that are gut instinct or intuition. It’s often things we know or think we know without even consciously realizing it. For example, “we implemented an orchestrator to ensure the system is always running.” Intuition tells us that if the process isn’t running, the system isn’t available, so we make sure that it gets restarted when something goes wrong. We might, however, be unaware of the unintended side effects of this decision, and it might be based more on theory and conjecture than data.

Which leads us to the bottom-left corner: unknown unknowns. These are the things we are neither aware of nor understand. The events we can’t even predict or foresee happening because if we could foresee them, they wouldn’t be unknown unknowns, they’d be known unknowns. For example, “instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns.” This was an unforeseen consequence of our orchestrator implementation. As a result, we could not have tested for it or looked for it with our monitoring tools. Instead, it’s something that happens, we learn from it, and quickly classify it as a known unknown—something we know to look for going forward.

In a sense, the known knowns are facts, the known unknowns are hypotheses, the unknown knowns are assumptions, and the unknown unknowns are discoveries. Through this lens, the distinction between observability and monitoring becomes clear. Monitoring is about testing hypotheses and observability is about exploring new discoveries. We monitor known unknowns because these are the things we know to look for, but unknown unknowns are, by definition, unpredictable. We cannot monitor them because we do not know to even look for them in the first place! Instead, we ask questions of our systems in order to understand and categorize these unknown unknowns. Observability is the ability to interrogate our systems after the fact in a data-rich, high-fidelity way. Monitoring, on the other hand, is before the fact and much lower fidelity. These are the dashboards and alerts we set up which usually consist of pre-aggregated metrics. This is what I mean by post hoc versus ad hoc. Observability allows us to ask arbitrary questions of our systems, not questions predefined in advance.

With this definition, monitoring is a subset of observability, and observability encompasses many different types of data. For example, things like distributed traces, application logs, system logs, audit logs, and application metrics are all important observability signals. But when we boil it all down, it turns out everything is really just events, of which we want different lenses to view. Some of this data provides context for the event itself, such as logs and metrics, and some of it describes relationships between events, such as traces. It’s important we have a way to collect all this context and store it such that we can query and analyze it using these different lenses. Aggregated metrics alone aren’t enough—they don’t have the granularity nor the context needed. Dashboards are simply answers to specific questions known in advance. Observability needs to go much deeper than this.

In part two of this series, we’ll revisit the concept of an observability pipeline as a tactical approach to implementing observability in a microservice environment. As part of this, we’ll discuss some steps that can be taken to incrementally improve observability while iterating toward this pattern.

What’s Going on with GKE and Anthos?

GCP’s Slippery Slide into Enterprise

When former Oracle exec Thomas Kurian took over for Diane Greene as Google Cloud’s CEO, a lot of people expressed concern about what this meant for the future of GCP. Vendor lock-in is already at the forefront of the minds of many cloud adopters, and Oracle is notorious for locking customers into expensive and prolonged contracts. However, I thought the move was smart on Google’s part.

Google has never been a customer-first company. While it has always been a technology leader, it struggles immensely with enterprise sales and support. It continues to have issues dogfooding its own products (Google’s products are typically built on internal versions of services not available to customers, then there are the external GCP versions that their customers actually use). This means its engineers don’t feel the same pain points that its customers experience and their products lose out on a critical feedback loop (contrast this with Amazon where AWS is treated as a separate company to Amazon.com, and there is a mandate to build with the same services Amazon’s customers use). Customer empathy matters.

Now, most people probably wouldn’t characterize Oracle as a customer-first company, but it knows how to meet customers where they are and to sell in a way that resonates with enterprise decision makers. Historically, Google has approached sales engineering in a way that has failed to resonate with customers by attempting to map its superior technology offerings onto actual customer problems. Nothing could be more off-putting to a decision maker with a round hole than a sales engineer with a square peg telling them their hole is wrong.

Thomas Kurian was brought in to address these glaring issues for Google Cloud. Through restructuring and growing its sales organization, key leadership hires, and strategic acquisitions and partnerships, it’s clear he’s serious about fixing Google Cloud’s enterprise perception problem. Slowly but surely, Google is attempting to shift its culture from being technology-obsessed to customer-obsessed. And while Oracle is notorious when it comes to vendor lock-in, all signs thus far have pointed to Google more strategically embracing open APIs with things like GKE (Kubernetes), Traffic Director (Istio), ML Engine (Tensorflow), and Dataflow (Apache Beam). They are also starting to meet customers where they are with things like Dataproc (Apache Spark and Hadoop), Memorystore (Redis), and Cloud SQL (MySQL, PostgreSQL, and Microsoft SQL Server). Hell, they’ll even run Microsoft Active Directory for you now! Who says Google can’t do enterprise? So the future is bright for GCP, right? Maybe. What follows is speculation based on my own observations and anecdotal information.

There’s one thing that could change the outlook on all of this: Anthos. Anthos is GCP’s answer to hybrid-cloud solutions like Pivotal Cloud Foundry (PCF), AWS Outposts, or Azure Stack. It allows organizations to build and manage workloads across public clouds and on-prem by extending GKE. If multi-cloud is your thing and you hate money, these platforms all sound like pretty good things. But here’s the disconcerting thing about Anthos in particular: it’s becoming clear that GCP is deliberately blurring the lines between Anthos and GKE.

I received an email yesterday from GCP announcing that Binary Authorization is now generally available (GA). Binary Authorization is a neat security feature that ensures only trusted container images can be deployed to GKE. It’s been in beta for some time and now it’s GA with a six-month free trial starting today. Great! How much will it cost after the trial? Contact your sales representative. Wait, what? That’s because starting on March 16, 2020, GKE clusters will need to be part of an Anthos-subscribed organization to enable Binary Authorization. If you choose not to upgrade to Anthos, starting March 16, 2020, you will not be able to turn on Binary Authorization on new clusters.

This is a slippery slope for GCP. I can already foresee other features requiring an Anthos subscription just to use them in GKE, where GKE basically becomes an Anthos subscription funnel. Which features go into Anthos and which go into GKE? Now this is something I’d come to expect from Oracle. If GCP starts to roll differentiating features into Anthos instead of GKE, it could mark the beginning of the end.

While the lines between Anthos and GKE are becoming increasingly fuzzy, Google is clear about this particular feature:

Binary Authorization is a feature of the Anthos platform and use of Binary Authorization is included in the Anthos subscription.

That wasn’t clear, however, when I started using it with GKE and started to advise clients to use it there, completely irrespective of Anthos. This sets a very dangerous precedent.

What’s more alarming is the marketing and product language on a number of GCP services and features have quietly replaced “GKE” with “Anthos” or, worse yet, “Anthos GKE.” For example, Cloud Run—which is still in beta—now says it can “run stateless containers on a fully managed environment or on Anthos.” Will I need an Anthos subscription to use Cloud Run with GKE once it goes GA? Based on the Binary Authorization move and the language updates, it seems likely. And looking at the GKE cluster setup wizard, it appears managed Istio might also.

Anthos features listed in GKE cluster setup wizard

Which of these features is going to require a subscription next? We know Binary Authorization already does.

Security features listed in GKE cluster setup wizard

And how much does Anthos even cost? Contact sales. Not a good look for Kurian’s vision of openness and customer choice. As AWS CEO Andy Jassy puts it, no longer does the process of buying technology involve the purchase of heavy proprietary software with multi-year contracts that include annual maintenance fees. Now it’s about choice and ease of use, including letting customers turn things off if they’re not working. But choice also means not bundling all of your differentiating features into a massive contract. List prices for Anthos start at $10,000 per month per 100 virtual CPUs with a minimum one-year commitment. This is just for the software layer. It doesn’t include any of the underlying GCP infrastructure. Again, fine for organizations willing to throw similar sums of money at things like PCF or Outposts, but are plain old GKE users really going to get roped in to this nonsense? Are they going to lose out on value-added features?

Either GCP has a well-thought-out strategy for GKE and Anthos (which, given Google’s history, is frankly unlikely) and is simply tone deaf to how it would be perceived by people already skittish about a former Oracle exec taking the reigns as CEO or this will end in disaster. It’s entirely possible this is all just a misunderstanding and they are, in a misguided fashion, rebranding GKE to Anthos (it’s been renamed once already and GCP has a history of rebranding existing products), but requiring a subscription hidden behind a sales contact form in order to use basic features is spooky.

My hope is that there is some longer-term strategy at play and GCP is not moving to an enterprise-subscription model for what should be GKE features. Best case, Google is just muddying the waters as they’ve done in the past. Worst case, they’re steamrolling their entire platform strategy to make way for enterprise sales. That would be tragic for Google given GKE is still by far and away the best managed Kubernetes service available. So what’s going on with GKE and Anthos?

Serverless on GCP

Like many other marketing buzzwords, the concept of “serverless” has taken on a life of its own, which can make it difficult to understand what serverless actually means. What it really means is that the cloud provider fully manages server infrastructure all the way up to the application layer. For example, GCE isn’t serverless because, while Google manages the physical server infrastructure, we still have to deal with patching operating systems, managing load balancers, configuring firewall rules, and so on. Serverless means we merely worry about our application code and business logic and nothing else. This concept extends beyond pure compute though, including things like databases, message queues, stream processing, machine learning, and other types of systems.

There are several benefits to the serverless model. First, it allows us to focus on building products, not managing infrastructure. These operations-related tasks, while important, are not generally things that differentiate a business. It’s just work that has to be done to support the rest of the business. With cloud—and serverless in particular—many of these tasks are becoming commoditized, freeing us up to focus on things that matter to the business.

Another benefit related to the first is that serverless systems provide automatic scaling and fault-tolerance across multiple data centers or, in some cases, even globally. When we leverage GCP’s serverless products, we also leverage Google’s operational expertise and the experience of an army of SREs. That’s a lot of leverage. Few companies are able to match the kind of investment cloud providers like Google or Amazon are able to make in infrastructure and operations, nor should they. If it’s not your core business, leverage economies of scale.

Finally, serverless allows us to pay only for what we use. This is quite a bit different from what traditional IT companies are used to where it’s more common to spend several millions of dollars on a large solution with a contract. It’s also different from what many cloud-based companies are used to where you typically provision some baseline capacity and pay for bursts of additional capacity as needed. With serverless, VMs are eschewed and we pay only for the resources we use to serve the traffic we have. This means no more worrying about over-provisioning or under-provisioning.

GCP’s Compute Options

GCP has a comprehensive set of compute options ranging from minimally managed VMs all the way to highly managed serverless backends. Below is the full spectrum of GCP’s compute services at the time of this writing. I’ll provide a brief overview of each of these services just to get the lay of the land. We’ll start from the highest level of abstraction and work our way down, and then we’ll hone in on the serverless solutions.

GCP compute platforms

Firebase is Google’s managed Backend as a Service (BaaS) platform. This is the highest level of abstraction that GCP offers (short of SaaS like G Suite) and allows you to build mobile and web applications quickly and with minimal server-side code. For example, it can implement things like user authentication and offline data syncing for you. This is often referred to as a “backend as a service” because there is no server code. The trade-off is you have less control over the system, but it can be a great fit for quickly prototyping applications or building a proof of concept with minimal investment. The primary advantage is that you can focus most of your development effort on client-side application code and user experience. Note that some components of Firebase can be used outside of the Firebase platform, such as Cloud Firestore and Firebase Authentication

Cloud Functions is a serverless Functions as a Service (FaaS) offering from GCP. You upload your function code and Cloud Functions handles the runtime of it. Because it’s a sandboxed environment, there are some restrictions to the runtime, but it’s a great choice for building event-driven services and connecting systems together. While you can develop basic user-facing APIs, the operational tooling is not sufficient for complex systems. The benefit is Cloud Functions are highly elastic and have minimal operational overhead since it is a serverless platform. They are an excellent choice for dynamic, event-driven plumbing such as moving data between services or reacting to log events. They work well for basic APIs, but can rapidly become operationally complex for more than a few endpoints.

App Engine is Google’s Platform as a Service (PaaS). Like Cloud Functions, it’s an opinionated but fully managed runtime that lets you upload your application code while handling the operational aspects such as autoscaling and fault-tolerance. App Engine has two modes: Standard, which is the opinionated PaaS runtime, and Flexible, which allows providing a custom runtime using a container—this is colloquially referred to as a Container as a Service (CaaS). For stateless applications with quick instance start-up times, it is often an excellent choice. It offers many of the benefits of Cloud Functions but simplifies operational aspects since larger components are easy to deploy and manage. App Engine allows developers to focus most of their effort on business logic. Standard is a great fit for greenfield applications where server-side processing and logic is required. Flex can be easier for migrating existing workloads because it is less opinionated.

Cloud Run is a new offering in GCP that provides a managed compute platform for stateless containers. Essentially, Google manages the underlying compute infrastructure and all you have to do is provide them an application container. Like App Engine, they handle scaling instances up and down, load balancing, and fault-tolerance. Cloud Run actually has two modes: the Google-managed version, which runs your containers on Google’s internal compute infrastructure known as Borg, and the GKE version, which allows running workloads on your own GKE cluster. This is because Cloud Run is built on an open source Kubernetes platform for serverless workloads called Knative.

Cloud Run and App Engine Flex are similar to each other, but there are some nuanced differences. One key difference is Cloud Run has very fast instance start-up time due to its reliance on the gVisor container runtime. Flex instances, on the other hand, usually take minutes to start because they involve provisioning GCE instances, load balancers, and other GCP-managed infrastructure. Flex is also more feature-rich than Cloud Run, supporting things like traffic splitting, deployment rollbacks, WebSocket connections, and VPC connections.

Kubernetes Engine, or GKE, is Google’s managed Kubernetes service. GKE effectively adds a container orchestration layer on top of GCE, putting it somewhere between IaaS (Infrastructure as a Service) and CaaS. This is typically the lowest level of abstraction most modern applications should require. There is still a lot of operational overhead involved with using a managed Kubernetes service.

Lastly, Compute Engine, or GCE, is Google’s VM offering. GCE VMs are usually run on multi-tenant hosts, but GCP also offers sole-tenant nodes where a physical Compute Engine server is dedicated to hosting a single customer’s VMs. This is the lowest level of infrastructure that GCP offers and the lowest common denominator generally available in the public clouds, usually referred to as IaaS. This means there are a lot of operational responsibilities that come with using it. There are generally few use cases that demand a bare VM.

Choosing a Serverless Option

Now that we have an overview of GCP’s compute services, we can focus in on the serverless options.

GCP serverless compute platforms

GCP currently has four serverless compute options (emphasis on computebecause there are other serverless offerings for things like databases, queues, and so forth, but these are out of scope for this discussion).

  • Cloud Run: serverless containers (CaaS)
  • App Engine: serverless platforms (PaaS)
  • Cloud Functions: serverless functions (FaaS)
  • Firebase: serverless applications (BaaS)

With four different serverless options to choose from, how do we decide which one is right? The first thing to point out is that we don’t necessarily need to choose a single solution. We might end up using a combination of these services when building a system. However, I’ve provided some criteria below on selecting solutions for different types of problems.

Firebase

If you’re looking to quickly prototype an application or focus only on writing code, Firebase can be a good fit. This is especially true if you’re wanting to focus most of your investment and time on the client-side application code and user experience. Likewise, if you want to build a mobile-ready application and don’t want to implement things like user authentication, it’s a good option. 

Firebase is obviously the most restrictive and opinionated solution, but it’s great for rapid prototyping and accelerating development of an MVP. You can also complement it with services like App Engine or Cloud Functions for situations that require server-side compute.

Good Fit Characteristics

  • Mobile-first (or ready) applications
  • Rapidly prototyping applications
  • Applications where most of the logic is (or can be) client-side
  • Using Firebase components on other platforms, such as using Cloud Firestore or Firebase Authentication on App Engine, to minimize investment in non-differentiating work

Bad Fit Characteristics

  • Applications requiring complex server-side logic or architectures
  • Applications which require control over the runtime

Cloud Functions

If you’re looking to react to real-time events, glue systems together, or build a simple API, Cloud Functions are a good choice provided you’re able to use one of the supported runtimes (Node.js, Python, and Go). If the runtime is a limitation, check out Cloud Run.

Good Fit Characteristics

  • Event-driven applications and systems
  • “Glueing” systems together
  • Deploying simple APIs

Bad Fit Characteristics

  • Highly stateful systems
  • Deploying large, complex APIs
  • Systems that require a high level of control or need custom runtimes or binaries

App Engine

If you’re looking to deploy a full application or complex API, App Engine is worth looking at. Standard is good for greenfield applications which are able to fit within the constraints of the runtime. It can scale to zero and deploys take seconds. Flexible is easier for existing applications where you’re unwilling or unable to make changes fitting them into Standard. Deploys to Flex can take minutes, and you must have a minimum of one instance running at all times.

Good Fit Characteristics

  • Stateless applications
  • Rapidly developing CRUD-heavy applications
  • Applications composed of a few services
  • Deploying complex APIs

Bad Fit Characteristics

  • Stateful applications that require lots of in-memory state to meet performance or functional requirements
  • Applications built with large or opinionated frameworks or applications that have slow start-up times (this can be okay with Flex)
  • Systems that require protocols other than HTTP

Cloud Run

If you’re looking to react to real-time events but need custom runtimes or binaries not supported by Cloud Functions, Cloud Run is a good choice. It’s also a good option for building stateless HTTP-based web services. It’s trimmed down compared to App Engine Flex, which means it has fewer features, but it also has faster instance start-up times, can scale to zero, and is billed only by actual request-processing time rather than instance time. 

Good Fit Characteristics

  • Stateless services that are easily containerized
  • Event-driven applications and systems
  • Applications that require custom system and language dependencies

Bad Fit Characteristics

  • Highly stateful systems or systems that require protocols other than HTTP
  • Compliance requirements that demand strict controls over the low-level environment and infrastructure (might be okay with the Knative GKE mode)

Finally, Google also provides a decision tree for choosing a serverless compute platform.

* App Engine standard environment supports Node.js, Python, Java, Go, PHP
* Cloud Function supports Node.js, Python, Go

Summary

Going serverless can provide a lot of efficiencies by freeing up resources and investment to focus on things that are more strategic and differentiating for a business rather than commodity infrastructure. There are trade-offs when using managed services and serverless solutions. We lose some control and visibility. At certain usage levels there can be a premium, so eventually renting VMs might be the more cost-effective solution once you crack that barrier. However, it’s important to consider not just operational costs involved in managing infrastructure, but also opportunity costs. These trade-offs have to be weighed carefully against the benefits they bring to the business.

One thing worth pointing out is that it’s often easier to move down a level of abstraction than up. That is, there’s typically less friction involved in moving from a more opinionated platform to a less opinionated one than vice versa. This is why we usually suggest starting with the highest level of abstraction possible and dropping down if and when needed.

Planting Perennials Next to Potholes

Silos, bikesheds, and focusing on what matters

If you’ve ever flown into Des Moines then you’ve had the privilege of driving on what might be the most decrepit major road in the metro area. An important artery, Fleur Drive is the only way to get to and from the airport, and the pavement is marginally better than that of a dirt road. Cars weave back and forth to dodge potholes and massive cracks in the asphalt as people race to catch their flights. There always appears to be some kind of construction going on somewhere along the six mile stretch of road, and yet, it never seems to actually improve. The road is also located in a major floodplain, so sometimes the city just closes it when the nearby river rises too much. It’s basically what you’d get if you agiled your way through urban planning.

Typically, you’ll see the Public Works Department planting flowers or otherwise maintaining the landscaping of the medians. It goes down to one lane when they have to water the flowers. Over the past month, they tore up and poured new concrete to replace the medians altogether, again bringing the road down to one lane in the process. The tulips look nice though.

It’s interesting because a lot of companies build software this way. They quickly pave the road by iterating their way there, ignoring nearby flood hazards or the anticipated traffic that’s going to be traversing it. They plant some flowers along the way to make it look nice and then move on to the next thing. Over time, the road deteriorates. Fleur is a main thoroughfare, so you can’t just close it and repave. The city doesn’t have the budget to repave it all at once anyway. So you patch up a few potholes and plant some new flowers.

There are a few different facets to this depending on what vantage point you look at it from. As it turns out, however, they all dovetail into the same thing. At the individual level, what you often see is bikeshedding. That is, engineers focusing time and energy on technical minutiae that, in the grand scheme of things, don’t really matter. Often it’s fixating on aesthetics and what you can see rather than function or things that truly move the needle forward in a meaningful way. Sometimes we get caught up in the details and plant flowers. When you’re up to your neck in alligators, it’s hard to remember that your initial objective was to drain the swamp. This often comes from a lack of direction for the team, and it’s the manager’s job to ensure we’re focusing on what matters.

At the team level, we start to run into siloing issues. This happens when we have different functions of the business focusing on their little parts of the world, more or less neglecting the other parts. Development focuses on development. Operations focuses on operations. Security focuses on security. What you get is gridlock, an utter inability to make progress because everyone is uncompromisingly fastened to their silo. Worse yet, what does manage to get done is a patchwork of competing goals and agendas. It’s building new medians as the roads crumble. And silos are not limited to pure business functions like development, operations, and security. There are silos within silos—Product Team X and Product Team Y, for example. Silos are recursive. They are a natural team dynamic that occurs as organizations grow in accordance with Dunbar’s number, especially at companies that rigidly specialize by function. This is why a cohesive vision is critical.

At the organization level, we see large-scale strategy problems and what I call “WIP-lash”—lots of WIP (Work In Progress), lots of shifting priorities, and lots of “high-priority” items. Priorities change at the drop of a hat or everything is a priority all of the time or the work is planned 12 months in advance and by the time we execute, the goalposts have moved. Executives make knee-jerk mandates in absolute terms to respond to the newest fire. Tech debt piles up as things are added to the never-ending priority queue (that’s at least one thing that doesn’t get equal priority as everything else!), but the infrastructure is in a constant state of ruin and the potholes don’t stop. WIP-lash is just strategic bikeshedding. This is a prioritization and planning issue through and through. We can’t close the entire road and repave it. Instead, we do it in phases. Managing tech debt works the same way. We have to pay it down periodically, but not with constant band-aids and chewing gum and not by stopping the world. We have to prioritize the work like everything else we do, and sometimes that means saying no to other things we deem important.

OKRs can be a useful way to force those difficult decisions and provide teams a shared vision. Specifically, they are the strategy to balance out the iterative tactics of agile. If you don’t have some kind of mile markers you’re working towards, you’re just iterating your way to nowhere. OKRs are not intended to be a waterfall approach, they are about providing strategic guidance. That doesn’t mean companies don’t screw it up though, especially when consultants get their hooks into things. They don’t need to be a large, scary, expensive process with fancy tools—just a Word document and real discussions about what needs to happen and dialogues about what is actually possible. OKRs are hard to get right though and, like anything, require iteration. A key part of good OKR processes is using them to drive discussions and negotiations up and down the organization. It surfaces conflicts and alignment issues earlier in the process. It provides line managers a mechanism to push back and force hard decisions and open a dialogue between groups. The discussions on what really matters and the negotiation about what is really possible is the major value.

“Do you want this or that? I only have resources for this.”
“Oh, I actually have engineers I can lend this quarter. Maybe that will help?”
“Sure, but we can only accomplish part of that.”
“We can make that work.”

OKRs are a vehicle for strategic discussions, not tactical status updates, task lists, or waterfall plans. Without some sort of guiding vision that you’re working towards, you’re just doing stuff. That might look and feel productive but only on the surface. It must be a negotiation if you want results and not just activity.

It really comes down to prioritization and alignment. At the individual level, we have tactical bikeshedding—focusing on items that are largely inconsequential. This is a prioritization problem. It falls on managers to keep teams focused, but it also flows from broader organizational issues. It’s particularly insidious in companies that separate product management (“the business”) from product development (“engineering”). At the organization level, we have strategic bikeshedding—being unable to make hard decisions and focus in on what matters to the business right now, resulting in WIP-lash. This is also a prioritization problem, and it leads to the tactical bikeshedding mentioned earlier. In between, at the team level, we have siloing. This causes all sorts of issues ranging from gridlock and broken customer experiences to duplication of effort. It’s an alignment problem.

There is not a simple, quick solution to these problems, but it starts at the top. If management is not in alignment and unable to prioritize what matters, no one else will. Work will happen, and to a passerby that can look reassuring, but is it work that matters? OKRs are not a silver bullet, and they are difficult to do and take time to get right. But when executed well, they can be a powerful lens to focus on what matters and provide a shared vision. As Intel co-founder and former CEO Andy Grove said, the most powerful tool of all is the word “no.”

Real Kinetic is committed to helping clients develop great engineering organizations. Learn more about working with us.