Structuring a Cloud Infrastructure Organization

Real Kinetic often works with companies just beginning their cloud journey. Many come from a conventional on-prem IT organization, which typically looks like separate development and IT operations groups. One of the main challenges we help these clients with is how to structure their engineering organizations effectively as they make this transition. While we approach this problem holistically, it can generally be looked at as two components: product development and infrastructure. One might wonder if this is still the case with the shift to DevOps and cloud, but as we’ll see, these two groups still play important and distinct roles.

We help clients understand and embrace the notion of a product mindset as it relates to software development. This is a fundamental shift from how many of these companies have traditionally developed software, in which development was viewed as an IT partner beholden to the business. This transformation is something I’ve discussed at length and will not be the subject of this conversation. Rather, I want to spend some time talking about the other side of the coin: operations.

Operations in the Cloud

While I’ve talked about operations in the context of cloud before, it’s only been in broad strokes and not from a concrete, organizational perspective. Those discussions don’t really get to the heart of the matter and the question that so many IT leaders ask: what does an operations organization look like in the cloud?

This, of course, is a highly subjective question to which there is no “right” answer. This is doubly so considering that every company and culture is different. I can only humbly offer my opinion and answer with what I’ve seen work in the context of particular companies with particular cultures. Bear this in mind as you think about your own company. More often than not, the cultural transformation is more arduous than the technology transformation.

I should also caveat that—outside of being a strategic instrument—Real Kinetic is not in the business of simply helping companies lift-and-shift to the cloud. When we do, it’s always with the intention of modernizing and adapting to more cloud-native architectures. Consequently, our clients are not usually looking to merely replicate their current org structure in the cloud. Instead, they’re looking to tailor it appropriately.

Defining Lines of Responsibility

What should developers need to understand and be responsible for? There tend to be two schools of thought at two different extremes when it comes to this depending on peoples’ backgrounds and experiences. Oftentimes, developers will want more control over infrastructure and operations, having come from the constraints of a more siloed organization. On the flip side, operations folks and managers will likely be more in favor of having a separate group retain control over production environments and infrastructure for various reasons—efficiency, stability, and security to name a few. Not to mention, there are a lot of operational concerns that many developers are likely not even aware of—the sort of unsung, unglamorous bits of running software.

Ironically, both models can be used as an argument for “DevOps.” There are also cases to be made for either. The developer argument is better delivery velocity and innovation at a team level. The operations argument is better stability, risk management, and cost control. There’s also likely more potential for better consistency and throughput at an organization level.

The answer, unsurprisingly, is a combination of both.

There is an inherent tension between empowering developers and running an efficient organization. We want to give developers the flexibility and autonomy they need to develop good solutions and innovate. At the same time, we also need to realize the operational efficiencies that common solutions and standardization provide in order to benefit from economies of scale. Should every developer be a generalist or should there be specialists?

Real Kinetic helps clients adopt a model we refer to as “Developer Enablement.” The idea of Developer Enablement is shifting the focus of ops teams from being “masters” of production to “enablers” of production by applying a product lens to operations. In practical terms, this means less running production workloads on behalf of developers and more providing tools and products that allow developers to run workloads themselves. It also means thinking of operations less as a task-driven service model and more as a strategic enabler. However, Developer Enablement is not about giving full autonomy to developers to do as they please, it’s about providing the abstractions they need to be successful on the platform while realizing the operational efficiencies possible in a larger organization. This means providing common tooling, products, and patterns. These are developed in partnership with product teams so that they meet the needs of the organization. Some companies might refer to this as a “platform” team, though I think this has a slightly different meaning. So how does this map to an actual organization?

Mapping Out an Engineering Organization

First, let’s mentally model our engineering organization as two groups: Product Development and Infrastructure and Reliability. The first is charged with developing products for end users and customers. This is the stuff that makes the business money. The second is responsible for supporting the first. This is where the notion of “developer enablement” comes into play. And while this group isn’t necessarily doing work that is directly strategic to the business, it is work that is critical to providing efficiencies and keeping the lights on just the same. This would traditionally be referred to as Operations.

As mentioned above, the focus of this discussion is the green box. And as you might infer from the name, this group is itself composed of two subgroups. Infrastructure is about enabling product teams, and Reliability is about providing a first line of defense when it comes to triaging production incidents. This latter subgroup is, in and of itself, its own post and worthy of a separate discussion, so we’ll set that aside for another day. We are really focused on what a cloud infrastructure organization might look like. Let’s drill down on that piece of the green box.

An Infrastructure Organization Model

When thinking about organization structure, I find that it helps to consider layers of operational concern while mapping the ownership of those concerns. The below diagram is an example of this. Note that these do not necessarily map to specific team boundaries. Some areas may have overlap, and responsibilities may also shift over time. This is mostly an exercise to identify key organizational needs and concerns.

We like to model the infrastructure organization as three teams: Developer Productivity, Infrastructure Engineering, and Cloud Engineering. Each team has its own charter and mission, but they are all in support of the overarching objective of enabling product development efficiently and at scale. In some cases, these teams consist of just a handful of engineers, and in other cases, they consist of dozens or hundreds of engineers depending on the size of the organization and its needs. These team sizes also change as the priorities and needs of the company evolve over time.

Developer Productivity

Developer Productivity is tasked with getting ideas from an engineer’s brain to a deployable artifact as efficiently as possible. This involves building or providing solutions for things like CI/CD, artifact repositories, documentation portals, developer onboarding, and general developer tooling. This team is primarily an engineering spend multiplier. Often a small Developer Productivity team can create a great deal of leverage by providing these different tools and products to the organization. Their core mandate is reducing friction in the delivery process.

Infrastructure Engineering

The Infrastructure Engineering team is responsible for making the process of getting a deployable artifact to production and managing it as painless as possible for product teams. Often this looks like providing an “opinionated platform” on top of the cloud provider. Completely opening up a platform such as AWS for developers to freely use can be problematic for larger organizations because of cost and time inefficiencies. It also makes security and compliance teams’ jobs much more difficult. Therefore, this group must walk the fine line between providing developers with enough flexibility to be productive and move fast while ensuring aggregate efficiencies to maintain organization-wide throughput as well as manage costs and risk. This can look like providing a Kubernetes cluster as a service with opinions around components like load balancing, logging, monitoring, deployments, and intra-service communication patterns. Infrastructure Engineering should also provide tooling for teams to manage production services in a way that meets the organization’s regulatory requirements.

The question of ownership is important. In some organizations, the Infrastructure Engineering team may own and operate infrastructure services, such as common compute clusters, databases, or message queues. In others, they might simply provide opinionated guard rails around these things. Most commonly, it is a combination of both. Without this, it’s easy to end up with every team running their own unique messaging system, database, cache, or other piece of infrastructure. You’ll have lots of architecture astronauts on your hands, and they will need to be able to answer questions around things like high availability and disaster recovery. This leads to significant inefficiencies and operational issues. Even if there isn’t shared infrastructure, it’s valuable to have an opinionated set of technologies to consolidate institutional knowledge, tooling, patterns, and practices. This doesn’t have to act as a hard-and-fast rule, but it means teams should be able to make a good case for operating outside of the guard rails provided.

This model is different from traditional operations in that it takes a product-mindset approach to providing solutions to internal customers. This means it’s important that the group is able to understand and empathize with the product teams they serve in order to identify areas for improvement. It also means productizing and automating traditional operations tasks while encouraging good patterns and practices. This is a radical departure from the way in which most operations teams normally operate. It’s closer to how a product development team should work.

This group should also own standards around things like logging and instrumentation. These standards allow the team to develop tools and services that deal with this data across the entire organization. I’ve talked about this notion with the Observability Pipeline.

Cloud Engineering

Cloud Engineering might be closest to what most would consider a conventional operations team. In fact, we used to refer to this group as Cloud Operations but have since moved away from that vernacular due to the connotation the word “operations” carries. This group is responsible for handling common low-level concerns, underlying subsystems management, and realizing efficiencies at an aggregate level. Let’s break down what that means in practice by looking at some examples. We’ll continue using AWS to demonstrate, but the same applies across any cloud provider.

One of the low-level concerns this group is responsible for is AMI and base container image maintenance. This might be the AMIs used for Kubernetes nodes and the base images used by application pods running in the cluster. These are critical components as they directly relate to the organization’s security and compliance posture. They are also pieces most developers in a large organization are not well-equipped to—or interested in—dealing with. Patch management is a fundamental concern that often takes a back seat to feature development. Other examples of this include network configuration, certificate management, logging agents, intrusion detection, and SIEM. These are all important aspects of keeping the lights on and the company’s name out of the news headlines. Having a group that specializes in these shared operational concerns is vital.

In terms of realizing efficiencies, this mostly consists of managing AWS accounts, organization policies (another important security facet), and billing. This group owns cloud spend across the organization and, as a result, is able to monitor cumulative usage and identify areas for optimization. This might look like implementing resource-tagging policies, managing Reserved Instances, or negotiating with AWS on committed spend agreements. Spend is one of the reasons large companies standardize on a single cloud platform, so it’s essential to have good visibility and ownership over this. Note that this team is not responsible for the spend itself, rather they are responsible for visibility into the spend and cost allocations to hold teams accountable.

The unfortunate reality is that if the Cloud Engineering team does their job well, no one really thinks about them. That’s just the nature of this kind of work, but it has a massive impact on the company’s bottom line.

Summary

Depending on the company culture, words like “standards” and “opinionated” might be considered taboo. These can be especially unsettling for developers who have worked in rigid or siloed environments. However, it doesn’t have to be all or nothing. These opinions are more meant to serve as a beaten path which makes it easier and faster for teams to deliver products and focus on business value. In fact, opinionation will accelerate cloud adoption for many organizations, enable creativity on the value rather than solution architecture, and improve efficiency and consistency at a number of levels like skills, knowledge, operations, and security. The key is in understanding how to balance this with flexibility so as to not overly constrain developers.

We like taking a product approach to operations because it moves away from the “ticket-driven” and gatekeeper model that plagues so many organizations. By thinking like a product team, infrastructure and operations groups are better able to serve developers. They are also better able to scale—something that is consistently difficult for more interrupt-driven ops teams who so often find themselves becoming the bottleneck.

Notice that I’ve entirely sidestepped terms like “DevOps” and “SRE” in this discussion. That is intentional as these concepts frequently serve as a distraction for companies who are just beginning their journey to the cloud. There are ideas encapsulated by these philosophies which provide important direction and practices, but it’s imperative to not get too caught up in the dogma. Otherwise, it’s easy to spin your wheels and chase things that, at least early on, are not particularly meaningful. It’s more impactful to focus on fundamentals and finding some success early on versus trying to approach things as town planners.

Moreover, for many companies, the organization model I walked through above was the result of evolving and adapting as needs changed and less of a wholesale reorg. In the spirit of product mindset, we encourage starting small and iterating as opposed to boiling the ocean. The model above can hopefully act as a framework to help you identify needs and areas of ownership within your own organization. Keep in mind that these areas of responsibility might shift over time as capabilities are implemented and added.

Lastly, do not mistake this framework as something that might preclude exploration, learning, and innovation on the part of development teams. Again, opinionation and standards are not binding but rather act as a path of least resistance to facilitate efficiency. It’s important teams have a safe playground for exploratory work. Ideally, new ideas and discoveries that are shown to add value can be standardized over time and become part of that beaten path. This way we can make them more repeatable and scale their benefits rather than keeping them as one-off solutions.

How has your organization approached cloud development? What’s worked? What hasn’t? I’d love to hear from you.

Microservice Observability, Part 2: Evolutionary Patterns for Solving Observability Problems

In part one of this series, I described the difference between monitoring and observability and why the latter starts to become more important when dealing with microservices. Next, we’ll discuss some strategies and patterns for implementing better observability. Specifically, we’ll look at the idea of an observability pipeline and how we can start to iteratively improve observability in our systems.

To recap, observability can be described simply as the ability to ask questions of your systems without knowing those questions in advance. This requires capturing a variety of signals such as logs, metrics, and traces as well as tools for interpreting those signals like log analysis, SIEM, data warehouses, and time-series databases. A number of challenges surface as a result of this. Clint Sharp does a great job discussing the key problems, which I’ll summarize below along with some of my own observations.

Problem 1: Agent Fatigue

A typical microservice-based system requires a lot of different operational tooling—log and metric collectors, uptime monitoring, analytics aggregators, security scanners, APM runtime instrumentation, and so on. Most of these involve agents that run on every node in the cluster (or, in some cases, every pod in Kubernetes). Since vendors optimize for day-one experience and differentiating capabilities, they are incentivized to provide agents unique to their products rather than attempting to unify or standardize on tooling. This causes problems for ops teams who are concerned with the day-two costs of running and managing all of these different agents. Resource consumption alone can be significant, especially if you add in a service mesh like Istio into the mix. Additionally, since each agent is unique, the way they are configured and managed is different. Finally, from a security perspective, every agent added to a system introduces additional attack surface to hosts in the cluster. Each agent brings not just the vendor’s code into production but also all of its dependencies.

Problem 2: Capacity Anxiety

With the elastic microservice architectures I described in part one, capacity planning for things like logs and metrics starts to become a challenge. This point is particularly salient if, for example, you’ve ever been responsible for managing Splunk licensing. With microservices, a new deployment can now cause a spike in log volumes forcing back pressure on your log ingestion across all of your services. I’ve seen Splunk ingestion get backed up for days’ worth of logs, making it nearly impossible to debug production issues when logs are needed most. I’ve seen Datadog metric ingestion grind to a halt after someone added a high-cardinality dimension to classify a metric by user. And I’ve seen security teams turn on cloud audit log exporting to their SIEM only to get flooded with low-level minutiae and noise. Most tools prioritize gross data ingestion over fine-grained control like sampling, filtering, deduplicating, and aggregating. Using collectors such as Fluentd can help with this problem but add to the first problem. Elastic microservice architectures tend to require more control over data ingestion to avoid capacity issues.

Problem 3: Foresight Required

Unlike monitoring, observability is about asking questions that we hadn’t planned to ask in advance, but we can’t ask those questions if the necessary data was never collected in the first place! The capacity problem described above might cause us to under-instrument our systems, especially when the value of logs is effectively zero—until it’s not. Between monitoring, debugging, security forensics, and other activities, effective operations requires a lot of foresight. Unfortunately, this foresight tends to come from hindsight, which might be too late depending on the situation. Most dashboards are operational scar tissue, after all. Adding or reconfiguring instrumentation after the fact can have significant lag time, which can be the difference between prolonged downtime or a speedy remediation. Elastic microservice architectures benefit greatly from the ability to selectively and dynamically dial up the granularity of operational data when it’s needed and dial it back down when it’s not.

Problem 4: Tooling and Data Accessibility

Because of the problems discussed earlier, it’s not uncommon for organizations to settle on a limited set of operations tools like logging and analytics systems. This can pose its own set of challenges, however, as valuable operational data becomes locked up within certain systems in production environments. Vendor lock-in and high switching costs can make it difficult to use the right tool for the job.

There’s a wide range of data sources that provide high-value signals such as VMs, containers, load balancers, service meshes, audit logs, VPC flow logs, and firewall logs. And there’s a wide range of sinks and downstream consumers that can benefit from these different signals. The problem is that tool and data needs vary from team to team. Different tools or products are needed for different data and different use cases. The data that operations teams care about is different from the data that business analysts, security, or product managers care about. But if the data is siloed based on form or function or the right tools aren’t available, it becomes harder for these different groups to be effective. There’s an ever-changing landscape of tools, products, and services—particularly in the operations space—so the question is: how big of a lift is it for your organization to add or change tools? How easy is it to experiment with new ones? In addition to the data siloing, the “agent fatigue” problem described above can make this challenging when re-rolling host agents at scale.

Solution: The Observability Pipeline

Solving these problems requires a solution that offers the following characteristics:

  1. Allows capturing arbitrarily wide events
  2. Consolidates data collection and instrumentation
  3. Decouples data sources from data sinks
  4. Supports input-to-output schema normalization
  5. Provides a mechanism to encode routing, filtering, and transformation logic

When we implement these different concepts, we get an observability pipeline—a way to unify the collection of operational data, shape it, enrich it, eliminate noise, and route it to any tool in the organization that can benefit from it. With input-to-output schema normalization, we can perform schema-agnostic processing to enrich, filter, aggregate, sample, or drop fields from any shape and adapt data for different destinations. This helps to support a wider range of data collectors and agents. And by decoupling sources and sinks, we can easily introduce or change tools and reroute data without impacting production systems.

We’re starting to see the commercialization of this idea with products like Cribl, but there are ways to solve some of these problems yourself, incrementally, and without the use of commercial software. The remainder of this post will discuss patterns and strategies for building your own observability pipeline. While the details here will be fairly high level, part three of this series will share some implementation details and tactics through examples.

Pattern 1: Structured Data

A key part of improving system observability is being more purposeful in how we structure our data. Specifically, structured logging is critical to supporting production systems and aiding debuggability. The last thing you want to be doing when debugging a production issue is frantically grepping log files trying to pull out needles from a haystack. In the past, logs were primarily consumed by human operators. Today, they are primarily consumed by tools. That requires some adjustments at design time. For example, if we were designing a login system, historically, we might have a logging statement that resembles the following:

log.error(“User '{}' login failed”.format(user))

This would result in a log message like:

ERROR 2019-12-30 09:28.31 User ‘tylertreat' login failed

When debugging login problems, we’d probably use a combination of grep and regular expressions to track down the users experiencing issues. This might be okay for the time being, but as we introduce additional metadata, it becomes more and more kludgy. It also means our logs are extremely fragile. People begin to rely on the format of logs in ways that might even be unknown to the developers responsible for them. Unstructured logs become an implicit, undocumented API.

With structured logs, we make that contract more explicit. Our logging statement might change to something more like:

log.error(“User login failed”,
event=LOGIN_ERROR,
user=“tylertreat”,
email=“tyler.treat@realkinetic.com”,
error=error)

The actual format we use isn’t hugely important. I typically recommend JSON because it’s ubiquitous and easy to write and parse. With JSON, our log looks something like the following:

{
“timestamp”: “2019-12-30 09:28.31”,
“level”: “ERROR”,
“event”: “user_login_error”,
“user”: “tylertreat”,
“email”: “tyler.treat@realkinetic.com”,
“error”: “Invalid username or password”,
“message”: “User login failed”
}

With this, we can parse the structure, index it, query it, even transform or redact it, and we can add new pieces of metadata without breaking consumers. Our logs start to look more like events. Remember, observability is about being able to ask arbitrary questions of our systems. Events are like logs with context, and shifting towards this model helps with being able to ask questions of our systems.

Pattern 2: Request Context and Tracing

With elastic microservice architectures, correlating events and metadata between services becomes essential. Distributed tracing is one component of this. Another is tying our structured logs together and passing shared context between services as a request traverses the system. A pattern that I recommend to teams adopting microservices is to pass a context object to everything. This is actually a pattern that originated in Go for passing request-scoped values, cancelation signals, and deadlines across API boundaries. It turns out, this is also a useful pattern for observability when extended to service boundaries. While it’s contentious to explicitly pass context objects due to the obtrusiveness to APIs, I find it better than relying on implicit, request-local storage.

In its most basic form, a context object is simply a key-value bag that lets us track metadata as a request passes through a service and is persisted through the entire execution path. OpenTracing refers to this as baggage. You can include this context as part of your structured logs. Some suggest having a single event/structured-log-with-context emitted per hop, but I think this is more aspirational. For most, it’s probably easier to get started by adding a context object to your existing logging. Our login system’s logging from above would look something like this:

def login(ctx, username, email, password):
ctx.set(user=username, email=email)
...
log.error(“User login failed”,
event=LOGIN_ERROR,
context=ctx,
error=error)
...

This adds rich metadata to our logs—great for debugging—as they start evolving towards events. The context is also a convenient way to propagate tracing information, such as a span ID, between services.

{
“timestamp”: “2019-12-30 09:28.31”,
“level”: “ERROR”,
“event”: “user_login_error”,
“context”: {
“id”: “accfbb8315c44a52ad893ca6772e1caf”,
“http_method”: “POST”,
“http_path”: “/login”,
“user”: “tylertreat”,
“email”: “tyler.treat@realkinetic.com”,
“span_id”: “34fe6cbf9556424092fb230eab6f4ea6”,
},
“error”: “Invalid username or password”,
“message”: “User login failed”
}

You might be wondering what to put on the context versus just putting on our structured logs. It’s a good question and, like most things, the answer is “it depends.” A good rule of thumb is what can you get for “free” and what do you need to pass along? These should typically be things specific to a particular request. For instance, CPU utilization and memory usage can be pulled from the environment, but a user or correlation ID are request-specific and must be propagated. This decision starts to become more obvious the deeper your microservice architectures get. Just be careful not to leak sensitive data into your logs! While we can introduce tooling into our observability pipeline to help with this risk, I believe code reviews are the best line of defense here.

Pattern 3: Data Schema

With our structured data and context, we can take it a step further and introduce schemas for each data type we collect, such as logs, metrics, and traces. Schemas provide a standard shape to the data and allow consumers to rely on certain fields and types. They might validate data types and enforce required fields like a user ID, license, or trace ID. These schemas basically take the explicit contract described above and codify it into a specification. This is definitely the most organization-dependent pattern, so it’s hard to provide specific advice. The key thing is having structured data that can be easily evolved and relied on for debugging or exploratory purposes.

These schemas also need libraries which implement the specifications and make it easy for developers to actually instrument their systems. There is a plethora of existing libraries available for structured logging. For tracing and metrics, OpenTelemetry has emerged as a vendor-neutral API and forthcoming data specification.

Pattern 4: Data Collector

So far, we’ve talked mostly about development practices that improve observability. While they don’t directly address the problems described above, later, we’ll see how they also help support other parts of the observability pipeline. Now we’re going to look at some actual infrastructure patterns for building out a pipeline.

Recall that two of the characteristics we desire in our observability solution are the ability to consolidate data collection and instrumentation and decouple data sources from data sinks. One of the ways we can reduce agent fatigue is by using a data collector to unify the collection of key pieces of observability data—namely logs (or events), metrics, and traces. This component collects the data, optionally performs some transformations or filtering on it, and writes it to a data pipeline. This commonly runs as an agent on the host. In Kubernetes, this might be a DaemonSet with an instance running on each node. From the application or container side, data is written to stdout/stderr or a Unix domain socket which the collector reads. From here, the data gets written to the pipeline, which we’ll look at next.

Moving data collection out of process can be important if your application emits a significant amount of logs or you’re doing anything at a large enough scale. I’ve seen cases where applications were spending more time writing logs than performing actual business logic. Writing logs to disk can easily take down a database or other I/O-intensive workload just by sharing a filesystem with its logging. Rather than sacrificing observability by reducing the volume and granularity of logs, offload it and move it out of the critical execution path. Logging can absolutely affect the performance and reliability of your application.

For this piece, I generally recommend using either Fluentd or Logstash along with the Beats ecosystem. I usually avoid putting too much logic into the data collector due to the way it runs distributed and at scale. If you put a lot of processing logic here, it can become difficult to manage and evolve. I find it works better to have the collector act as a dumb pipe for getting data into the system where it can be processed offline.

Pattern 5: Data Pipeline

Now that we have an agent running on each host collecting our structured data, we need a scalable, fault-tolerant data stream to handle it all. Even at modestly sized organizations, I’ve seen upwards of about 1TB of logs indexed daily with elastic microservice architectures. This volume can be much greater for larger organizations, and it can burst dramatically with the introduction of new services. As a result, decoupling sources and sinks becomes important for reducing capacity anxiety. This data pipeline is often something that can be partitioned for horizontal scalability. In doing this, we might just end up shifting the capacity anxiety from one system to another, but depending on the solution, this can be an easier problem to solve or might not be a problem at all if using a managed cloud service. Finally, a key reason for decoupling is that it also allows us to introduce or change sinks without impacting our production cluster. A benefit of this is that we can also evaluate and compare tools side-by-side. This helps reduce switching costs.

There are quite a few available solutions for this component, both open source and managed. On the open source side, examples include Apache Kafka, Apache Pulsar, and Liftbridge. On the cloud-managed services side, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs come to mind. I tend to prefer managed solutions since they allow me to focus on things that directly deliver business value rather than surrounding operational concerns.

Note that there are some important nuances depending on the pipeline implementation you use or which might determine the implementation you choose. For example, questions like how long do you need to retain observability data, do you need the ability to replay data streams, and do you need strict, in-order delivery of messages? Replaying operational data can be useful for retraining ML models or testing monitoring changes, for instance. For systems that are explicitly sharded, there’s also the question of how to partition the data. Random partitioning is usually easiest from a scaling and operations perspective, but it largely depends on how you intend to consume it.

Pattern 6: Data Router

The last pattern and component of our observability pipeline is the data router. With our operational data being written to a pipeline such as Kafka, we need something that can consume it, perform processing, and write it to various backend systems. This is also a great place to perform dynamic sampling, filtering, deduplication, aggregation, or data enrichment. The schema mentioned earlier becomes important here since the shape of the data determines how it gets handled. If you’re dealing with data from multiple sources, you’ll likely need to normalize to some common schema, either at ingestion time or processing time, in order to execute shared logic and perform schema-agnostic processing. Data may also need to be reshaped before writing to destination systems.

This piece can be as sophisticated or naive as you’d like, depending on your needs or your organization’s observability and operations maturity. A simple example is merely looking at the record type and sending logs to Splunk and Amazon Glacier cold storage, sending traces to Stackdriver, sending metrics to Datadog, and sending high-cardinality events to Honeycomb. More advanced use cases might involve dynamic sampling to dial up or down the granularity on demand, dropping values to reduce storage consumption or eliminate noise, masking values to implement data loss prevention, or joining data sources to create richer analytics.

Ultimately, this is a glue component that’s reading data in, parsing the shape of it, and writing it out to assorted APIs or other topics/streams for further downstream processing. Depending on the statefulness of your router logic, this can be a good fit for serverless solutions like AWS Lambda, Google Cloud Functions, Google Cloud Run, Azure Functions, or OpenFaaS. If using Kafka, Kafka Streams might be a good fit.

The Journey to Better Observability

Observability with elastic microservice architectures introduces some unique challenges like agent fatigue, capacity anxiety, required foresight, and tooling and data accessibility. Solving these problems requires a solution that can capture arbitrarily wide events, consolidate data collection and instrumentation, decouple data sources and sinks, support input-to-output schema normalization, and encode routing, filtering, and transformation logic. When we implement this, we get an observability pipeline, which is really just a fancy name for a collection of observability patterns and best practices.

An observability pipeline should be an evolutionary or iterative process. You shouldn’t waste time building out a sophisticated pipeline early on; you should be focused on delivering value to your customers. Instead, start small with items that add immediate value to the observability of your systems.

Something you can begin doing today that adds a ton of value with minimal lift is structured logging. Another high-leverage pattern is passing a context object throughout your service calls to propagate request metadata which can be logged and correlated. Use distributed tracing to understand and identify issues with performance. Next, move log collection out of process using Fluentd or Logstash. If you’re not already, use a centralized logging system—Splunk, Elasticsearch, Sumo Logic, Graylog—there are a bunch of options here, both open source and commercial, SaaS or self-managed. With the out-of-process collector, you can then introduce a data pipeline to decouple log producers from consumers. Again, there are managed options like Amazon Kinesis or Google Cloud Pub/Sub and self-managed ones like Apache Kafka. With this, you can now add, change, or compare consumers and log sinks without impacting production systems. Evaluate a product like Honeycomb for storing high-cardinality events. At this point, you can start to unify the collection of other instrumentation such as metrics and traces and evolve your logs towards context-rich events.

Each of these things will incrementally improve the observability of your systems and can largely be done in a stepwise fashion. Whether you’re just beginning your transition to microservices or have fully adopted them, the journey to better observability doesn’t have to require a herculean effort. Rather, it’s done one step at a time.

In part three of this series, I’ll demonstrate a few implementation details through examples to show some of these observability patterns in practice.

Microservice Observability, Part 1: Disambiguating Observability and Monitoring

“Pets versus cattle” has become something of a standard vernacular for describing the shift in how we build systems. It alludes to the elastic and dynamic nature of these (typically, but not necessarily) container-based systems with on-demand scaling and more transparent fault-tolerance. I’ve talked before about this transition before and specifically how it relates to monitoring. In particular, with these more dynamic, microservice-based systems, the conversation starts to shift away from traditional monitoring toward observability. In this series, I’ll describe that distinction, explain why it matters, and share some concrete tactical items for implementing observability in a microservice environment.

In the past, I’ve used the term “cloud-native” to describe these types of systems, but this buzzword has conflated so many different concepts that it’s been relegated to the likes of “DevOps”—entirely arbitrary and context-dependent. Depending on who you ask, cloud-native means containers, microservices, Kubernetes, elasticity, serverless, automation, or any number of other ideas. The truth, however, is that you can do many of these things on-prem just as much as in the cloud, the difference being largely CapEx versus OpEx. I think the spirit of “cloud-native” really just means architecting systems to take advantage of cloud capabilities, namely higher-level managed services (which may not even have on-prem equivalents), improved elasticity and fault-tolerance (which may or may not mean containers), and reduced operations investment (in part by leveraging managed services).

Because there are so many confounding and interrelated-yet-different ideas, I’m going to focus this discussion on elastic microservice architectures. Elastic meaning services that automatically scale up and down as needed (in contrast to static infrastructures), and microservice simply meaning applications comprised of many different—usually smaller—services (in contrast to monoliths or systems comprising just a few coarse-grained services).

Static Monolithic Architectures

With static monolithic architectures, monitoring is a reasonably well-understood problem. With a monolith, the system is typically in one of two states, up or down, and we can conceivably correlate this to customer impact. Bugs aside, when the monolith is down, we likely have a good idea of how this behavior manifests itself to the user. We can set up Nagios checks and get some meaningful signals out of it. Uptime is mostly a single data point.

With a monolith, it’s not unreasonable for ops teams to manage the day-to-day operations of the system and do so effectively. These teams tend to quickly develop a good intuition and “muscle memory” for the application when it’s the only thing they are responsible for, especially when it’s a single deployable unit. Logs can be grepped from a single log file, and if something is wrong with the application, operators might simply SSH into the box to poke at it. Runbooks and standard operating procedures are also common here.

With a monolith, we likely have a single runtime such as the JVM, which makes it easier to collect rich telemetry in a centralized way, all the way down to the code level. Tools like Dynatrace and AppDynamics can instrument the JVM itself to collect information on busy and idle threads, garbage collection stats, and request metrics. And because we have just a single deployed artifact running on a handful of static servers, this data can actually be useful and correlated back to customer impact and business metrics.

Elastic Microservice Architectures

With elastic microservice architectures, things start to change dramatically. Applications consist of dozens of different microservices. The system is no longer in one of two states but more like one of n-factorial states. In reality, it’s much more because in production you might have different versions of the same service running at the same time as you introduce more sophisticated deployment strategies and rollbacks. Integration testing can’t possibly account for all of these combinations. We can no longer easily correlate system behavior to actual customer impact because system behavior is much more emergent. It can be difficult to pinpoint how the behavior of a given service affects the user’s experience as the system operates in varying states of partial failure and services interact in unique ways. If it’s slow, which part is slow? The frontend service? An upstream service? The database? Some combination of these? Uptime is no longer a single data point but rather a composite of many different data points, but more importantly, what does “up” even mean in the context of a complex microservice architecture?

With microservices, it becomes intractable for a single ops team to manage dozens of heterogeneous services beyond anything but in a first-responder, incident-router capacity. There is too much context and specific knowledge needed since microservices are literally the embodiment of the specialization of teams.

With microservices, it’s no longer practical or even feasible to grep log files or SSH into the box to debug a problem. There might not even be a box to SSH into if it’s a container that has since been descheduled or a managed serverless runtime. With heterogeneous services, we might have half a dozen languages and runtimes to support, each with differing types of runtime instrumentation. Moreover, because we now have dozens or even hundreds of nodes running many different instances of our services, the value of this low-level, summarized data starts to diminish. It makes for pretty dashboards and can help in answering very specific, predefined questions, but that’s about it. It’s no use for proactive monitoring because it’s too much noise, and it’s no use for reactive debugging because it’s pre-aggregated. There’s not much you can do when all you have are rolled-up time-series metrics, and it’s just as difficult to correlate this data back to customer impact.

Monitoring and Observability

With a complex system, relying on this type of data along with logs can often lead to a deadend when tracking down a particularly insidious bug. And this is where observability comes into play. It picks up where monitoring leaves off.

While monitoring and observability have been getting conflated a lot lately, there’s actually an important distinction to make. Monitoring tends to focus on the overall health of system and business metrics—questions we know in advance. Observability is about providing more granular insights into the behavior of systems and richer context. It’s the difference between “post hoc” versus “ad hoc.”

In the top-right corner, we have known knowns. These are things of which we have a high degree of understanding and a large amount of data on, i.e. the things we are aware of and understand. For example, “the system has a 1GB memory limit.” As the designers of this system, this is something that we’re acutely aware of and understand. We know that we know how much memory the system can use before it moves outside of its operating boundaries and bad things happen.

In the bottom-right corner, we have known unknowns. These are things we are generally aware of but don’t necessarily understand. For example, “the system exceeded its memory limit and crashed, causing an outage.” As system designers, memory usage is something we know is important and affects system behavior. We can monitor it in production in order to gather lots of data on it, but just having that data often doesn’t help us to understand why memory is being consumed or even how that data manifests itself as system behavior.

In the top-left corner, we have unknown knowns, which are things we understand but are not completely aware of. This sounds like a strange, almost oxymoron-like categorization, but it’s basically the things that are gut instinct or intuition. It’s often things we know or think we know without even consciously realizing it. For example, “we implemented an orchestrator to ensure the system is always running.” Intuition tells us that if the process isn’t running, the system isn’t available, so we make sure that it gets restarted when something goes wrong. We might, however, be unaware of the unintended side effects of this decision, and it might be based more on theory and conjecture than data.

Which leads us to the bottom-left corner: unknown unknowns. These are the things we are neither aware of nor understand. The events we can’t even predict or foresee happening because if we could foresee them, they wouldn’t be unknown unknowns, they’d be known unknowns. For example, “instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns.” This was an unforeseen consequence of our orchestrator implementation. As a result, we could not have tested for it or looked for it with our monitoring tools. Instead, it’s something that happens, we learn from it, and quickly classify it as a known unknown—something we know to look for going forward.

In a sense, the known knowns are facts, the known unknowns are hypotheses, the unknown knowns are assumptions, and the unknown unknowns are discoveries. Through this lens, the distinction between observability and monitoring becomes clear. Monitoring is about testing hypotheses and observability is about exploring new discoveries. We monitor known unknowns because these are the things we know to look for, but unknown unknowns are, by definition, unpredictable. We cannot monitor them because we do not know to even look for them in the first place! Instead, we ask questions of our systems in order to understand and categorize these unknown unknowns. Observability is the ability to interrogate our systems after the fact in a data-rich, high-fidelity way. Monitoring, on the other hand, is before the fact and much lower fidelity. These are the dashboards and alerts we set up which usually consist of pre-aggregated metrics. This is what I mean by post hoc versus ad hoc. Observability allows us to ask arbitrary questions of our systems, not questions predefined in advance.

With this definition, monitoring is a subset of observability, and observability encompasses many different types of data. For example, things like distributed traces, application logs, system logs, audit logs, and application metrics are all important observability signals. But when we boil it all down, it turns out everything is really just events, of which we want different lenses to view. Some of this data provides context for the event itself, such as logs and metrics, and some of it describes relationships between events, such as traces. It’s important we have a way to collect all this context and store it such that we can query and analyze it using these different lenses. Aggregated metrics alone aren’t enough—they don’t have the granularity nor the context needed. Dashboards are simply answers to specific questions known in advance. Observability needs to go much deeper than this.

In part two of this series, we’ll revisit the concept of an observability pipeline as a tactical approach to implementing observability in a microservice environment. As part of this, we’ll discuss some steps that can be taken to incrementally improve observability while iterating toward this pattern.

Operations in the World of Developer Enablement

NewOps is not a replacement for DevOps, it’s an evolution of it by looking at Operations through the lens of product. It’s what I’ve come to call “Developer Enablement” because the goal is to shift the focus of Ops teams from being masters of production to enablers of production. Through Developer Enablement, teams are enabled—and tasked with the responsibility—to control their own destiny. This extends far beyond just the responsibility of building products. It includes how we build, test, secure, deploy, monitor, and operate systems.

For some, this might come naturally. Many startups don’t have the privilege of siloing up their organizations (although you’d be surprised!). For others, this can be a major shift in how we build software. Especially in large, established organizations with more specialized roles, responsibilities can be so siloed people aren’t even aware they’re happening. Basic “ilities” like scalability, reliability, and even security become someone else’s responsibility. “Good Operations” means no one even knows you’re there, unless something goes wrong.

So when this is turned on its ear, and these responsibilities are placed on the dev team’s shoulders, how do they adapt? In many cases, teams are eager to take on these new responsibilities but also blissfully unaware of what that actually entails. DBAs are a good example of this. Often a staple of enterprise IT Ops, DBAs are tasked with—among other things—installing and patching DBMSs, performing backups, managing HA and DR strategies, balancing database workloads, managing resources, tuning performance, configuring security settings, and monitoring systems. Many of these responsibilities are invisible to developers.

With cloud and Developer Enablement, this can change in profound ways. However, in a typical lift-and-shift, the role of DBAs is widely unchanged. In this case, we’re just running the same stuff in someone else’s data center. There are still databases to be patched, replication to be managed, backups to be made, and so on. But pure lift-and-shifts, at least as an end goal, are largely a misstep. You throw away all that institutional memory—the knowledge and experience you have managing your own data center—for more expensive compute with which you have less experience administering. Things change when we start to rely on managed cloud services. We no longer run our own databases on VMs but instead rely on cloud-managed ones. This is where things become much more grey—but also much more interesting.

Developer Enablement in the Cloud

First, a quick aside. There are two different concepts we’re talking about here: cloud and Developer Enablement (DevOps for brevity). These are two distinct but related concepts. We can “do” DevOps on-prem, just as we can in the cloud. Likewise, we can also do traditional Operations in the cloud, just as we can on-prem. One of the benefits of cloud is it allows us to focus more investment on business-differentiating things, but it also makes implementing DevOps easier for two reasons. First, the cloud provider takes on more operational responsibilities (the stuff that supports—but doesn’t directly contribute to—business value). Second, it provides a lower barrier to self-service infrastructure. This means developers can, of their own accord, provision and manage supporting infrastructure like databases, caches, queues, and other things without a go-between or the customary “throw-it-over-the-wall” approach. This is a key part of Developer Enablement.

In the world of Developer Enablement in the cloud, what is the role of a DBA, or any other Ops person for that matter? When you start to map who is accountable for what, you quickly realize there is far too much nuance to cleanly map responsibilities. Which cloud provider are we talking about? Within that cloud provider, which database offering? Proprietary NoSQL databases like Google’s Cloud Datastore? Relational databases like Amazon’s RDS? Globally-distributed databases like Spanner? How we handle things like HA and DR vary drastically depending on the service and service provider. In some cases, the vendor is entirely responsible, e.g. because the database has built-in replication. In other cases, the customer. Sometimes it’s a combination of both, such as a database that has automated backups which must first be enabled. It’s not as cut and dry as it used to be.

As we push more responsibility onto developers, how do we ensure they are actually tackling all of those responsibilities, especially the ones they might not even know about? How do we implement DevOps responsibly?

The goal of Developer Enablement is not to enable developers by giving them total control and free rein. Instead, it’s to empower them in a way that is “safe” for the business. People often misconstrue DevOps and automation as things that reduce lead times and increase deployment frequencies by simply pulling security out of the process. This is categorically not the purpose of DevOps. In fact, the intention is to improve security by integrating it more deeply and earlier into the process in a more reliable and repeatable way, i.e. “shift left.” Developer Enablement is about providing the tools, automation, services, and standards teams need to do just this.

So when we say we want to implement DevOps and Developer Enablement, we’re not saying we want to hand developers the keys to production with a pat on the back. We’re saying we want to pave a path to production which allows developers to release software in a way that is safe and secure with greater autonomy—because autonomy enables building more reliable software faster. In this world, Operations teams become increasingly Developer Enablement teams because there is simply less stuff to operate. It becomes more about supporting development teams and organizing around products than acting purely as a gatekeeper or service provider. It’s pretty amazing how things start to improve when you align yourself this way.

Responsibilities of Developer Enablement

Those Operations teams still have extremely valuable skill sets however. It’s just that they start to act more in an advisory role than the assembly-line-worker role converting Jira tickets into outputs. For instance, DBAs have deep expertise on the intricacies and operations of various database systems, but when Amazon is now responsible for installing the database, patching it, scaling it, monitoring it, performing backups, managing replication and failovers, and handling encryption and security, what do the DBAs do? They become domain experts and developer advocates. They make sure teams aren’t shooting themselves—or the company—in the foot and provide domain expertise and tooling in a supporting role. When a developer complains about a slow query, they are the ones who can help them identify, understand, and fix the problem. “It’s doing a full-table scan since you’re missing an index,” or “You have a hot partition because you’re using a timestamp as the partition key. Try using a more uniform ID to distribute workloads evenly.” These folks can often help developers better structure their data to improve application performance and scalability.

In addition to this supporting role, these Developer Enablement teams also help ensure dev teams are thinking about all the things they need to be considering. In the case of data, how is encryption handled? HA? DR? Data migrations? Rollbacks? Not that all of these things need to be handled by the teams themselves—again, often the cloud provider has it covered—but simply ensuring that they have been considered and can be spoken to is important. It’s vital to start this conversation early in the development process.

The Three Phases of Development

There are basically three phases of development to consider. There’s the “playground” phase, which is when teams are essentially exploring different technologies. At this stage, there can be little-to-no oversight outside of controlling cloud spend (which is important for when your intern accidentally starts a task bomb before leaving for the weekend). Teams are free to try out new ideas without worrying about production. Often this work happens in a separate “experimentation” cloud project.

Next, there’s the “green-light” phase. The thing being built is going to production, it’s part of the company’s strategic plan, people are talking about it, etc. At this point, we start an ongoing dialogue with the team and provide them with a list of the key things to be thinking about. This should not be a 10-page document. It should be a one-page document hitting the main areas. An example portion of this might look like the following:

  • How do you plan to implement HA?
  • What classifications of data will this system handle and how do you plan to secure that data in transit and at rest?
  • How much traffic do you expect the system to handle and how will you scale it?
  • How will the system handle authentication and authorization?
  • What are the integration points?
  • Who will support the system in production?
  • What is the CI/CD story for the system?
  • What is the testing strategy?

Depending on your company’s culture, this can sometimes be seen as an affront or threat to teams if they’re used to Ops or InfoSec groups gatekeeping. That is not the goal as it’s intended to be in an advisory capacity. This ends up having a couple benefits. First, it gets teams thinking about and planning for key operational items, and second, it uncovers any major gaps early in the process. The number of times I’ve heard someone ask, “What’s HA?” after reading this list is non-zero. The purpose of this isn’t to shame anyone, just to provide a way to start critical discussions between the team and Developer Enablement groups.

Finally, there’s the “ready-for-production” phase. The team is ready to ship what they’ve been building. This is where things get real. Typically, there are a few things that should happen here. When launching a new service or product, there should be a comprehensive review of the system. The team will sit down with a group of their peers, architects, and security engineers and walk them through the system. People hate the dreaded architecture review, so we call it a product technical walkthrough instead.

Operational Readiness and Change Management

About a month or so prior to the walkthrough, the team should be working through an “operational-readiness checklist” which is used to guide the walkthrough. This checklist is much more detailed than the previous one, enumerating items like what the deploy process consists of, configuration management, API versioning, incident-response procedures, system observability, etc. The checklist we commonly use with clients at Real Kinetic is about seven pages long and covers 10 areas: Deployment, Testing, Reliability/Failover, Architecture, Costs, Security, CI/CD, Infrastructure, Capacity/Performance Estimates, and Operations and Support. This checklist is used to probe different areas. If certain areas feel a little weak, this can lead to deeper discussions depending on the importance or severity. If a system is particularly critical to the business or high-risk, this process can veto a release. Having a sign-off process like this makes some people nervous, but it’s important to point out that this should only apply to new launches. It is not a general change-management process. It’s really about helping teams learn about running systems in production and understanding what that takes.

In addition to the product technical walkthrough, we also recommend doing a security assessment for new services. This usually encompasses a vulnerability and threat assessment, risk assessment, pen testing, the whole nine yards. I usually also like to see some sort of load profiling done on the service before putting it in production (though load and chaos testing should ideally be part of the normal development process, not saved for the very end).

When it comes to infrastructure, there’s also the question of how to manage changes. This is where infrastructure as code (IaC) becomes hugely important as it not only provides a way to automate infrastructure changes, but also a means to review those changes. We can treat infrastructure changes in the same way we treat application changes—storing them in source control, doing code reviews on them, running them through static analysis tools, and so forth. Infrastructure changes, like all changes, should go through a code review process. It cannot be overstated how essential code reviews are and how much they benefit your organization. And once again, this is where Developer Enablement comes into play. I recommend IaC changes be reviewed by a Developer Enablement team member. This provides a touchpoint where they can provide domain expertise and ensure changes are within acceptable parameters. If a developer is requesting a change which falls outside those parameters, such as a database instance with 1TB of RAM for example, it requires a conversation and sign-off process.

Conclusion

With Developer Enablement, what used to be Operations becomes primarily a product and advisory team. “Product” in the sense of providing systems and tools that help developers take on more responsibility, from day-to-day development to operations and support. “Advisory” in the sense of offering domain expertise and guidance. Through this approach, we get better alignment by giving engineers end-to-end ownership from development to on-call and improve efficiency by reducing handoffs. This also lets us scale more effectively. Through products and reduced hand-offs, a Developer Enablement group can empower far more engineers than any conventional Ops team could.

The Observability Pipeline

The rise of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. It has also led to new tools and services created to help us support our systems.

Many of the clients we work with at Real Kinetic are trying to navigate their way through this transformation and struggle to figure out where to begin with these solutions. Beau Lyddon, one of our partners, recently gave a talk on exactly this called What is Happening: Attempting to Understand Our Systems (as an aside, Honeycomb’s Charity Majors live-blogged the talk which is worth a read). In this post, I’m going to attempt to summarize some of the key ideas from Beau’s talk and introduce the concept of an observability pipeline, which we think is an essential component in today’s cloud-native, product-oriented world.

Observability Explosion

With traditional static deployments and monolithic architectures, monitoring is not too challenging (that’s not to say it’s easy, but, in relative terms, it’s uncomplicated). This is where tools like Nagios became very popular. When we have only a handful of servers and/or a single, monolithic application, it’s relatively straightforward to determine the health of the system and to correlate system behavior to actual customer or business impact. It’s also feasible to “see inside the box” and get meaningful code-level instrumentation. Once again, tools like AppDynamics and Dynatrace became popular here.

With cloud-native and container-based systems, instances tend to be highly elastic and ephemeral, and what used to comprise a single, monolithic application might now consist of dozens of different microservices and even different instances running different versions of the same service. Simply put, systems are more distributed, more dynamic, and more complex now than ever before—and users have even more expectations. This means many of the tools that were well-suited before might not be adequate now.

For example, the ability to “see inside the box” with intra-process, code-level tracing becomes largely impractical in a highly dynamic cloud environment. By the time you are debugging an issue, the container is gone. This is only exacerbated by the serverless or functions as a service (FaaS) movement. Similarly, it’s much more difficult to correlate the behavior of a single service to the user’s experience since partial failure becomes more of an everyday thing. Thus, many of these tools end up being better suited to static infrastructures where there is a small set of long-lived VMs with a limited number of services. That’s where most of them originated from anyway. Instead, service-level distributed tracing becomes a key part of microservice observability, as does structured logging. With this shift in how we build systems, there has been an explosion in new terms, new tools, and new services.

Of course, in addition to tools, there are also the cultural aspects of monitoring and incident response. Many companies traditionally rely on an operations team to monitor, triage, and—in some cases—even resolve issues. This model quickly becomes untenable as the number of services increases. A single operations team will not be able to maintain enough context for a non-trivial amount of services and systems to do this effectively. This model also leads to ineffective feedback loops if engineers are not on-call and responsible for the operation of their services—something I’ve talked about ad nauseum. My advice is to push ownership of systems onto the teams who built them. This includes on-call duty and general operational responsibilities. However, in order for development teams to take on this responsibility, they need to be empowered to act on it. With this model, which I’ve come to facetiously call NewOps, the operations team becomes responsible for providing the tools and data teams need to adequately operate their services. Some organizations take this even further with dedicated observability teams.

Observability” is a term that has emerged recently within the industry as a more nuanced take on traditional monitoring. While monitoring tends to focus more on the overall health of systems and business metrics, observability aims to provide more granular insights into the behavior of systems along with rich context useful for debugging and business purposes. Put another way, monitoring is about known-unknowns and actionable alerts; observability is about unknown-unknowns and empowering teams to interrogate their systems.

In a sense, observability encompasses all of the telemetry needed to gain insight into the behavior and state of a running system. This includes items like application logs, system logs, audit logs, application metrics, and distributed-tracing data. These are all valuable signals for diagnosing and debugging production issues, especially in a microservice environment where containers are largely ephemeral. In this environment, it is no longer practical to SSH into a machine to debug a problem or tail a log file. Distributed tracing becomes particularly important since a single application transaction may invoke multiple service functions.

Observability Pipeline

It’s important that you can really own your data and prevent it from being locked up inside a single vendor’s solution. Likewise, it’s important that data can be made available to the entire enterprise (or, in some cases, made not available to the entire enterprise). Since the number of tools and products can be quite large, tool and data needs vary from team to team, and the overall amount of data can be overwhelming, I suggest a decoupled approach. By building an observability pipeline, we can decouple the collection of this data from the ingestion of it into a variety of systems.

To illustrate, if we have log data going to Splunk, metrics and traces going to Datadog, client events going to Google Analytics and BigQuery, and everything going to Amazon Glacier for cold storage, the number of integrations quickly becomes large and grows for every additional service we add. It also probably means we are running an agent for many of these services on each host, and if any of these services are unavailable or behind, our application either blocks or we lose critical observability data. With the amount of data we end up collecting, it’s not uncommon to spend more time collecting it than actually performing business logic unless we find a way to efficiently get it out of the critical path.

Finally, as vendors in this space converge on features (which they are), differentiating capabilities are released (which they will need), or licensing/pricing issues arise (which they do), it’s likely that the business will need to add or remove SaaS solutions over time. If these are tightly integrated, this can be difficult to do. An observability pipeline, as we will later see, allows us to evaluate multiple solutions simultaneously or replace solutions transparently to applications and infrastructure. For example, perhaps we need to switch from Splunk to Sumo Logic or Datadog to New Relic or evaluate Honeycomb in addition to New Relic. How big of a lift would this be for your organization today? How easy is it to experiment with a new tool or service?

With an observability pipeline, we decouple the data sources from the destinations and provide a buffer. This makes the observability data easily consumable. We no longer have to figure out what data to send from containers, VMs, and infrastructure, where to send it, and how to send it. Rather, all the data is sent to the pipeline, which handles filtering it and getting it to the right places. This also gives us greater flexibility in terms of adding or removing data sinks, and it provides a buffer between data producers and consumers.

There are a few components to this pipeline which I will cover below. Many of the components can be implemented with existing open source tools or off-the-shelf services, so those I will touch on only briefly. Other parts require more involvement and some up-front thinking, so I’ll speak to them in more detail.

Data Specifications

Structured logging is hugely important to aiding debuggability. Anyone who’s shipped production code has been in the situation where they’re frantically trying to regex logs to pull out the information they need to debug a problem. It’s even worse when we’re debugging a request going through a series of microservices with haphazard logging. But structured logging isn’t just about creating better logs, it’s about creating a data pipeline that can feed the many tools you’ll need to leverage to understand, debug, and optimize complex systems, meet security and compliance requirements, and provide critical business intelligence.

In order to monitor systems, debug problems, make decisions, or automate processes, we need data. And we need the systems to give us data to provide necessary context. Aside from structured logging, one piece of advice we give every client is to pass a context object to basically everything. This context includes all of the important metadata flowing through a system—usually IDs that allow you to correlate events and piece together a story of what’s happening inside your system: user ID, account ID, trace ID, request ID, parent ID, and so on. What we want to avoid is the sort of murder-mystery debugging that often happens. A lone error log is the equivalent of finding a body. We know a crime occurred, but how do we piece together the clues to tell the right story? Observability—that is, being able to ask questions of your systems and truly explore them—requires access to pre-aggregate, raw data and support for high-cardinality dimensions.

The way to decide what goes on the context is to think about the data you wish you had while debugging an issue (this also highlights the importance of developers supporting their own systems). What is the data that would change the behavior of the system? Some examples include the user (or company), their license, time, machine stats (e.g. CPU and memory), software version, configuration data, the incoming request, downstream requests, etc. Of these, what can we get for “free” and what do we need to pass along? “Free” in this case would be things which are machine-provided, such as memory and CPU. The data we can’t get for free should go on the context, typically data that is request-specific. This context should be included on every log message.

This brings us back to the importance of structuring your data. To do this, I encourage creating standard specifications for each data type collected—logs, metrics, traces, events, etc. You can take this as far as you’d like—highly structured with a type system and rigid specification—but at a minimum, get logs into a standard format with property tags. JSON is fine for the actual structure, but be sure to version the spec so that it can evolve. For application events, one pattern that can work well is to create an inheritance structure with a base spec that applies across services (e.g. user context and tracing information are the same) and specialized specs that can be defined by services if needed. Just be careful not to leak sensitive data here—this is one area where code reviews are vital.

Specification Libraries

A key part of empowering developers is providing tools that align the “easy” path with the “right” path. If these aren’t aligned, pain-driven development creates problems. In order for developers to take advantage of structured data, specifications aren’t enough. We need libraries which implement the specs and make it easy for engineers to actually instrument their systems. For logging, there are many existing libraries. Just Google “structured logs” and your language of choice. For tracing and metrics, there are APIs like OpenTracing and OpenCensus. In practice, implementing the spec might be a combination of libraries and transformations made by the data collector described below.

Data Collector

This component is responsible for collecting data from hosts, containers, or other sources and writing it to the data pipeline. It may also perform transformations or filtering of data. A couple popular open source solutions for this are Fluentd and Logstash. Typically this runs as a sidecar or agent on the host, and data is written to stdout/stderr or a Unix domain socket, which it then pushes to the pipeline.

Data Pipeline

This component is a highly scalable data stream which can handle the firehose of observability data being generated and has high availability. This also provides a buffer for the data and decouples producers from consumers. Off-the-shelf solutions include Apache Kafka, Google Cloud Pub/Sub, Amazon Kinesis Data Streams, and Liftbridge.

Data Router

This component consumes data from the pipeline, performs filtering, and writes it to the appropriate backends. It may perform some transformations and processing of the data as well, but generally any heavy processing should be the responsibility of a backend system (e.g. alerting or aggregations). This is where the data specifications come into play. The data type will determine how routers handle incoming data, e.g. routing log data to Splunk and cold storage, routing traces to Google Stackdriver, and routing metrics and APM data to New Relic.

Like the specifications and libraries, this is a component that requires some more involvement. The downside of moving away from agent-based data collection is we now have to handle routing that data ourselves. The upside is most vendors provide good APIs and client libraries which make this easier.

Since this is typically a stateless service, it’s a good fit for “serverless” solutions like Google Cloud Functions or AWS Lambda.

Piecing It All Together

Putting all of these pieces together, the observability pipeline looks something like the following:

One caveat I want to point out is that this is not something you need to build out from day one. At most of the companies where we’ve implemented this, it was something that evolved over time. For instance, with some of the clients we work with who are attempting to move to the cloud and adopt DevOps practices, we typically would not advise making a significant upfront investment to architect this pipeline. This is an ideal goal to work towards that will become increasingly important as the amount of services, traffic, and data scales. Instead, architect your systems from the beginning to be able to adopt this approach more easily—use structured logging, keep collection out-of-process, and use a centralized logging system.

For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. Unlocking this data can also be a huge win for the business. It provides a layer of abstraction that allows you to get the data everywhere it needs to be without impacting developers and the core system. Lastly, it allows you to change backing data systems easily or test multiple in parallel. With the amount of data and the number of tools modern systems demand these days, the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.